
Statistical Sampling for Verifying Database Backups

A DBA's huge workload can start to threaten best practices for data backup and recovery, but ingenuity, and an eye for a good tactic, can usually find a way. For Tom, the revelation about a solution came from eating crabs. Statistical sampling can be brought to bear to minimize the risk of failure of an emergency database restore.

I often hear about database administrators who focus on having a good backup strategy, but most senior database administrators understand the need for a good recovery strategy, of which taking backups is only one part. For all that talk of recovery strategy, I have yet to see anyone discuss the importance of building a process that actually verifies that you can recover your data.

A good recovery strategy would include:

  1. A consistent backup schedule
  2. A test restore of the databases
  3. A DBCC CHECKDB to look for corruption

So why is it that so few database administrators build a recovery strategy that includes routine testing of their backups to ensure they can be recovered? I think there are two main reasons:

Firstly, most are simply unaware of the need to have such a process in place. However, they will most certainly put such a process in place immediately after their first experience of a failure to perform a requested restore.

Secondly, there is the practical problem of volume. With each passing week, month, and year most database professionals are being asked to take on more and more responsibilities. In my shop, right now, I have over 3,000 databases under my care. The idea that I could restore each of them, every day, and run a DBCC CHECKDB is ludicrous.

So what do you do when you have a large number of backups to verify, and insufficient time to do them?

Enter the King Crab


I have a friend who makes a living with King Crab. His job is to fly to remote locations such as Alaska and Norway and bid on thousands of pounds of shellfish. Besides the wonderful fishing stories that he has to share, he also has a few dozen pounds of fish on hand at any given time. Eating dinner at his home is always an affair to remember. The last time I was there, when we ran out of crab at the table, he simply went downstairs for another five pounds as easily as you or I would grab another stick of butter.

I asked him one day about the details of his job. For example, how did he test thousands of pounds of fish at a time?

“Sampling,” he replied. “Since I cannot possibly test each crab I must rely on taking a sample.”

It was usually the samples that he took home for dinner. See, when you sample a crab you ruin it for any possible market value, but you don’t ruin the crab for eating. You don’t want to sample too many but you need to sacrifice a certain amount in order to place an appropriate bid on the lot. So how much is the right amount? Well, my friend uses a chart and a system of tiered sampling that he has put together over the years (which I cannot share with you here, unfortunately).

I recalled the concept of sampling from my days as a young mathematician. I also recalled that it was not the most exciting topic for discussion, as it often dealt with unfamiliar ideas such as pulling marbles from an urn, or taking a number of light bulbs out of a bin containing thousands. But here was my friend applying a mathematical concept in a way I never thought imaginable, or so delicious. And then I suddenly realized that I could make use of sampling as well, to help me design an effective test of my ability to recover in the event of a disaster. Armed with this revelation, I set about refreshing myself on some basic concepts: random sample, confidence level, and confidence interval.

Random Sample

Not to be confused with the population itself, your sample is the subset of the population you actually observe. In my case, the ‘population’ is roughly 3,000 database backups. For my friend it is a few tons of King Crab. But no matter what your population size, the hardest part is ensuring that your sample is truly random. You simply must have a process in place to select population members at random.

One of the most famous examples of this has to do with George Gallup predicting the 1936 Presidential Election with greater accuracy than other respected polling sources of the time. Gallup knew that a small sample of truly random voters would serve well for predicting the outcome of an election. The problem for Gallup was getting a truly random sample: in 1936, for example, you could not resort to calling people listed in the phone book to conduct your survey, because a majority of Americans did not own phones.

Confidence Level

The confidence level expresses how confident you are that the results of the tests you perform on the sample apply to the entire population, and it is commonly set to 95%. It is your confidence that your results did not occur by mere chance. In fact, it is generally understood that most surveys and polls are conducted with an assumed confidence level of 95%. Why that number? Why not 100%? Well, you wouldn’t think of sampling each and every crab, would you?

In a normal distribution of data, roughly 95% of the data points will fall between a lower and upper limit that is calculated according to…did I just catch you nodding off? OK, forget the basic stats class then, let’s focus on the real basics here.

At a 95% confidence level, you are also saying that the chance of the sample giving a different result from the entire population is 1 in 20. In other words, going around saying you are 95% confident in your results just means that you have observed something happen with some frequency; it should never imply that you are 100% correct in your assertions unless you test the entire population.

Confidence Interval

This is your margin of error, usually presented as a +/- alongside a number. If I said that I am 95% certain that 72% of all Americans enjoy crab fishing, +/- 3%, then that last part (+/- 3%) is the confidence interval.

Which means you can just think of things this way instead: “I am 95% certain that 69%-75% of all Americans enjoy crab fishing.”

Why Are You Boring Me With This Today?

What’s my point? Where am I going with all of this? Those are great questions.

As database administrators, our number one priority is to be able to recover data in the event of a disaster. Now, when was the last time you did any verification of your database backups? Not that your backups ran, but that the files they produced can actually be restored and recovered? Last week? Last month? Last year? And how many databases should you use as a random sample to be 95% confident that all your database backups can be restored and recovered?

Some people in smaller shops may build a process that verifies each and every database can be recovered, but that process becomes rather time-consuming (as well as expensive) when you have thousands of files to verify.

How would you conduct such a test in your shop? For my 3,000 databases I will need to turn to sampling in order to build a verification process. But how many restores should I perform? I could go with a tiered sampling approach, the same as my friend, but first I want to perform some basic calculations to determine what a true sample size would be for my population of 3,000 databases.

BI Power

No, this has nothing to do with Business Intelligence, this is a different kind of BI, as in bi-nomial distribution. You use a binomial distribution when you have two possible outcomes. In this case, either the backup file can be restored, or not. So we get to work using that information, along with the fact that we have a total population of 3,000 databases.

But the problem with this is that we don’t really know the error rate. We assume that we can recover from any one of the 3,000 backups at any time. But if we were pressed to identify a failure (or defect) rate, what would we choose? One percent? Less than one percent? It would seem that what we really want to do is calculate the sample size needed in order to satisfy our assumptions. After all, if we are going to guess at our success rate anyway, we might as well figure out how many backups we should have in our sample.

This is actually easier than you might think. Sure, there is some math involved, but not much, and I’ll make it easy for you to understand.

  1. Take your entire population, call it n, and set this number aside (for us, it is n=3,000).
  2. Take your confidence interval, call it E, and set that number aside (for us, we will go with a very small margin of error, using +/- 1%).
  3. Now, take your rate of failure, call it p, and set that number aside (for us, we will assume we are nearly perfect and go with p=.001, which means one failure in every one thousand attempts).
  4. Lastly, we will assume that we have a 95% confidence level, otherwise I will have to teach you how to use a Z-table and frankly we don’t have enough time for that now.

OK, here is the secret sauce you can use to estimate this for yourself.

Take your value for p and multiply it by (1-p). In our case, this is (.001)x(.999)=.000999

Take the number 1.96 and divide it by your confidence interval. In our case, this is (1.96)/(.01)=196. (Why 1.96? Because we assumed a 95% confidence level. Either you trust me on that number or you learn to read the Z-table for yourself; trust me, it’s better this way.) Now, take that number (196) and multiply it by itself. So, (196)x(196)=38416.

Take your two numbers, .000999 and 38416, and multiply them together. In our case, we get 38.377584, and we round up to the next highest integer and say that our necessary sample size is 39. For those of you who want some T-SQL to do the math for you, a short script along these lines (with the assumptions above hard-coded into my own variable names) will do the job:
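
  -- A sketch of the sample-size arithmetic described above; adjust the
  -- variables to match your own population, margin of error, and failure rate
  DECLARE @n INT   = 3000;   -- total population (set aside; this simple formula does not use it)
  DECLARE @E FLOAT = 0.01;   -- confidence interval (margin of error, +/- 1%)
  DECLARE @p FLOAT = 0.001;  -- assumed failure rate (one failure in one thousand attempts)
  DECLARE @z FLOAT = 1.96;   -- z-value for a 95% confidence level

  SELECT CEILING(@p * (1 - @p) * POWER(@z / @E, 2)) AS SampleSize;  -- returns 39 for these values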

I should mention that typically when someone does not know their failure rate they use a default value of 50%, or 0.5, because that would give you the highest possible sample size as an initial working set. Again, we are assuming that our failure rate is much lower because…well…because if it wasn’t then we would probably be seeing frequent errors all week long as we restore production databases down to test and development servers.

Finding Random Backups

Using the numbers from above we have deduced that we need to select 39 random database backup files to use for the process of testing our ability to recover. Before I can even think about getting a random sample selected I need to think about how I want to actually conduct my testing.

One way is to do a restore to the existing development or test servers. However, that may not always be possible due to one limitation or another: two common examples are when you are not allowed to overwrite an existing development or test database, and when you do not have enough space on disk to keep two distinct copies of the same database.

The answer, I believe, lies with virtualization. I would recommend that you find a dedicated server to use just for this purpose; that way you don’t have to worry about interrupting anyone’s work. While disk space could possibly be an issue, you really should not be concerned too much about speed, as long as it doesn’t take all day to get your restores done, of course! And you do not need to keep all of the databases active; you can drop them right after the restore completes. Remember, we are only checking to see if the restore can complete. If someone wants to verify the actual data inside, then you’ll need to build that requirement into your overall process.
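
As a rough idea of what each individual test might look like, here is a minimal sketch (the database name, logical file names, and file paths are all illustrative) that restores one sampled backup onto the verification server and then drops the copy once it comes online:

  -- Restore one sampled backup onto the dedicated verification server,
  -- then drop the copy as soon as the restore completes successfully
  RESTORE DATABASE SalesDB_Verify
      FROM DISK = N'\\backupshare\SalesDB\SalesDB_Full.bak'
      WITH MOVE 'SalesDB'     TO N'D:\Data\SalesDB_Verify.mdf',
           MOVE 'SalesDB_log' TO N'E:\Logs\SalesDB_Verify.ldf',
           RECOVERY, REPLACE;

  DROP DATABASE SalesDB_Verify;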

If we assume that you can use a dedicated server, the next question becomes: how do I randomly select my databases? The answer lies in a two-step process. The first step is to get a list of your databases. This can be done in a myriad of ways. My preference would be either to write a query against the Operations Manager database or to query all of the servers I have registered in my Central Management Server. I will leave it up to you as to how you want to put together your list. The end result should be three columns: server name, database name, and a unique number from 1 to n, where n is the total number of databases (or put your list into Excel and just use the row number).
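
For a single instance, a query along these lines will produce those columns; the filter and ordering are just one way to do it, and you would still need to combine the results from all of your servers (and renumber the combined list, or lean on the Excel row number) to get one master list:

  -- Number the user databases on one instance
  SELECT
      @@SERVERNAME AS ServerName,
      name         AS DatabaseName,
      ROW_NUMBER() OVER (ORDER BY name) AS RowNumber
  FROM sys.databases
  WHERE database_id > 4;  -- skip the system databases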

The second step is to pick numbers at random. That part is rather easy to do with a few lines of T-SQL; a sketch along these lines will do the job:
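
  -- A sketch: pick ten random row numbers between 1 and 3000
  -- (duplicates are possible; re-run or de-duplicate if you need distinct picks)
  DECLARE @picks TABLE (RowNumber INT);
  DECLARE @i INT = 1;

  WHILE @i <= 10
  BEGIN
      INSERT INTO @picks (RowNumber)
      VALUES (CAST(RAND() * 3000 AS INT) + 1);  -- random integer from 1 to 3000
      SET @i = @i + 1;
  END

  SELECT RowNumber FROM @picks;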

The above code will generate ten random numbers between 1 and 3000. Feel free to modify it for your own needs. In my case I would need to loop 39 times, for example.

And if you want to get really fancy, then throw in the element of time. Everything I have written about so far essentially concerns your ability to recover from a database backup file that still exists on disk. If you wanted, you could find a way to randomly select a database as of some point in time, say three months prior, and see whether you are able to recover it.
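
One way to exercise that is a point-in-time restore: after restoring the full backup WITH NORECOVERY instead of RECOVERY, roll the log forward to the chosen moment. The log file name and STOPAT time below are purely illustrative:

  -- Roll the restored copy forward to a specific point in time, then bring it online
  RESTORE LOG SalesDB_Verify
      FROM DISK = N'\\backupshare\SalesDB\SalesDB_Log.trn'
      WITH STOPAT = '2010-06-01 03:00:00',  -- the moment you want to prove you can reach
           RECOVERY;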

And if you want to get really, really fancy, then before you drop the database you just restored you would include a step to run a DBCC CHECKDB command, just to be safe.
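
Something as small as this, slotted in just before the DROP (again using the illustrative database name from the earlier sketch), covers that step:

  -- Check the restored copy for corruption before dropping it
  DBCC CHECKDB (N'SalesDB_Verify') WITH NO_INFOMSGS, ALL_ERRORMSGS;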

How Many Do I Really Need?

If you recall, my friend used tiered sampling to test the crab that his company intended to purchase. What that means is that if he had 3,000 pounds of crab, he would not think about sampling 39 pounds, for two reasons. First, because that would take him too long; second, because every pound he samples is a pound that cannot be sold, meaning someone loses money.

The tiered approach essentially means that you can start with a smaller sample size. If you find the failure rate to be higher than acceptable, you go back and collect a larger sample for a second round of testing. You could decide to go with restoring ten random databases and, if you find that one fails, increase your sample size to 25.

What If I Get Errors Again?

In a tiered approach, if you get more errors than expected, you grab a new set. If you find more errors in that new set, then you reject all of the backups. By that I mean you will want to rerun your backups; but more importantly, you’ll first need to know exactly why they failed to begin with. For example, you may find that you have an issue with just one particular server, in which case you may only need to rerun the backups for that server, as opposed to all servers. But the fact remains that you will need to do some investigation into the root cause of the failures.
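
If you log the outcome of each test restore somewhere, the tiered decision itself is easy to script. Here is a sketch that assumes a hypothetical RestoreTestLog table with a TestRound column and a Succeeded flag; swap in however you actually record your results:

  -- Tiered decision sketch (RestoreTestLog is a hypothetical logging table)
  DECLARE @FirstTierFailures INT;

  SELECT @FirstTierFailures = COUNT(*)
  FROM RestoreTestLog
  WHERE TestRound = 1 AND Succeeded = 0;

  IF @FirstTierFailures = 0
      PRINT 'First tier clean - no further sampling needed this cycle.';
  ELSE
      PRINT 'Failures in the first tier - draw a larger second-tier sample and investigate the failed backups.';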

Summary

Testing your database recovery process is not something that should be overlooked or ignored by anyone, in any sized shop. As a database administrator you are going to be relied upon to recover data. In order to achieve that goal you will find yourself focusing on a good backup strategy first, followed by your recovery plan. All too often people tend to forget about putting in a series of controls in order to verify that their processes are working as expected.

For shops with hundreds or thousands of databases, you can use sampling techniques to help you be confident that you are able to recover any database that you administer.
