It’s a recovery strategy, not a backup strategy

Plan A, Plan B, Plan C
Comments 0

Share to social media

I’ve talked about it before; you shouldn’t have a backup strategy, you should have a recovery strategy. I can’t possibly care if my backups succeed if I’m not bothering to test that they can be restored. And if they can’t be restored then, both technically and practically, I don’t have backups.

In one of the systems I manage, they built a very simple “test restore” process long before I became involved. Every night, it would pull the full backup for each database, restore it on a test system, and run DBCC CHECKDB against it. It would alert on any failure, of course, but the primary purpose was to always be confident that the backups could, in fact, be restored.

The process could afford to be simple because it assumed the following would always be true:

  • All databases took full backups every night.
  • Each database was backed up to a single .bak file named <DatabaseName>.FULL.<YYYYMMDD>.bak.
  • Each database we were restoring this way consisted of a single data file (called DataFile) and a single log file (LogFile).
  • All databases were backed up to the same location (and in a subfolder exactly matching the database name), and all databases would always be able to fit in that location.
  • The biggest database, once restored, would always fit on a single drive on the test system.
  • All databases could be restored and CHECKDB‘d in under 24 hours.

There is a table that tells the code which databases to restore:

And another table to tell the code, for example, where to find backups, and where to move data/log files on restore:

Then the (vastly simplified) code looked something like this:

I’ll let you guess how many of those assumptions remain facts today. If you guessed zero, you’d be right. Life got complicated, the restore system is now also being used for other purposes, and source databases have grown significantly. I’d have to help adjust the code to deal with those changes, and this post is about how.

Here is their new reality relative to the “truths” above:

  • They started performing full backups weekly, with differential backups in between.
  • They started striping backups to multiple files.
  • Many databases had additional files and filegroups; even those that didn’t started having logical file names straying from the existing convention.
  • Different databases had to be backed up to different locations.
  • The biggest database outgrew the D:\ drive on the test system.
  • The biggest database also squeezed out all the others in terms of duration, so that they couldn’t all be backed up and CHECKDB‘d within a calendar day.

Some things did remain true, though:

  • They always produce backups WITH INIT, so we never have to figure out which file (backup set) we need within any given .bak file.
  • They don’t use features like FILESTREAM, FILETABLE, or In-Memory OLTP, so we never have to consider those differences. They do use full-text search, but there are no dedicated files for full-text catalogs involved in backup/restore testing.
  • They encrypt backups, but we’re going to leave that complication out for now.

Also, while backups might be striped to 8 or 16 files (or any number, really), they produce file names in a predictable format, e.g. <DatabaseName>.FULL.<YYYYMMDD>.<file number>.bak. Backups are generated using Ola Hallengren’s solution, so this is easy to do by passing in the right @FileName and @NumberOfFiles arguments:

We actually store that pattern in the (distributed) config table too, so we can change it in one place instead of in every SQL Server Agent backup job on every instance.

Similarly, I created table types for things we do often enough that we don’t want to define them over and over again. Because the test server is on a different subnet, and because the backups come from servers all over the network, we can’t always query msdb on a source instance to get backup information about a given database. So, instead, we rely on looking at the folder and finding the most recent file, then examining the output of these RESTORE commands:

RESTORE LABELONLY (for determining how many stripes exist for a given backup):

RESTORE FILELISTONLY (for determining the logical data and log files present in the backup):

We don’t need all of those columns, of course, but there’s no way to say:

Next, we added columns to the SourceDatabases table to indicate where each database sends its backups and on what day of the week the fulls run. Then we used snake sorting to distribute the backups and restores across days of the week as evenly as possible. Let’s say we have this:

With the table types in place and the knowledge of where and when to find backups, we could make the code a lot more dynamic and determine exactly how to restore each database – even if it has been striped to multiple backup files and contains multiple data files. If there are multiple data files, we can alternate them onto the two drives we have available – even though one drive is supposed to be for tempdb, it can serve as a temporary workaround as it has plenty of available space.

Imagine DB_A has 4 data files and a log file, and is backed up to 8 stripes, this command would restore the database using the following command (though maybe not quite as pretty, formatting-wise):

This seems like complicated code, but we think it is worth the investment because we don’t just want a backup strategy. We want to always be confident that we can restore from our backups. And the only way to be truly confident is to always be doing it – and by always improving this process, it makes us that much more prepared for dealing with a disaster recovery situation when it happens for real.

Future considerations

The code could, in theory, determine in advance if a database has grown enough to come close to or even exceed the available space of both available drives. In that case, maybe it would be time to add some storage (if possible) or move these operations to a bigger server.

We also want to add into the mix occasional full + differential + log restores, since the current process only really tests full backups and not necessarily that we could recover to a specific point in time if we had to. This would complicate the code slightly – we’d have to identify the latest differential, and all subsequent log backups (or at least the first one), in order to generate the right sequence of RESTORE commands. But, overall, the logic would remain unchanged.

And we may later have to change it if we can no longer test all the databases in a week. Maybe we only CHECKDB any given database every other week, or distribute all the databases across two weeks, or only test each database monthly, or build additional test servers and split the work that way. That’s not a tomorrow or next week problem, but it’s coming someday.

Further reading

Some other information, particularly if you get snagged on an older version and need to swap in alternative functions: