7 Preventable Backup Errors

The loss of a company's data is often enough to put the company out of business; and yet backup errors are generally avoidable with the application of common sense rather than deep technical knowledge. Grant digs into memories of his long experience of giving forum advice, to come up with the most easily preventable backup errors.

The whole idea behind taking a backup of your database is to protect the business from loss. But what if you lose the backup? How much protection do you have now? All sorts of different events can cause you to lose your backup or for your backup to be corrupted. Many of them are completely random and outside your control. But there are a few, fairly common, events that are completely preventable through standard best practices.

The most important thing you can do for your backups is to test them by restoring them, but that’s a different discussion entirely (as a matter of fact, I wrote about it here).

File Backups

The Scenario: There’s a system outage. Someone has dropped a table or deleted massive amounts of critical data. It’s the first time it’s happened on any of the systems that you’re responsible for. You know that the systems team set up some sort of backup for all the files on your production servers. That’s a relief. So, you go to get the .MDF and .LDF files that have been backed up… Only they’re not there, or, they are there, but when you try to restore the database they don’t work.

What Happened: This frequently happens in systems where you have no one responsible for the database server at all or a really new data professional. It’s not common knowledge out in the real world that SQL Server requires you to run a particular type of backup. A file is a file, right? Yes, but, SQL Server takes out locks on all the database files. This can cause a file backup system to skip the database files. It will usually raise an error, but after two or three weeks of looking at the same error, people frequently just turn it off, skip those files, whatever, but you don’t have a backup of your database files, at all.

Some file backup systems can even figure out that a file is locked, but back it up anyway. Now, you’ll have your database files, but you still won’t be able to use them. First, you can’t run a RESTORE operation from database files; you can only ATTACH them to the server. So, you think, fine, I’ll just ATTACH them. But then you can’t. The reason for this is that the SQL Server engine manages transactions, and transactions control how data gets written to the disk. It’s entirely possible for your backup to capture the files but with uncommitted, half complete, transactions, and you can’t rollforward or rollback in an ATTACH unless you did a DETACH first, so that SQL Server can clean up the open transactions. These file backups are useless.

How Do You Prevent It: SQL Server has a method for creating backups, built right into the product, BACKUP DATABASE. Not only will it create a file (or files) that you can use to RESTORE the database from, but it also does it in a way that figures out all the transactions, so you don’t have to worry about that either. This introduction to BACKUP was published for SQL Server 2005, but is still largely applicable to 2008R2. This is the preferred mechanism for creating backups within SQL Server.

Drive Space

The Scenario: Suddenly, all the backups are failing. You’ve planned ahead and you have an alert that fires from the SQL Agent job that runs your backups, but everything is failing. Looking at the error log or in the error message you see “Insufficient disk space for backup” or some similar message.

What Happened: You’ve run out of space on the drive where you have your backups. It happens a lot. It’s extremely common. The causes are sometimes more difficult to nail down. The simplest cause is the one you have the least amount of control over. The database grew. Because of this growth, it needed more space for the backup. Other causes you can do things about. Some people like to create a new backup file for every backup, putting a date and time into the file name along with the database name so that it’s easy to tell when the backup was taken. But they don’t always take as much care with creating or maintaining a clean-up script that removes backups after a few days.

How Do You Prevent It: You have to monitor your drive space. You need to also monitor your database sizes. If they’re growing, so will your backups. You also need to be sure of your backup cleanup process. I’ve seen these fail so often, it’s worth putting another check in place that validates the age of backups and either sends you an alert or sends you an alert after it deletes the old backup files. Just test these processes so that you’re not deleting newer backups that you might need immediately, and be sure that you have older backups available from some off-site storage system.

Other than that, the best prevention here is to keep an eye on your disk space. Not only should you worry about the amount of space free, but you need to be cognizant of the rate of growth.

Transaction Log is Full

The Scenario: You get an alert that a transaction cannot complete because there is insufficient space in the log. You may also get an associated alert that the drive where the log is stored is full. The database is still allowing SELECT queries to run, but anything that modifies data is completely locked and the users are starting to notice. As an emergency measure, you run a full backup, which either fails because it can’t write to the log, as it needs to, or nothing is fixed. You then start trying to shrink the file, again, with no good results. You’re considering restarting the server.

What Happened: If you make no changes to your system, when you first start creating database, they will be created with the Recovery Model set to Full. There are three recovery models, Full, Bulk-logged, and Simple. To all intents and purposes, Full & Bulk-logged are the same, so for this short article, we’ll treat them as such. When your database is in Full Recovery, it’s operating under the assumption that you would like to restore it to a point in time in the event you need to restore the database. This means that it’s maintaining the log file with all your transactions, waiting for you to run a log backup. And you must run a log backup, independently from the full database backups that you’re already running. The full database backup will not affect the log. So you’ve been running without taking a log backup and the log as filled, if it’s a fixed file size, or grown and expanded to fill all the drive space available if it’s on auto-grow and you weren’t monitoring your drive space.

How Do You Prevent It: To resolve this situation, you must do one of two things. First, you can switch your Recovery Model to Simple. This means that transactions are still written to the log, but after being written, when a checkpoint occurs in the server, they are purged. Second, you need to set up a separate set of log backups. These need to run on a regular basis and the definition of regular depends on two things, the quantity of your transactions, and the limits of your business recovery requirements.

The thing about switching the recovery to simple that is frequently over looked is that it takes away your ability to restore your database to a point in time. The whole idea of having transaction logs that can be backed up is so that you can restore your database to a moment right before an outage. But in order to do that, you must have a good set of log backups and, where possible, a tail log backup. The tail backup is one that you take prior to bringing the machine offline, assuming it’s online. With these, you can restore a system right up to the last log backup, or, with the tail log, right up to any particular moment.

The question you need to ask the business prior to changing the recovery model to simple is, “How much data can we afford to lose?” If the answer is a day or more, switch your recovery to simple and this problem should go away. If the answer is, less than ½ a day, then you’re going to need log backups. How frequently they get run is a combination of the needs of the business and the number of transactions through your system. If the business wants to try to keep any loss to less than 10 minutes, that’s how often you’ll need to run the log backups. If, on the other hand, they’re ok with ½ hour, then you might be able to run your log backups that frequently. But, it’s possible that the transaction throughput will cause the logs to grow more than you would like, so you may need to adjust that down to every twenty minutes. It’s hard to say without seeing your system. Plan for some adjustments to the frequency until you get it locked in. Also, since you’re adding to your backup load, make sure you have a disk with plenty of space to place the backups on. You’ll need to keep all the log backups that occur from one full backup to the next full backup. After that, you can get rid of the previous set of logs and then start taking log backups for the next period.

Backing Up Across the Network

The Scenario: You’ve got backups in place, but they’re running extremely slowly. They’re running so slowly that sometimes you get timeout errors and the logs fail. Other times it’s running so slowly that SQL Server is affected and you have to kill the backup process in order to speed up the system. In this same situation you may also be seeing excessive load on your network.

What Happened: There can be many causes for this, but one of the most frequently encountered is running the backups across the network. Backing up to a local disk or to a SAN through a dedicated fiber channel is usually just as fast as the disk can handle it. But when you backup across the network, you run into all sorts of contention and bottlenecks, competing with Twitter, Facebook and everyone’s Words With Friends games. That contention causes backups to run slowly.

How Do You Prevent It: Don’t backup across the network. I realize it can be more difficult than that, especially if you’re in a crunch for disk space and there’s no budget for more. But there’s really little you can do about it. Where possible, backup to local disks. If space is really an issue, backup locally, then copy the file across the network. This two-step process might make for a longer overall backup routine, but the backups themselves are more likely to complete successfully, which is the important part.

Only Have Differential Backups Available

The Scenario: You’ve been happily running a differential backup each day. You’ve never hit a failure and everything is good. Now you’ve had the inevitable outage and you go to your differential backup and run a restore. Only thing is, you get an error: The log or differential backup cannot be restored because no files are ready to roll forward.

What Happened: Differential backups work only in combination with a full backup. Sometimes the full backup is lost and people attempt to restore the differential without. Or, sometimes there’s just a misunderstanding of how the differential backups work. It’s also possible that a full backup was done prior to running the differential, but the database was not left in the recovering state necessary to apply differential and log backups. Any one of these issues could have been the cause.

How Do You Prevent It: You must take full backups and use them as the basis for a restore process in combination with differential backups. Since you can’t simply restore a differential, this is the only way to make this work. Further, when you restore the full backup, it must be left in a recovering state in order to apply the differential. The script would look like this:

This will restore the database in question, but leave it in a recovering state. In that state a differential backup can be applied. If logs must be restored as well, then the differential restore must also leave the database in a recovery statement by using the NORECOVERY option.

Because differential backups are completely dependent on the full backup, it’s very important that you have a tested full backup available in order to work with differential backups. There is no shortcut around this requirement.

Breaking the Backup Chain

The Scenario: You’ve listened to all the advice and this time you’ve done things correctly. You have full backups, differential backups and log backups all running on monitored systems where you’re watching the drive space appropriately. An incident occurs and now you’re ready to run the restore. You grab your latest full backup from the appropriate location and run the restore leaving the database in the recovering state. Then, you run a restore using the latest differential backup, again leaving the database in the recovering state because you’re going to apply log backups. Only, instead of a successful restore, you get an error: This differential backup cannot be restored because the database has not been restored to the correct earlier state.

What Happened: Unknown to you, during the day, between when the full backups were run and the incident which required you to run a restore, a junior DBA had run a backup of the database in preparation for a production rollout, just in case. The differentials which were run, which you thought were based on the last, appropriately run, full backup were instead based on this ad hoc backup. Because of this, the differential restore operation, which knows which differential base it belongs to and keeps track of the log sequence number (LSN) that it started with, generated the error. You can see which base backup you should have by running a simple query in your database:

How Do You Prevent It: If you have a full set of differentials and logs and you’re planning on using them for restore operations, then you can’t take out-of-sequence full backups. You have two options that will allow you to get a backup. First, if you’re just taking a backup as a precaution during a rollout or update situation, instead of running a full backup, run a snapshot backup. They’re faster to create anyway and fully serve the purpose. Second, if you still want a full backup, then use the COPY_ONLY option. This will create a backup, but it will not update the differential base for that database. In fact, any backups using the COPY_ONLY option cannot be used as a differential base. COPY_ONLY also will not affect any of the log chains.

Restoring to the Wrong Server

The Scenario: You’re in a hurry because you’ve got three different people on the phone. You need to fix a security problem on the production server, troubleshoot a performance issue in development and restore a database in QA. You’ve got windows open to all three servers and you’re typing as quickly as possible to resolve all the issues. You’ve got the restore database all laid out and you run the restore command. You get a database is in use error. Typical. The QA team wants the database restored, but they’re logged in. It’s OK. If you change the options on the database to restricted user and put a rollback immediate on it, you can restore the database and they’ll just lose their connection. You hit the little execute button with the bang on it and right about then notice that you’re on the window that’s logged into production. Oops.

What Happened: You restored the database to the wrong server.

How Do You Prevent It: Strictly speaking, this isn’t a backup issue per se, but it is an extremely common event. Almost every DBA I’ve ever met has restored the wrong database to the wrong server at least once in their career. There is no sure fire way to prevent this issue. If you’ve got the appropriate privileges, you can make a mistake.

One of the best ways to mitigate this issue is to require a different login for production systems. If you set that up and request that your DBAs have to open a separate copy of SSMS using the Run As command, you can be fairly sure that you’ve got more separation between production and your other systems.

Another thing you can do to mitigate the problem is to never use the rollback immediate to kick users out of the system. Call them up and ask them to log out or use individual kill statements. Either way, extra effort is called for by you, which gives you one extra chance to notice that you’re on the wrong server.

Finally, there is a way to change the color of the information bar at the bottom of your query window in SSMS. It’s located within the “Connect to Database Engine” window in the “Connection Properties” tab. If you set this to a particular color on your machine for production servers, it makes it possible to know that you’re connected to a production system. There are also third party tools that do this in an even more sophisticated fashion.

Conclusion

Lots of things can affect your backups. Hardware failing, sunspots, electrical or magnetic impulses, who knows? But those are not the most common problems. The most common problems with backups are preventable simply by understanding how backups work and applying best practice processes to your backup routines. Put these common sense approaches to work and avoid the most common backup issues.