The most critical task for all DBAs is to have a Backup and Recovery strategy that ensures, every day, that in the event of a disaster they can restore and recover any database, within acceptable limits for data loss and downtime. Even with all the required backups in place, it's easy to miss subtle failings in the overall plan that can, and eventually will, defeat your recovery plans.

In this, Lesson 2 of the series "Top 5 hard-earned lessons of a DBA" (Lesson 1 was The Case of the Missing Index) we tackle one such failing, which is lack of proper backup verification.

How CHECKSUM works

Whether there are 10 databases in our environment or 1000, we must ensure that all backups are valid and usable. Without good backups, we will be in a very bad spot when the time comes to bring some data back from the dead.

The default mechanism by which SQL Server detects corruption in data pages is via use of checksums. Whenever it writes a data page to disk it calculates a checksum and stores the value in the page header. Subsequently, when it reads that page into memory, it recalculates the checksum and compares it to the value stored in the header. If the values match, then no corruption occurred while the page was stored in I/O subsystem; if they don't match then SQL Server raises a corruption error, which might look something like this:

Msg 824, Level 24, State 2, Line 1
SQL Server detected a logical consistency-based I/O error: incorrect checksum (expected: 0x9a3e399c; actual: 0x9a14b99c).
It occurred during a read of page (1:184) in database ID 23 at offset 0x00000000170000 in file

'D:\SQLDATA\ForcingFailures.mdf'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.

BACKUP...WITH CHECKSUM for finding on-disk corruption

By enabling checksums during backup operations, as shown in Listing 1, we can ensure that every page on disk, to be included in the backup, is free from this form of corruption.

TO DISK= '<Backup_location>.bak'
Listing 1: Backup WITH CHECKSUM syntax

SQL Server will recalculate checksums as it reads each page to back up. If it finds a page that fails this test, the backup will fail. If the backup succeeds then the backup is valid...or maybe not.

Problems with in-memory corruption

In fact, occasionally, checksum validation can offer a false sense of security and has gotten many DBAs into trouble. It does not guarantee that a database backup is corruption free. The CHECKSUM only verifies that no corruption occurred between writing the data to disk and reading it back out. It does not detect in-memory corruption.

If a data page cached in memory is updated, and then some in-memory corruption occurs (perhaps due to a faulty memory chip) before the modified page is written back to disk, then when the time comes, SQL Server will simply calculate the checksum value on the corrupted page and write it to disk. When it reads that page during a subsequent BACKUP...WITH CHECKSUM operation, it will recalculate the checksum and, assuming no additional problems occurred in the I/O subsystem, the values will match and the corrupted page will be included in the backup.

" ensure that your data pages are totally corruption free... run DBCC CHECKDB on a regular basis."

Schedule regular restores and run DBCC CHECKDB with SQL Backup Pro.

The most effective way to ensure that your data pages are totally corruption free is to run DBCC CHECKDB on a regular basis. This tool fully validates page structure and content, will check the logical and physical integrity of all the objects in the specified database, and so will catch corruption as early as possible.

A multi-pronged Defense to Data Corruption

In short, the lesson for the DBA here is that he or she requires a multi-pronged defense to the threat of data corruption!

BACKUP...WITH CHECKSUM is a useful tool and I recommend its use. Enabling checksums (and other checks, such as torn page detection) will bring with it only a small CPU overhead and should only have a minimal impact on the speed of your backups.

In addition, schedule regular consistency checks for each database, using DBCC CHECKDB. Weekly checks are optimal, but on a server where the spare CPU and disk capacity are not readily available then monthly checks are better no checks. You could also restore databases from a server that cannot sustain weekly integrity checks to a secondary server to perform the maintenance there.

In addition, again, it's a great idea to perform some test restores. This may seem obvious, but there many DBAs simply assume that their backups are good and let them sit on the shelf. We don't need to restore every backup in the system to check its health, but doing random spot checks now and again is an easy way to gain peace of mind regarding future restores, and to hone the recovery strategy. Each week, choose a random database, and restore its last full backup. If that database is subject to differential and log backups as well, choose a point-in-time test restore that uses a full, differential and a few log backup files