Confessions of a DBA: My worst mistake

Over the next few months, we'll be asking various well-known DBAs to describe their worst disaster caused by a mistake they made. To kick off the series, we asked Phil Factor to confess. He came up with a classic: The mistaken belief that a backup WITH CHECKSUM guaranteed a good backup that could be restored, and the ensuing disaster.

I suspect that pretty well all serving DBAs will have memories that make them cringe with embarrassment. Even the dignified figures who give such authoritative presentations at PASS will once have made mistakes that, I imagine, now make them whistle at the memory. Although my most embarrassing mistake is already documented as The Computer that Swore, my worst mistake was the result of assuming that a backup WITH CHECKSUM guaranteed a good backup that could be restored. It looked good, but I didn’t reckon on the possibility of a data page that gets corrupted whilst in memory and is then written to disk with a perfectly valid checksum. I also thought, then, that all pages had checksums; they don’t. Now, of course, I’ve sat through a couple of Paul Randal’s presentations in which he explains these, and other, ways in which data can be backed up even though it is corrupt. WITH CHECKSUM will catch some errors, but what you need to catch is every error that would cause a bad restore. No, there is no substitute for checking that a database can be restored by actually restoring it and running DBCC CheckDB on it.
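For anyone who hasn’t yet been bitten by this, here is a minimal T-SQL sketch of the assumption I was making; the database name and file paths are invented for illustration.

    -- The backup I trusted: WITH CHECKSUM verifies each page's existing
    -- checksum as it is read, and adds a checksum over the backup stream.
    BACKUP DATABASE Trading                       -- 'Trading' is an invented name
       TO DISK = N'D:\Backups\Trading_Full.bak'
       WITH CHECKSUM, INIT;

    -- RESTORE VERIFYONLY re-checks those checksums without restoring anything...
    RESTORE VERIFYONLY
       FROM DISK = N'D:\Backups\Trading_Full.bak'
       WITH CHECKSUM;
    -- ...but a page that was corrupted in memory and then written out with a
    -- valid checksum, or an older page that never had a checksum at all,
    -- sails through both of these without a murmur.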

I was developing a database for a web startup. It was an ambitious system for wholesale commodity trading, with a complex data model and a lot of data, originally held messily in a number of different ‘legacy’ systems. The process of sorting out this ‘donkey’s breakfast’ of data was a nightmare, but the time came when I was able to release to production a functional application that allowed data entry. It had, at that time, around 150 concurrent users, though the comms bandwidth throttled this enough to make it manageable. I set the database to full recovery, with a weekly full backup and transaction log backups every ten minutes. We’re going back a few years, as you’ll appreciate when I tell you it was a backup onto DAT tape. The tapes were recycled on a rotation that let me go back six weeks, just to be super-sure.

When I first got the ‘SQL Server Fatal Error Table Integrity Suspect’ error, I ran DBCC CheckDB and determined that a couple of tables were corrupted. Nowadays, we get excellent advice on what to do; back then it was different. Books Online at the time incorrectly advised a restart, and it didn’t help. We all have days like this, so, thinking that some sudden hardware failure had struck the machine, I restored from backup onto another, clean machine. As the helpful text in BOL put it, ‘Restore from the latest known good backup’. The restore failed, leaving the database in ‘suspect’ mode. Then, of course, came the long, slow process of going back through the backups with the DBCC tools to work out where the corruption had happened. I got to the oldest surviving backup, the one I’d taken six weeks before. It wouldn’t restore either. I had no ‘known good’ backups; quite the contrary, I knew that I hadn’t. The data corruption had been there for some time, festering and growing, until the point at which it triggered the fatal error.
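For what it’s worth, the diagnosis step today would look something like this (the names are, again, invented): CheckDB reports every damaged object in one pass, and a single suspect table can then be checked on its own.

    -- Report every error, and nothing but the errors, for the whole database.
    DBCC CHECKDB (Trading) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- Once CHECKDB has named the damaged objects, an individual table can be
    -- examined on its own ('dbo.Trades' is an invented name).
    DBCC CHECKTABLE ('dbo.Trades') WITH NO_INFOMSGS, ALL_ERRORMSGS;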

I scratched my head. It is an odd phrase, ‘the latest known good backup’. How do you know, for certain, unless you do a restore? I’d naively assumed that backing up with checksum guaranteed that it would restore. Wrong. The only way you can ‘know’ is to restore it and then run DBCC CheckDB on it. It is an interesting thought, and no backup system will save you from having to do that task.

This is the sort of moment in one’s professional career when one briefly considers donning a false beard and setting off immediately for Australia under an assumed name. It was certainly a painful moment, having to tell the company directors that the database would be offline for a while. There were not only the traders to think of, but also a lot of people inputting information. The painful process of determining what was wrong and putting it right is a fascinating tale, but not relevant to this story. Suffice it to say that a skeletal service was restored within a day, and the database burst back into full life after two working days and a weekend. A hardware failure had, undetected, gradually increased the corruption of the data pages until it caused the fatal error.

I cannot really offer an excuse. The best I can do is to say that I didn’t have the time to do a routine DBCC CheckDB on a restored backup, which was probably true; but nowadays one can script this check, and there are third-party tools (e.g. SQL Backup Pro) that cut down the time required by running DBCC CheckDB on what is, in reality, a backup rather than a restored database.
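A bare-bones version of such a scripted check, with invented names and paths, and assuming the logical file names have already been looked up with RESTORE FILELISTONLY, might look like this:

    -- Restore the most recent backup to a throwaway copy of the database...
    RESTORE DATABASE Trading_Verify
       FROM DISK = N'D:\Backups\Trading_Full.bak'
       WITH MOVE 'Trading'     TO N'E:\Scratch\Trading_Verify.mdf',
            MOVE 'Trading_log' TO N'E:\Scratch\Trading_Verify_log.ldf',
            REPLACE, STATS = 10;

    -- ...prove that what was restored is internally consistent...
    DBCC CHECKDB (Trading_Verify) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- ...and tidy up afterwards.
    DROP DATABASE Trading_Verify;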

Even with the help of PowerShell, SMO and all the third-party tools, those DBAs with, say, a hundred production servers in their care will soon work out that the available time and hardware don’t permit verification of every backup. Tom Larock came up with a useful technique: use statistical sampling to get as close to comprehensive checks as the time allows.
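I won’t reproduce his method here, but the idea, in a crude T-SQL sketch of my own, is simply to draw a fresh random sample of databases on each run and give their latest backups the full restore-and-CheckDB treatment:

    -- A minimal sketch of the sampling idea (not Tom Larock's actual script):
    -- each run, pick a random handful of user databases whose latest backups
    -- will be restored and checked in full.
    SELECT TOP (10) name
    FROM   sys.databases
    WHERE  database_id > 4          -- skip the system databases
    ORDER  BY NEWID();              -- a different random sample every run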

One sideline in this story is the skeletal service. A valuable safety-net in the case of a major disaster is to be able to provide, quickly, a service that is sufficient to let the business continue. Each database is different, and it takes planning and rehearsal, but I generally find that a small, restricted subset of the service can be provided quickly from a ‘skeletal’ database containing only an essential subset of the data. I always have scripts and data standing by on the server, ready for this remote eventuality, and fortunately I did then.

Whatever else one takes from this story, the one thing I hope you will take is the idea that the only high-availability strategy that makes sense is a ‘Restore’ strategy, rather than a Backup strategy.
