Reliable Storage Systems for SQL Server

By validating the IO path before commissioning the production database system, and performing ongoing validation through page checksums and DBCC checks, you can hopefully avoid data corruption altogether, or at least nip it in the bud. If corruption does occur, you need to make the right decisions quickly to deal with it. Rod Colledge explains how a pessimistic mindset can be an advantage.

My previous article in this series, High Performance Storage Systems for SQL Server, focused on configuring storage systems to be capable of handling peak load. Of course, a high-performance system is useless unless it’s also reliable and corruption-free. This article will focus on some important tasks required before commissioning storage systems for production use, and a number of ongoing tasks that can help ensure that we avoid data loss through corruption.

Validating the IO Path: SQLIOSIM

Before a SQL Server system is deployed to production, it should be subject to a series of standard tests, in much the same way an aircraft is checked before takeoff. One of the most important checks is to ensure the disk system is reliable and suitable for use by SQL Server. In the previous “High Performance” article, we discussed the SQLIO tool, and how it can be used to measure the throughput capacity of the disk system. Here, we’ll focus on another tool, SQLIOSIM, also available as a free download from Microsoft. The primary aim of SQLIOSIM is to expose any potential issues with the storage system that may lead to physical disk corruption.

It’s important to understand that the “storage system” includes not only the physical hard drives, but all of the system components involved in reading and writing data to/from the drives, such as the BIOS, firmware, read/write cache, controllers, switches, drivers, operating system and even virus scanners. Each of those components “touches” the data in some way, and so has the capacity to corrupt it. The combination of these components is commonly referred to as the “IO path”.

Some vendor tools will “spot check” individual components, but what’s really needed is an end-to-end test of the whole IO path which simulates the read/write patterns of SQL Server and ensures that data is consistently written and read under heavy load. That’s exactly what SQLIOSIM is designed to do, and I consider its use to be an essential pre-deployment step before commissioning any SQL Server system.

In a similar manner, it should also be used to validate changes to a production system, such as the upgrade of drivers and/or firmware versions, and such checks should ideally be performed in a load-test environment that is configured identically to the production environment.

Finally, SQLIOSIM is essential in any post-mortem analysis following any instances of corruption in a production environment. In the heat of production downtime, it’s all too easy for the situation to descend into accusations and finger pointing (“It’s a hardware problem! No! It’s a SQL Server problem!”). SQLIOSIM serves as an independent and objective tool for pinpointing any problems that may have contributed to the corruption.

Ongoing Checks

Validating the IO path, before commissioning a SQL Server system for production use, is a great first step, but that’s all it is. It does not guarantee that your system will remain corruption-free in use. As such, we need to implement a series of ongoing checks to ensure that we detect corruption as soon as possible, thereby limiting the damage. This article will focus on two such methods: page checksums and DBCC CHECKDB.

Page Checksums

First introduced in SQL Server 2005, page checksums detect unintentional changes to the data SQL Server passes to the IO path to persist to disk. SQL Server does this by computing a checksum on each data page before it’s written to disk, and then recalculating that checksum the next time the page is read from disk. The checksum calculations are performed on a page-by-page basis, immediately before passing data to the IO chain (for disk writes) and immediately after receiving data from the IO chain (for disk reads). In a corruption-free environment, the checksums will be the same. If the checksums are different, then something in the IO chain, outside the control of SQL Server, has changed (corrupted) the data before/during/after the disk write.

Page checksums, set at an individual database level, are enabled by default on new databases created in SQL Server 2005 and later. Databases upgraded from earlier versions will need to be manually altered to enable checksums using the ALTER DATABASE command, for example:
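
    -- Enable page checksums on an upgraded database
    -- (the database name 'SalesDB' is a placeholder; substitute your own)
    ALTER DATABASE SalesDB SET PAGE_VERIFY CHECKSUM;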

Note that simply enabling the checksum option will not automatically detect IO path corruption issues; database pages need to be modified and re-read after enabling this option for any issues to be detected. A common technique to modify most pages after an upgrade is to rebuild all indexes. In addition to adding checksums, this will also ensure that the index statistics are created using the improved logic in the upgraded SQL Server version.
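
As a minimal sketch of that approach (the table name below is a placeholder), rebuilding a table’s indexes rewrites its pages, and every rewritten page receives a checksum:

    -- Rebuild all indexes on a single table; repeat for each table in the database
    -- ('dbo.SalesOrderHeader' is a placeholder name)
    ALTER INDEX ALL ON dbo.SalesOrderHeader REBUILD;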

Errors 823 and 824

SQL Server reports corrupted pages in both the SQL Server error log and the application event log, using Error 824. It also adds a record to the suspect_pages table in the msdb database. Error 824 is considered a logical IO error; that is, the page was successfully read from disk, but problems were detected with the checksum.

Error 823 is the physical version of this error; that is, the page simply could not be read because, for example, the disk was inaccessible. Both of these errors are high-severity errors (severity 24) and will be picked up, by default, by most monitoring systems.
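
If you want to review what has been recorded, the suspect_pages table can be queried directly; a simple example:

    -- Review pages SQL Server has flagged as suspect
    SELECT  database_id, file_id, page_id,
            event_type, error_count, last_update_date
    FROM    msdb.dbo.suspect_pages;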

Error 825: “A little-known sign of impending doom”

One important point to note about Errors 823 and 824 is that SQL Server will only report them after four successive failed read attempts. For example, the checksum comparison can fail three times and succeed on the fourth, and you won’t see error 824. In such cases, Error 825 will be raised; however, this error is classified as a low-severity message (severity 10). As such, unless you’re specifically monitoring for it, you’ll probably miss it. In a blog post, Paul Randal (blog | twitter) famously refers to this condition as “A little-known sign of impending doom”.

In summary, Error 825 is a sure-fire way of detecting that something is wrong with your IO path, and should be addressed as soon as possible. In the next article, we’ll address various monitoring techniques that can be used to detect errors such as 825.
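
As a taste of that monitoring, one low-effort option is a SQL Server Agent alert on error 825. The sketch below makes some assumptions: the alert name is arbitrary, and in practice you’d also attach an operator notification (via sp_add_notification) so someone actually hears about it:

    -- Raise a SQL Server Agent alert whenever error 825 is logged
    EXEC msdb.dbo.sp_add_alert
        @name       = N'Error 825 - read-retry required',
        @message_id = 825,
        @severity   = 0;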

DBCC CHECKDB

In addition to regular backups and restores using the WITH CHECKSUM option, a great way of verifying all page checksums is to use the DBCC CHECKDB command.
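
For example (the database name and backup path below are placeholders), a backup taken WITH CHECKSUM verifies existing page checksums as each page is read, and RESTORE VERIFYONLY re-verifies the backup without actually restoring it:

    -- Verify page checksums while backing up, and again when checking the backup file
    BACKUP DATABASE SalesDB
        TO DISK = N'X:\Backups\SalesDB.bak'
        WITH CHECKSUM, INIT;

    RESTORE VERIFYONLY
        FROM DISK = N'X:\Backups\SalesDB.bak'
        WITH CHECKSUM;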

Despite the addition of features such as page checksums, the DBCC CHECKDB command is still the premier technique for establishing maximum confidence in the integrity of the database. While page checksums detect differences between what was written and what is subsequently read, DBCC CHECKDB is much broader in scope, including checks for page linkage, system metadata, index ordering, and more.

Many people shy away from running CHECKDB, usually for one of two reasons. The first is simply the mistaken belief that it’s not needed (it is); the second is the length of time it can take to run, and the performance impact it can therefore have on databases which may already be close to the edge of acceptable performance.

Given the importance of what CHECKDB delivers, that is, maximum confidence in the integrity of the database, it’s important to find time to run it. Ideally, if the database is small, or the maintenance window large, you’ll be able to run a full DBCC CHECKDB command each night. If this is not possible, then there are three main techniques, discussed next, for reducing its impact. Each of these techniques can be combined as part of a single solution, to suit your needs.

Use the PHYSICAL_ONLY Option

The WITH PHYSICAL_ONLY option of the DBCC CHECKDB command will considerably shorten the run time by avoiding the (extensive) logical checks performed by the full command. That’s not to discount the value of the logical checks; they’re still important. A good combination might be to run full checks on a semi-frequent basis, such as once a week, alongside nightly PHYSICAL_ONLY checks.
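
A sketch of such a schedule, using a placeholder database name, might look like this:

    -- Nightly: lighter-weight checks of page checksums and physical structures
    DBCC CHECKDB (N'SalesDB') WITH PHYSICAL_ONLY, NO_INFOMSGS;

    -- Weekly: the full set of physical and logical checks
    DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;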

Run Full Checks Against a Restored Copy of the Database

If a database contains corrupt pages, then the corruption will be propagated into the backups, and also into the restored copy of the database. It follows that running DBCC checks against a restored copy of the database will, indirectly, validate the original source data. As a result, a good architecture pattern is to regularly restore full database backups of mission-critical databases into an alternate environment, such as onto a reporting server. These restored copies can then be used as the source for full DBCC checks.
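
A simplified sketch of the restore-and-check step on the secondary server, assuming placeholder database, logical file and path names:

    -- Restore last night's backup onto the reporting/check server
    RESTORE DATABASE SalesDB
        FROM DISK = N'X:\Backups\SalesDB.bak'
        WITH MOVE N'SalesDB_Data' TO N'D:\SQLData\SalesDB.mdf',
             MOVE N'SalesDB_Log'  TO N'E:\SQLLog\SalesDB.ldf',
             REPLACE, STATS = 10;

    -- Run the full checks against the restored copy
    DBCC CHECKDB (N'SalesDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;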

Not only does this pattern offload the DBCC impact, but it also validates the backups and serves as an additional data access point for applications that may not require up-to-the-minute data. The power of this pattern is fully realized when it’s automated, and we’ll spend more time covering this in the next article.

Performing Partitioned Checks

Finally, DBCC checks can be run against individual tables (DBCC CHECKTABLE) or filegroups (DBCC CHECKFILEGROUP). For very large databases, rather than running DBCC CHECKDB, consider running checks against a subset of tables or filegroups over a number of nights.
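
For example, with placeholder table and filegroup names, one night might check the largest tables and another night an entire filegroup:

    USE SalesDB;

    -- Night 1: check individual tables
    DBCC CHECKTABLE (N'dbo.SalesOrderDetail') WITH NO_INFOMSGS;

    -- Night 2: check every table in a single filegroup
    DBCC CHECKFILEGROUP (N'FG_Archive') WITH NO_INFOMSGS;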

When Disaster Strikes…

In the first article in this series, I stressed the importance of a pessimistic mindset when planning for disaster. Physical disk corruption is, unfortunately, not as rare as you might hope or think, and being prepared for it is obviously a critical part of being an effective DBA.

Validating the IO path before commissioning the production system, and performing ongoing validation through page checksums and DBCC checks, will go a long way towards ensuring corruption is avoided, or at least caught before it becomes widespread and results in significant data loss. However, should corruption occur despite your best efforts to avoid it, it’s important to make the right decisions in dealing with it. Here’s a very brief overview of the major recovery options:

Backups

Backups to a DBA are like oxygen to a scuba diver. Having easy access to a number of recent and validated backups is crucial in these situations; if all else fails, you have a known good point to go back to, and if backups are taken regularly, the data loss will be limited.

DBCC CHECKDB/PAGE

Knowing what is corrupted is an important step in deciding what to do about it, and the two main tools of the trade here are DBCC CHECKDB and DBCC PAGE. CHECKDB, covered earlier, will tell you the extent and location (pages) of the corruption. DBCC PAGE will allow you to peer into the contents of the pages on either side of the corruption, a useful technique in establishing the extent of the damage. Rather than cover these tools in detail, I refer you to Rodney Landrum’s excellent Simple-talk.com article, Finding Data Corruption. Among other things, he covers how to analyse the corruption output from DBCC CHECKDB, the usage of DBCC PAGE, and how to determine if a simple index rebuild will fix the corruption.
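
As an illustration only (DBCC PAGE is undocumented and unsupported, and the database name, file and page numbers below are placeholders), note that trace flag 3604 is needed to send its output to the client:

    -- Direct DBCC output to the query window rather than the error log
    DBCC TRACEON (3604);

    -- Dump page 153 in file 1 of the database, with full detail (print option 3)
    DBCC PAGE ('SalesDB', 1, 153, 3);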

PAGE Level Restore

Depending on the extent (and location) of corruption, a good recovery option to consider is restoring only the individual pages that are corrupt, rather than the whole database. Page-level restore, first introduced in SQL Server 2005, allows selected pages to be restored, and the enterprise editions of SQL Server allow this to be performed online. This option, described in Books Online, is a much better alternative to a whole-of-database restore: it is quick, and in the enterprise editions the user impact is either nil or very low.
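
In outline, and with placeholder page IDs and backup paths, the sequence is to restore the damaged page(s) from the most recent full backup, roll them forward with the existing log backups, and finish with a fresh log backup that is then restored to bring the pages fully up to date:

    -- 1. Restore just the corrupt page from the most recent full backup
    RESTORE DATABASE SalesDB PAGE = '1:153'
        FROM DISK = N'X:\Backups\SalesDB.bak'
        WITH NORECOVERY;

    -- 2. Apply any existing log backups
    RESTORE LOG SalesDB
        FROM DISK = N'X:\Backups\SalesDB_log.trn'
        WITH NORECOVERY;

    -- 3. Take a fresh log backup, then restore it to complete the recovery
    BACKUP LOG SalesDB TO DISK = N'X:\Backups\SalesDB_tail.trn';

    RESTORE LOG SalesDB
        FROM DISK = N'X:\Backups\SalesDB_tail.trn'
        WITH RECOVERY;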

REPAIR_ALLOW_DATA_LOSS

I’ve saved the best (or rather, the worst) till last! This CHECKDB option removes corruption by removing (deleting) the corrupted pages. Often the first choice of inexperienced DBAs, the REPAIR_ALLOW_DATA_LOSS option of the DBCC CHECKDB command will do exactly that: allow data loss. Depending on the extent of the damage, this option can lead to a large amount of data loss, and so should be the absolute last resort, used only if no other recovery option is available.
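
If you do reach that last resort, the database must be in single-user mode before the repair can run. A sketch, with a placeholder database name:

    -- Last resort only: the repair may deallocate (and therefore lose) corrupt pages
    ALTER DATABASE SalesDB SET SINGLE_USER WITH ROLLBACK IMMEDIATE;

    DBCC CHECKDB (N'SalesDB', REPAIR_ALLOW_DATA_LOSS);

    ALTER DATABASE SalesDB SET MULTI_USER;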

Summary

The worst examples of data loss through physical corruption are those in which the basic principles we’ve covered in this article were ignored. Following these simple steps will protect you from the ravages of unwanted corruption:

  1. Validate the storage system before production use with SQLIOSIM,
  2. Enable page checksums,
  3. Run DBCC checks on a regular basis,
  4. Monitor for errors 823, 824 & 825,
  5. Validate changes to the storage system using SQLIOSIM,
  6. Use SQLIOSIM as part of a post-mortem analysis following any actual corruption events,
  7. Backups, backups, backups!

Throughout this article, we’ve referred a number of times to the importance of monitoring for certain errors. In the next article, we’ll cover this as part of a broader look at the importance of automation in dealing with large and complex environments.