Handling Backups for Rapid Resilience

The backup and restore system in SQL Server hasn't changed a great deal over the years despite a huge growth in the typical size of databases. When disaster strikes, and an important service is taken offline while a restore is performed, there is often time to reflect on whether it might be possible to design databases for a more rapid recovery of the most critical parts of a database application.

In my previous article, Designing Databases for Rapid Resilience, I suggested some ways to improve your disaster recovery strategy by ensuring a robust design of your databases.

In this article I would like to take the topic a bit further, and set out some of the opportunities that already exist, and some potential opportunities, for maintaining a database service in the face of unforeseeable events.

Even if it is true that ‘any backup is better than no backup’, there is still a lot to think about when it comes to improving your backup and recovery strategy.

Types of database objects…

SQL Server Backup isn’t discriminating. It saves everything that is required for the restoration of a database, whatever its value. To SQL Server, a backup is nothing more than a collection of 8 KB pages written one after another to a backup file on request.

Much of what is backed up isn’t data; in fact, only a small minority of it may be. If you are using a lot of XML data, for example, and extracting data from it, the chances are that your indexes will dwarf the size of the original data. These indexes are quick to generate, yet SQL Server has no sense of their relative value and so includes them in the backup. Because a database isn’t functional until everything has been pulled out of the backup and recovery is complete, the service is likely to be down for longer.
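
As a rough illustration of this, the following query (a sketch only; it lumps heaps and clustered indexes together as ‘data’ and ignores some special cases such as the internal tables behind XML indexes) uses sys.dm_db_partition_stats to compare the space used by the data with the space used by the non-clustered indexes in the current database:

-- Rough split of used space between data (heaps and clustered indexes)
-- and non-clustered indexes in the current database.
-- index_id 0 = heap, 1 = clustered index, > 1 = non-clustered index.
SELECT
    SUM(CASE WHEN ps.index_id IN (0, 1) THEN ps.used_page_count ELSE 0 END) * 8 / 1024 AS data_mb,
    SUM(CASE WHEN ps.index_id > 1 THEN ps.used_page_count ELSE 0 END) * 8 / 1024 AS nonclustered_index_mb
FROM sys.dm_db_partition_stats AS ps
JOIN sys.objects AS o
    ON o.object_id = ps.object_id
WHERE o.is_ms_shipped = 0;   -- user objects only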

This is the safest way of doing it, but it means that, unless you are taking strategic advantage of filegroups with Enterprise edition, the database cannot function at all until everything is restored. (In Standard edition we can also restore one filegroup at a time, but with two serious disadvantages: an object cannot span more than one filegroup, and the entire database has to be OFFLINE while the filegroup is being restored.)
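
For example, in Enterprise edition a piecemeal restore can bring the primary filegroup online first and leave the less critical filegroups until later, while users are already working. The sketch below assumes the full recovery model, a single log backup and invented names (Sales, the Archive filegroup, the file paths), so treat it as an outline rather than a recipe:

-- Step 1: restore the primary filegroup only and bring the database online (Enterprise edition).
RESTORE DATABASE Sales
    FILEGROUP = 'PRIMARY'
    FROM DISK = N'D:\Backups\Sales_full.bak'
    WITH PARTIAL, NORECOVERY;

RESTORE LOG Sales
    FROM DISK = N'D:\Backups\Sales_log.trn'
    WITH RECOVERY;
-- At this point the objects on PRIMARY are available to the users.

-- Step 2: later, restore the remaining filegroup while the database stays online.
RESTORE DATABASE Sales
    FILEGROUP = 'Archive'
    FROM DISK = N'D:\Backups\Sales_full.bak'
    WITH NORECOVERY;

RESTORE LOG Sales
    FROM DISK = N'D:\Backups\Sales_log.trn'
    WITH RECOVERY;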

To put it simply, could any database be more quickly restored if only the data and essential database objects were backed up, the rest being recreated via a DDL/DRI script?

To consider what opportunities there may be, let’s look into what objects there are in a database. Each database has some of the following:

  • Heaps
  • Clustered indexes
  • Non-clustered indexes
  • Statistics
  • System data (users, settings, etc)
  • DMO data (data from the Dynamic Management Objects)

And all of these objects are persisted on disk, except for the DMOs – which are in-memory objects containing dynamically collected system data and can be queried as if they were database tables or functions.

I mention the DMOs here because, even though their data is not persisted on disk and does not survive a SQL Server instance restart or a database restore, it is still very important. If a disaster happens, it would be useful to have the database performance data as it stood at the time of the last backup.
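
A couple of quick examples of that: the first DMO below is exposed as a view, the second as a function, and both are queried with ordinary SELECT statements:

-- A dynamic management view: how the indexes of the current database
-- have been used since the last instance restart.
SELECT object_id, index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID();

-- A dynamic management function: fragmentation of the indexes in the current database.
SELECT object_id, index_id, avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED');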

The question of cost vs. value

Even though all of these objects are the building blocks of a database, each object can be measured in three aspects:

  • Performance value – what is the performance benefit when the object is present
  • Performance cost – what is the performance cost when the object is restored from backup
  • Importance – how vital it is to have the object in place in order to have a functional system on a minimum level

So, let’s look at each one of them:

  • Heaps – These cost a lot in I/O when restored from a backup, and are vital for the data (heaps are the database tables which have no clustered index)
  • Clustered indexes – These also cost a lot in I/O when restored from a backup, and are vital both for data availability and for query performance
  • Non-clustered indexes – These cost a lot in I/O when restored from a backup, but they are not vital for the data, although they may be vital for performance. The important question, of course, is whether a non-clustered index is used to ensure data integrity, as with a unique constraint, which does make it vital for the data.
  • Statistics – These have some cost in I/O, especially when they are rebuilt; they are vital for good performance, but not for data availability itself
  • System data (users, settings etc.) – This costs very little in I/O when restored from a backup and is vital for the integrity of a database. The system tables are a very important part of the backup, recording which users were created, which settings were adjusted and so on (for example, an index statistics rebuild setting that would otherwise be overwritten after a restore)
  • DMO data – This data cannot be included in a backup together with the database, but it is important for performance debugging, and so it is a really good idea to offload the DMO data to persisted storage whenever the database is backed up (a sketch of this follows the list)
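
A minimal sketch of that last idea: a DBA-owned table (dbo.IndexUsageHistory is an invented name) is populated as the first step of the backup job, so that the usage data survives alongside the backup itself.

-- One-off setup: a table to hold snapshots of index usage data.
CREATE TABLE dbo.IndexUsageHistory
(
    capture_time datetime2 NOT NULL DEFAULT SYSDATETIME(),
    database_id  int       NOT NULL,
    object_id    int       NOT NULL,
    index_id     int       NOT NULL,
    user_seeks   bigint    NOT NULL,
    user_scans   bigint    NOT NULL,
    user_lookups bigint    NOT NULL,
    user_updates bigint    NOT NULL
);

-- Run as a job step just before the BACKUP DATABASE command.
INSERT INTO dbo.IndexUsageHistory
    (database_id, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates)
SELECT database_id, object_id, index_id, user_seeks, user_scans, user_lookups, user_updates
FROM sys.dm_db_index_usage_stats
WHERE database_id = DB_ID();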

How do we back up each of them?

As of the time of writing this article I am not aware of any tool or method for selective backup on a database object level. However, here are a few potential ways that I believe would really improve the real-world DBA’s working life and the speed of disaster-recovery:

  • I would like to have a way to back up only the heaps, the clustered indexes and the system objects. If I could do this, then I would be able to restore the entire database much faster after a disaster, without having to perform the extra I/O for the non-clustered indexes. If this were feasible, I would prefer to have an easy way of generating scripts for only the non-clustered indexes (a rough sketch of one approach follows this list) and pairing them in the archive with the backups of the heaps, clustered indexes and system objects. This way, I could restore the data quickly, bring it online and then immediately start recreating the non-clustered indexes by executing the scripts.
  • There is a way to use an SMO script to create and archive the source script of the indexes, and also the statistics, of a database; this is a great way to do performance tuning off-site without compromising the security of the data. This method is also great for performance tuning of VLDBs. Let’s say we have a 1 TB database and we need to copy it to a laptop for some performance tuning: we do not need to go and buy a 1 TB hard drive; instead, all we need to do is script the schema, the indexes and the statistics and recreate the database on the laptop. The execution plans will be the same, even though the data is not present.
  • I would like to have an automated way to back up all the metadata from the DMOs related to a specific database with the click of a button, and keep it paired with the database backup. This would really help me keep track of performance data and would help maintain the databases by providing object usage statistics and so on.
  • I would like to have a way to determine the importance of objects (heaps and clustered indexes) and define which ones should go in a backup and which ones should not. For example, I would like to back up only the Sales and Customer data, but not the Vendor data, since only the first two are of critical importance to the business, and the last can be restored later on.
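
The first wish has no built-in answer, but the non-clustered index definitions can at least be approximated from the catalog views. The sketch below (which needs SQL Server 2017 or later for STRING_AGG, and which deliberately ignores included columns, filtered indexes and index options, so treat it only as a starting point) generates a basic CREATE INDEX statement for each non-clustered index in the current database:

-- Generate rough CREATE INDEX statements for the non-clustered indexes
-- of the current database (included columns, filters and options are omitted).
SELECT
    'CREATE ' + CASE WHEN i.is_unique = 1 THEN 'UNIQUE ' ELSE '' END
    + 'NONCLUSTERED INDEX ' + QUOTENAME(i.name)
    + ' ON ' + QUOTENAME(SCHEMA_NAME(o.schema_id)) + '.' + QUOTENAME(o.name)
    + ' ('
    + (SELECT STRING_AGG(QUOTENAME(c.name)
                         + CASE WHEN ic.is_descending_key = 1 THEN ' DESC' ELSE '' END, ', ')
              WITHIN GROUP (ORDER BY ic.key_ordinal)
       FROM sys.index_columns AS ic
       JOIN sys.columns AS c
           ON c.object_id = ic.object_id AND c.column_id = ic.column_id
       WHERE ic.object_id = i.object_id
         AND ic.index_id = i.index_id
         AND ic.is_included_column = 0)
    + ');' AS create_index_statement
FROM sys.indexes AS i
JOIN sys.objects AS o
    ON o.object_id = i.object_id
WHERE i.type = 2                 -- non-clustered only
  AND i.is_primary_key = 0
  AND i.is_unique_constraint = 0
  AND o.is_ms_shipped = 0;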

In my previous article (“Designing Databases for Rapid Resilience”) I covered some of the benefits of partitioning and how it can help with a faster restore after a failure.

In reality, however, it is only Enterprise edition which supports splitting the same object over multiple filegroups, which means that, in the case of disaster, the filegroups can be restored in order of importance.

In any other edition of SQL Server an object can be stored on one filegroup only, which still leaves room for the idea of restoring certain objects first after a disaster, provided they have been placed on their own filegroups.

In either case, however, we cannot avoid backing up the less important non-clustered indexes.

What this means is that we have to go through the I/O penalty of backing them up, and then sooner or later we will have to go through the I/O penalty of restoring them from a backup.

And when a disaster happens…

When a real disaster happens, we all hope that we have the most recent backups, that they have been tested and that they can be restored as fast as possible (ideally without any of the users or bosses experiencing any disruption).

In reality, if a disaster happens and part of the database is not available, what we can do is back up all the filegroups except for the damaged one.
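
A rough sketch of this, using invented database, filegroup and file names; the tail-log backup uses NO_TRUNCATE, which works even when the database is damaged:

-- Back up the filegroups that are still readable.
BACKUP DATABASE Sales
    FILEGROUP = 'PRIMARY',
    FILEGROUP = 'UserFileGroup2'
TO DISK = N'D:\Backups\Sales_surviving_filegroups.bak';

-- Capture the tail of the transaction log as well, so as little work as possible is lost.
BACKUP LOG Sales
TO DISK = N'D:\Backups\Sales_tail.trn'
WITH NO_TRUNCATE;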

Then the smart thing to do is to look in the cache to see if any data from the failed filegroup is still cached, so we can offload it to a temporary location.

Here is an example: if we have Table1 on UserFileGroup1 and the disk where UserFileGroup1 resides has failed, the chances are high that the data is still in cache if Table1 has recently been worked with.

In this case we can query the data as if there were nothing wrong with UserFileGroup1’s disk, and we can insert the queried data into a table residing on a different filegroup.
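
A sketch of that salvage attempt (Table1 and UserFileGroup1 come from the example above, while the database name Sales and the table dbo.Table1_Salvage are invented; the SELECT INTO will fail with an I/O error as soon as it needs a page that is no longer in cache, so it is strictly a last resort):

-- How many pages of each data file of the database are still in the buffer pool?
SELECT file_id, COUNT(*) AS cached_pages
FROM sys.dm_os_buffer_descriptors
WHERE database_id = DB_ID('Sales')
GROUP BY file_id;

-- Try to copy whatever is still readable into a new table, which is created
-- on the database's default filegroup rather than on the failed UserFileGroup1.
SELECT *
INTO dbo.Table1_Salvage
FROM dbo.Table1;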

Of course, the chances of all the data being available in cache are slim, and they diminish as time passes by. This is why we should have a very smart and tested alert system and smartly designed and tested database backups.

Could a better way of restoring a database be devised, one with the strategy of restoring at least a limited ‘critical’ functionality more quickly and then allowing the full range of less vital services to be restored incrementally in the background?

In versions of SQL Server other than Enterprise there are still opportunities to design for rapid recovery. In any version of SQL Server, the possibility exists of creating a database application as a series of databases on the same instance that are backed up separately, so that the most vital one can be restored first, whilst the others, if damaged, are merely built from script with only static data, with their functionality disabled in software. Then, as each database is restored, its functionality can be re-enabled. As well as requiring some clever scripting to automate the process, this cannot be achieved without planning and a careful analysis of the acceptable service level for each part of the application, just as I described in the first part of this series.
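
As a trivial sketch of the ‘functionality disabled in software’ idea, an agent job (or the application itself) could poll sys.databases and switch features back on as each database becomes available again; the database and table names here are entirely invented:

-- Re-enable the reporting module only once its database is back online.
-- SalesCore, SalesReporting and dbo.FeatureFlags are invented names.
IF EXISTS (SELECT 1
           FROM sys.databases
           WHERE name = N'SalesReporting'
             AND state_desc = N'ONLINE')
BEGIN
    UPDATE SalesCore.dbo.FeatureFlags     -- a hypothetical feature flag the application reads
    SET is_enabled = 1
    WHERE feature_name = N'Reporting';
END;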

To conclude

In conclusion, there are vitally important objects and not-so-important objects in every database and it would be very hard for a DBA to select only the relevant objects to be included in a backup; however, as described above, there are steps which can be taken to significantly minimize the impact of a database disaster.

As mentioned in the first part of the series, there are several ways to use smart database design to achieve great results; and, as described in this article, some ‘positive discrimination’ at the database object level could speed up the backup and restore processes, utilize system resources in a better way and, most importantly, minimize the downtime of our database systems.