Disaster Recovery for SQL Server Databases

High availability depends on how quickly you can recover a production system after an incident causes a failure. That requires planning and documentation. Get a Disaster Recovery Plan wrong, and an incident can become a catastrophe for the business. Hugo Shebbeare discusses some essentials, describes a typical system, and provides sample documentation.

In this article, I’ll lay out the technical details of implementing a simple Disaster Recovery Plan (DRP) for production applications running Microsoft SQL Server. My goal is to provide you with generic documentation to use as the basis of your own production system failover strategy. You will, of course, need to alter it with your own details and keep it updated any time that changes are made to your production systems, but this should give you a good departure point from which to build your own strategy.

I’ll describe the steps to follow in the event of the failure of a database production system, and annotate the process as I go along. This is largely based on a Disaster Recovery Plan I had to design recently (all the better for you to download and personalize), so it is deliberately written in the style of a business strategy document. I’ll also explain the advantages of automatically restoring compressed backup files to a failover server. I’ll be the first to admit that this topic might seem a little dry, but having a DRP in place will make it worth the read – I promise.

Part 1 of this article will describe the basic steps necessary to set up a ‘hot’ standby server (the recovery method I used when drafting this DRP), and Part 2 is an annotated transcript of the Disaster Recovery document, including the steps to be taken in the event of a disaster, and information for the unfortunate DBA tasked with recovering from it. Here we go:


PART 1 – Automatic Restoration of backup files to a failover server

SQL Server’s norecovery mode keeps the database in a stable, restoring state, ready to accept the changes you progressively apply as part of the restore process. This means that it’s only necessary to apply the latest differential or log backup before the database is ready to be accessed by users and applications.

The disaster recovery method used is to have a ‘hot’ standby server (SQL2), which is already installed, stable and, most importantly, an exact copy of the production server’s configuration. The standby server should already have the most recent operational databases fully restored in norecovery mode.

Implementing a Hot Standby Server

After SQL Server has been installed on the failover server, you need to check that Robocopy is present in the %SystemRoot%\System32 folder. Secondly, Red Gate’s SQL Backup software must connect to the server and, if it has not been done already, be configured by clicking the small grey square next to the server listing in the left pane – this triggers the auto-configuration of the instance (Figure 1).


Figure 1 – SQL Backup’s auto-configuration system.

Robocopy is much better than the soon-to-be-discontinued Xcopy, by the way. Since Windows Server 2003, Robocopy has been the recommended, future-proofed tool of choice, and as far as I know Xcopy will no longer be available in future versions of Windows Server.

Next, for the stored procedures that execute Robocopy (we place these procedures in a local database on each server called DBA_tools), you need to enable the advanced option xp_cmdshell; the standard sp_configure steps are shown below:
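    -- Enable xp_cmdshell (requires sysadmin); these are the standard
    -- sp_configure steps, so adjust them to fit your own security policy.
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'xp_cmdshell', 1;
    RECONFIGURE;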

In order to copy the backup files, each database needs a database-specific SQL Server Agent job on the standby server that runs Robocopy at the required interval, copying full and differential backups from the production server to the standby server. These jobs can be run at whatever frequency is needed, be it daily, hourly or even more often if your operations require it.

Robocopy is the first step in all automated restore jobs, unless you want to add validation steps prior to the backup file copy. The following example (the server and folder names are placeholders) copies all differential database backups from the production server to the DRP server:
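    -- Illustrative only: copy compressed differential backups (*.sqb) from a
    -- production share to the DRP server's local backup folder.
    -- Substitute your own share names, folders and file mask.
    EXEC master.dbo.xp_cmdshell
        'robocopy "\\SQL1\Backup\Diff" "D:\DRPBackups\Diff" *.sqb /Z /R:3 /W:30 /NP /LOG+:"D:\DRPBackups\DBlog\robocopy_diff.log"';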

A database-specific SQL Server Job will restore these backups daily to the hot standby server (DRP) using stored procedures specifically created for this setup, such as:

  • usp_DB_Restore_Master or usp_DB_Restore_Master_Multi
  • usp_DB_Restore
  • usp_DB_Restore_NoRecovery
  • usp_DB_Restore_differential
  • usp_DB_Restore_Log

A consideration for the DBA regarding the level of database recovery

If your production databases are currently in Simple recovery mode, consider switching them to Bulk-Logged so that regular transaction log and differential backups (as in, several times a day) can be taken, allowing you to restore up to a specific point in time. This will naturally minimize the amount of work lost from the session prior to any downtime.

Full recovery mode is recommended for critical databases and for those that must meet auditing compliance requirements.
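Switching the recovery model is a single statement per database (Database1 is a placeholder name):

    -- Database1 is a placeholder; run once per database, then take a full
    -- backup so the new log backup chain has a starting point.
    ALTER DATABASE Database1 SET RECOVERY BULK_LOGGED;
    -- or, for critical databases:
    ALTER DATABASE Database1 SET RECOVERY FULL;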

In the event of failure, the most recent log or differential backup is ready to be applied to the standby database sitting in norecovery mode, and you’re up and running quickly with minimal down-time.

An alternative method for a much smaller database, where the total restore time is below five minutes, is to apply the complete restore every hour to the failover server, in which case you don’t need to worry about norecovery mode.

PART 2 – Instructions to follow in the event of a disaster to the production system

  1. If you haven’t heard from them directly already, please contact FIRST LINE DBA SUPPORT at [INSERT NUMBER] or SECONDARY DBA at [INSERT NUMBER].
  2. After the failure of the production/original data publisher server (SQL1), the restore/backup-subscriber server (SQL2) will be used as the primary database server (a.k.a. the DRP server). Inform everyone in the department by e-mail (it’s also worth deciding in advance who will inform internal/external clients).
  3. Once SQL1 is actually down and the switch to the DRP server occurs, all application connection strings need to be changed to point at SQL2. The CGI should handle this step automatically.
  4. Disable the automatic restore SQL Agent jobs on SQL2.
  5. Disable all SQL Agent jobs on the failed server SQL1, if possible.
  6. Enable all maintenance and backup jobs on the newly active server SQL2.

Please note that restoring a log backup is not possible if the production database’s recovery model is set to Simple. For fine-grained restoration, the database needs to have been using the Full recovery model – thankfully, Full is the default setting. If point-in-time recoveries are requested by management on a regular basis, we can set the recovery model to Bulk-Logged where log space is an issue, and Full otherwise – perhaps with deserved hesitation on the part of the Database Administrators, since Bulk-Logged, although more space-efficient for bulk operations, does not allow point-in-time restores through log backups that contain bulk-logged operations.

Automating the restore of compressed backups brings further benefits to your production environment. Ideally you should keep two full copies, one on the Test server and one on the DRP server. Having this second copy of the production databases allows you to run useful but intensive work that you don’t want to run on the live databases, such as a full DBCC CHECKDB – a database console command that checks the integrity of your exact restored copy.
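For instance, an integrity check can be run against the restored copy on the standby server instead of the live system (Database1 is a placeholder name):

    -- Run a full integrity check against the restored copy, not the live database.
    DBCC CHECKDB (N'Database1') WITH NO_INFOMSGS, ALL_ERRORMSGS;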

A log of what has been restored shall be placed in the following directory:

\\DatabaseServerName\drive$\prodBackupDir\DBlog\

As soon as a restore has completed, old backups should be purged automatically – perhaps every week (keep a maximum of 14 days if purging manually, or use a SQL Server Maintenance Plan clean-up task), which can also be automated with a batch file or PowerShell script.
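A minimal sketch of such a purge, assuming the compressed backups sit in a folder like D:\DRPBackups\Full and can simply be deleted after 14 days:

    -- Delete compressed backups older than 14 days from the DRP backup folder.
    -- Path, file mask and retention period are placeholders; adapt to your layout.
    EXEC master.dbo.xp_cmdshell
        'forfiles /P "D:\DRPBackups\Full" /M *.sqb /D -14 /C "cmd /c del @path"';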

To ensure a smooth restore process, we should read the restore parameters directly from the backup history system tables – such as backupset and backupfile in msdb, or a BackupHistory/BackupLog table if one has been explicitly created in a local database. This ensures that the essential restore parameters (such as the backup file name and position) are immediately available.
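For example, the most recent full backup file for a database can be pulled straight from msdb (Database1 is a placeholder name):

    -- Most recent full backup file, position and finish time for a database,
    -- read from the msdb backup history tables.
    SELECT TOP (1)
           bs.database_name,
           bs.backup_finish_date,
           bs.position,
           bmf.physical_device_name
    FROM   msdb.dbo.backupset AS bs
    JOIN   msdb.dbo.backupmediafamily AS bmf
           ON bmf.media_set_id = bs.media_set_id
    WHERE  bs.database_name = N'Database1'
      AND  bs.type = 'D'   -- 'D' = full, 'I' = differential, 'L' = log
    ORDER BY bs.backup_finish_date DESC;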

As I often set them, the SQL Agent restore jobs have their parameters set manually during testing and are usually left that way – but of course it’s best to pull the metadata directly from the system, in case you move files around and forget to update the restore scripts.

SQL1 & SQL2 (Prod. & DRP) Server Hardware Configuration

This is the configuration for the servers this document was originally written for (I don’t remember the System Models for that setup, but that’s not to say you shouldn’t record yours). Update the following with your own server properties.

SQL1 (production instance)

1.1  Server Type: Windows 2008 (Standard x64 Edition)
1.2  System Model: [Server Model Number, Product Type]
1.3  RAM: 8 GB
1.4  No. of CPUs: 2
1.5  CPU & Speed: AMD (x64)
Drives (Hard Disk Space): C(#G); D(#G)

SQL2 (storage replication partner / hot standby restore-subscriber)

1.1  Server Type: Windows 2008 (Standard x64 Edition)
1.2  System Model: [Server Model Number, Product Type] – same as SQL1
1.3  RAM: 9 GB
1.4  No. of CPUs: 2
1.5  CPU & Speed: AMD (x64) Opteron Processor 280
Drives (Hard Disk Space): C(#G); D(#G); F(2TB); G(250GB); H(1.5TB); Z(20GB)

This server should have terabytes and terabytes of space, depending on your archiving needs.

SQL Server Configuration

For a previous client our production build of SQL Server was 9.0.3152, so naturally the DRP server had to be the exact same build – both systems must be as identical as possible.

Our servers run 64-bit versions of the SQL Server 2005/2008 Database Engine, with at least Service Pack 2 (2005) or Cumulative Update 3 (2008) installed, and the collation is Latin1_General_CI_AS (an accent-sensitive collation is recommended). For SQL Server 2005 it is preferable to be on at least Cumulative Update 8 or SP3, and it’s important to bring the DRP server up to the production build level of SQL Server on a regular basis.
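A quick way to confirm that the production and DRP instances match on build, edition and collation:

    -- Run on both SQL1 and SQL2 and compare the results.
    SELECT SERVERPROPERTY('ProductVersion') AS ProductVersion,
           SERVERPROPERTY('ProductLevel')   AS ProductLevel,
           SERVERPROPERTY('Edition')        AS Edition,
           SERVERPROPERTY('Collation')      AS ServerCollation;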

Detailed information for the servers and databases is included in the compiled help file located on both servers, SQL1 and SQL2:

D:\DRP\ServerName.chm (i.e. make it very easy to find DRP info)

Critical SQL Server User Database Details

1. List of databases

Database1

Database2

NB: We will not be restoring master, msdb or model directly onto the restore subscriber – these system databases are backed up on a regular basis and their backup files are copied over by Robocopy (tempdb is rebuilt at start-up and is never backed up).

 2. Database Maintenance Plan and Auto-Restore.

In general, our database restore plan will mirror the backup schedule exactly, waiting for backups to finish by querying the metadata on the production server. The restore jobs check whether the day’s full backup (or daily differential) has completed, using the backupset.backup_finish_date column. Once the full backup has completed on the production server, we copy the backup file over to the hot standby server. In the second step of the job, we execute the code from the appropriate usp_DB_Restore procedure, combined with the metadata extracted from the system tables.
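A sketch of that first check, assuming a linked server named SQL1 pointing at production and a placeholder database name:

    -- Hypothetical first step of the DRP restore job: proceed only once today's
    -- full backup of Database1 has finished on the production server.
    IF EXISTS (
        SELECT 1
        FROM   SQL1.msdb.dbo.backupset
        WHERE  database_name = N'Database1'
          AND  type = 'D'   -- 'D' = full database backup, 'I' = differential
          AND  backup_finish_date >= DATEADD(DAY, DATEDIFF(DAY, 0, GETDATE()), 0)
    )
        PRINT 'Full backup complete - continue with the copy and restore steps.';
    ELSE
        RAISERROR('Full backup for Database1 has not finished yet.', 16, 1);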

3. Database Backup schedule in production

Maintenance Job Name    Maintenance Job Description          Freq          Time to Run
BackupFull_Database1    Full database backup of Database1    W (weekly)    Sunday 6:00
BackupFull_Database2    Full database backup of Database2    W (weekly)    Sunday 6:30

4. Restore jobs on DRP server

Maintenance Job Name    Maintenance Job Description          Freq          Time to Run
BackupFull_Database1    Full database backup of Database1    W (weekly)    Sunday 6:00
BackupFull_Database2    Full database backup of Database2    W (weekly)    Sunday 6:30

Critical Scripts, Procedures and Programs related to disaster recovery

Following is a list of all the code used for the DRP process from SQL1 to SQL2:

  • usp_DB_Restore_Master
  • usp_DB_Backup & usp_DB_Restore
  • usp_DB_Restore_NoRecovery – same as above, but for databases that need to be left in norecovery mode (e.g. waiting for a differential or log backup to be applied)
  • usp_DB_Restore_Differential
  • usp_DB_Restore_Log – should allow multiple logs to be applied automatically
  • usp_RoboCopy
  • usp_KillConnections

System Database Backups

On the DRP server itself, the backups of msdb and the DBA databases, which are critical to this whole DRP process, are located here:

\\DRPServerName\DRPbackupFolder\Full

There should always be an alternative local backup location for the system databases, such as on the Test server.

All DBA and system databases are also backed up to:

\\TstServerName\TestSrvBackupFolder\Full

The following example was tested on a primary test server and exists on the restore server. The usp_DB_restoreX stored procedures take six input parameters. To line the restore up with the backup-log metadata, we match the database name by date and then pass the relevant restore file parameter into the appropriate usp_DB_restoreX stored procedure. The master restore procedures, divided into single-file and multiple-file versions, call all the sub-procedures to carry out the actual restore process.

Please note that the usp_DB_RestoreX stored procedures depend on usp_KillConnections, which helps the restoration along by killing existing user connections to the database (system sessions are left untouched).
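The actual procedure is not reproduced here, but a minimal sketch of the idea looks like this (the name and logic below are illustrative only):

    -- Illustrative sketch, not the author's procedure: kill user connections
    -- to a database so an exclusive restore can proceed. Session IDs of 50 and
    -- below are reserved for system sessions and are skipped.
    CREATE PROCEDURE dbo.usp_KillConnections_Sketch
        @DatabaseName sysname
    AS
    BEGIN
        DECLARE @sql nvarchar(max);
        SET @sql = N'';

        SELECT @sql = @sql + N'KILL ' + CAST(spid AS nvarchar(10)) + N'; '
        FROM   master.dbo.sysprocesses
        WHERE  dbid = DB_ID(@DatabaseName)
          AND  spid > 50
          AND  spid <> @@SPID;

        EXEC (@sql);
    END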


The stored procedure usp_DB_restore_norecovery is the same as usp_DB_restore, except that it leaves the database in norecovery mode (as described earlier in the article).
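At its core it issues a restore that finishes with NORECOVERY. A native T-SQL equivalent would look like the following sketch (in our setup the command is actually driven through Red Gate SQL Backup, since the files are compressed .sqb backups; the database and file names are placeholders):

    -- Leave the database in a restoring state so that differential and log
    -- backups can still be applied afterwards.
    RESTORE DATABASE Database1
    FROM DISK = N'D:\DRPBackups\Full\Database1_FULL.bak'
    WITH NORECOVERY, REPLACE, STATS = 10;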

Please view the Activity History in Red Gate SQL Backup for reporting on which databases have been backed up, as the scope of this document covers the restore process only. Although the backup information is extracted to prepare the automated restore scripts within the jobs, we are not going to create customised backup reporting (at least at this stage). However, do not forget that, since we are using these scripts within SQL Server Agent jobs, we will have histories for each step, plus a log file written to the \DBlog\ folder local to the disaster recovery server running the jobs.


Figure 2 – SQL Backup Activity Log

Database Restore method when applying Differential Backups.

Please note that we use usp_restore_db_norecovery to load a production backup from the local copy moved over by Robocopy. Thus, when executed against the DBA database of the DRP server (SERVER NAME / INSTANCE NAME), the call looks something like the following (the parameter names are illustrative; the real procedure takes six parameters that are not reproduced here):
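    -- Illustrative call only; substitute the actual parameter list of your
    -- own usp_restore_db_norecovery procedure.
    EXEC DBA_tools.dbo.usp_restore_db_norecovery
         @DatabaseName = N'Database1',
         @BackupFile   = N'D:\DRPBackups\Full\Database1_FULL.sqb';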

This is the core of what runs in the second step of an automated job that leaves the database in norecovery mode; the job should then call the respective RestoreDiff_dbx step next and, finally, apply the log files via RestoreLog_dbx.

After the restore, make sure to run several tests that ensure the integrity of the data and that typical applications can run normal operations on the database.


Summary

Is this disaster recovery method really minimizing manual intervention after a failure? Can we make it better? Yes and yes – there’s always room to improve. More importantly, this method certainly doesn’t suit every environment. Before you take what I’ve put together here and run with it, I strongly recommend you look at the High Availability Options table below to get a clear picture of which methodologies might be more appropriate for your individual needs. To make an effective choice, you’re naturally going to need a detailed understanding of each client’s needs for the restore process.

High Availability Options

Solution                     Cost     Complexity   Failover                                Failback
Hardware Clustering          High     High         Fast                                    Fast
Software Clustering          High     High         Fast                                    Fast
Replication                  Medium   Medium       Medium, with manual processing          Slow, with manual processing
Continuous Data Protection   Medium   Medium       Medium                                  Slow
Log Shipping                 Low      Low          Medium                                  Slow
Backup and Restore           Low      Low          Slow                                    Slow
Database Mirroring           Low      Low          Fast, but only at the database level    Fast, but only at the database level

At the time of writing, our backup-and-restore was painfully slow – at least 13 hours before we were live on the warm standby – but with some optimization we should be able to get that down to around two hours.

Nobody wants to go through a disaster without being properly prepared. When I was asked to prepare a plan for Canada’s largest institutional fund manager, I took it rather seriously, hence the length of this document.  We ran this through a real disaster recovery test over a weekend, and it all worked out just fine.  I’ve tried to share with you exactly how you can get your own disaster recovery plan in place, so that when the time comes at least the recovery step itself is not a disaster.