Rollback and Recovery Troubleshooting; Challenges and Strategies

What happens if your database deployment goes awry? Do you restore from a backup or snapshot and lose all the data changes that have happened since then? Do you prepare rollback scripts to revert the changes whilst preserving the data? Do you branch by abstraction and toggle off the changes? Have you a blue-green deployment that can be switched? Do you quickly roll forward?

DevOps, Continuous Delivery & Database Lifecycle Management
Automated Deployment

What happens when it all goes wrong? You tested the changes that you’re making to your database even before you checked those changes into source control. The continuous integration process tested them again. You went through more testing when working with the QA team. You finally tested the scripts against a copy of your production database in your staging environment. You had every reason to believe that the script was going to go just fine. But here you are, looking at a sea of red in the Messages window in Management Studio because your carefully validated deployment script just crashed and burned. Now what?

As a data professional or database administrator, one of your prime responsibilities is the protection of the business data. Changes to the structure and code of your databases can go seriously wrong, leading to down-time and data loss. Obviously, you’ll do anything possible to prevent this happening but this will just reduce the probability of things going wrong. The chance still exists of having a failed deployment, and you need to have effective ways of recovering from an event like this as quickly and effectively as possible. Have you the best possible ways to ensure that you can smoothly recover from your deployment disasters? What are the trade-offs of these various approaches? This article will walk through the different mechanisms you can use to ensure you have at least one effective documented procedure, hopefully more, to recover from, or even undo, a failed deployment.

Backups

The most basic protection mechanism for your databases involves having a tested backup from which you can restore the database in the event of any kind of error, including failed deployments. Your production systems probably already have backups in place that are run regularly. So, it shouldn’t be an issue to simply add a backup to your deployment process (although, if you are also using differentials, you will need to use COPY_ONLY for this additional backup). For most systems, it is extremely simple to script the backup so that it’s an automated part of the deployment process, so you don’t even have to make this a manual step. Further, in SQL Server, you can additional checks to the backup to help ensure that it’s going to be viable for an immediate recovery if everything goes wrong. These checks should include enabling CHECKSUM on the backup and running a VERIFYONLY restore after the backup completes. Once you’ve completed this, you will be able to ensure, should the deployment fail, that you can restore the database to the point you were at before the deployment.

But, there are several drawbacks to this approach. First, it takes a long time. Even relatively small databases of 200gb can take over half an hour to backup depending on the speed of your disks. The restore will take at least as long. This compounds the difficulties of this approach to rollback. In order to get the data back to the precise point before your deployment went south, you have two choices, neither of which are at all attractive. You will either need to prevent any data being added by removing all connections, usually by disabling the application or by setting the database to restricted user mode, or else you will need to run log backups in addition to a full backup in order to achieve a point-in-time recovery. If you go for the latter choice, then, the restore is going to take at least as long a time as the backup, usually somewhat longer, during which no one has access to the database. This delay in restoring the database increases radically as the size of the database increases: Add to this the complexity of a restore. I know, it’s just a restore statement really, but again, you must get restricted access to the database, you have to have the script prepared in advance ( you certainly don’t want to try to write the script during the deployment process) and you won’t have a chance to test it until you’re already in an emergency. Finally, the suitability of a restore-based strategy to undo a deployment depends on when you realize that you have a problem with the deployment. If you’re looking at the type of failed deployment that I opened this article with, where the errors are immediate, then the restore can work. But, if the errors are discovered after you’ve already started receiving new transactions, then you can only restore the database if you’re willing to lose data, something that is universally frowned on.

While I absolutely recommend taking a backup or making some backup procedures a part of your overall deployment processes, as a rollback or recovery mechanism, it should probably be the last resort. That means you need to look at other possible solutions.

Snapshots

Assuming you’re running SQL Server Enterprise Edition, you have the capability to use snapshot backups. For deployment processes, snapshot backups are a wonder. Instead of taking a long time to create a backup, the snapshot occurs almost instantly. With the snapshot in place, you have the capability of a very fast undo of all your changes. A snapshot marks a moment on the database from which all changed pages are then recorded. The undo process just rolls back those changes to the original pages, a very fast operation. Further, the snapshot gives you a place to compare the changed structures and data in order to validate your deployment.

As with backups, if you have an immediate failure, the snapshot restore can work. But if you’re dealing with a situation where the data has changed in valid transactions, you won’t be able to selectively undo the changes you’ve done, so the snapshot fails in the same way as backups. Finally, you do have to be on the Enterprise Edition of SQL Server which makes this another limiting factor.

If you are on the Enterprise Edition of SQL Server, I strongly suggest using snapshots with your deployments. I much prefer to validate the deployment using the snapshot than to use the fast backup and restore to do so.

Rollback Scripts

Depending on the types of changes being made, rollback scripts are likely to be one of the most difficult parts of the deployment process to create. If we’re adding or modifying code objects such as stored procedures, views or functions, an undo script is pretty easy to generate. There are two basic approaches you can use to do this.

First, use a compare utility or script to generate a script that would put what is currently in production into your staging environment. You do this after you deploy the latest change package to the staging server. You can then save this script on the side. You can even test it by rolling back the changes in your staging environment. Then, if things go badly during the deployment itself, you have this script already in place and ready to run.

The second, and frankly better, option would be to have your database code in source control. As long as you carefully label or branch the source control structures, you’ll always know what’s currently in production and what changes you’re introducing. You can then generate a deployment script from the previous labelled production deployment. Again, keep that script on the side in case you need it during deployment.

But, rollback scripts quickly become problematic when you add changes to the table structures. While it is relatively simple to modify indexes, and it is easy to add new tables, it is likely that you’ll require complex data migration scripts, or even SSIS packages, if you are modifying existing structures . In order to maintain a rollback script, you’ll have to build a reverse copy of your data migration script or SSIS package that does exactly the opposite of whatever script you used to deploy. You’ll need to very carefully test these rollback scripts too. They’ll actually need as much testing as the deployment script itself.

Rollback scripts may also suffer from the same problem as snapshots and backups, preserving those transactions that are committed after deployment starts. If you have changes to the data completed after the deployment has been validated, but you find that what was deployed is somehow problematic, you might be facing data loss during the rollback process. But, this is mitigated somewhat over snapshots and backups since it’s possible that the changes to the structure don’t affect existing data. If no column already in the database was changed when you deployed the latest version of your app, you might be able to run the rollback script, losing new data, but retaining the old data and changes to the old data. This makes rollback scripts somewhat more attractive than just relying on snapshots or backups.

I have used rollback scripts in the past, but to mixed results so I find it difficult to recommend them. It takes a lot of time and effort to create rollback scripts. If you use them, then they can pay off. But if you don’t use them, and you don’t use them three or four times in a row, you pretty quickly wonder why you’re spending all that time setting up the rollback script. I’m frankly not a fan. I would prefer to use a backup or a snapshot to fix immediate deployment errors. For longer term errors, I would prefer to use some other mechanism such as roll forward and reclaim that time and effort for other work.

Branch by Abstraction

Branch by abstraction is a fairly common coding practice, but you don’t hear about it being used within databases, mainly because we never called it by such a fancy name. Branch by abstraction is basically a method of introducing changes to a system, usually large scale changes, by introducing an abstraction layer within the code to deal with the fact that an underlying service is undergoing some substantial changes. The phrase refers back to branching within source control, but in this case, an actual branch is not necessary (although one may be introduced, I don’t want to go too far down the rabbit hole of source control management), but changes are made to the code to allow the abstraction layer to deal with the fact that there are changes occurring, without having to change the front-end code, or, only changing the parts of the front-end code that you want to change.

How does all this relate to databases? There is a built-in abstraction layer within databases that has, to a degree, been relegated to the background of some development processes lately, namely stored procedures. You can use stored procedures, views and in-line functions to provide a mechanism that provides an abstraction layer on top of the underlying architecture, in this case, the actual tables within the database. With stored procedures you can pass an additional parameter such as a version number or a flag and within the procedure, different code paths, usually additional procedures, can be used to satisfy different requests while the application code, both new and old versions, continue to function.

By setting up an abstraction layer, your deployments actually get easier since your changes are usually two-fold, a change to the structure and a change to the abstraction layer in support of that change. You’ll only be introducing changes to code that is ready for changes while leaving existing code alone. In theory this makes more frequent changes possible with less need for rollbacks. And, rollbacks are extremely simple since you only have to change how the code path behaves in order to arrive back at the older functionality, which was left in place.

Like every other approach, there are problems here. First, you must have a very thorough testing regime in place. If older code goes down newer branch paths, you could get lots of errors or, worse, bad data. Introducing the abstraction layer is still a deployment and subject to any number of issues, just like regular deployments. And, because data has to be retained, you still don’t bypass some of the fundamental issues with database deployments. Done incorrectly, instead of a mechanism for smooth deployments, you could be introducing more problems to your system.

I think this is a fairly standard approach within databases, but not one that had a label until now. The main issue you could run into is dealing with code-first, or code-only, approaches taken by some development teams when working with ORM tools. Views and stored procedures work well as an abstraction layer, which is exactly what they’re meant to be, but you may have to sell this to some development teams.

A/B or Blue/Green Deployments

Yet another approach to undoing a deployment is to never have deployed in the first place. What used to be called A/B deployments but now is being referred to mostly as Blue/Green deployments is a fairly simple concept. Have two copies of your database. Deploy to one. Validate that the deployment was good, now switch your application to that database, probably by switching the application from its Blue to Green deployment as well.

Blue/Green deployments provide a very simple and immediate rollback mechanism since the old database and the old code base are never touched in any way. You can simply switch back to the old system immediately. If you also add a mechanism to duplicate all pertinent data changes between the Blue and Green systems, you should be able to do the rollback process after the deployment has been released, even if transactions have occurred that would have precluded a restore, a snapshot restore, or some rollback scripts from being used. Blue/Green deployment is one of the safest and most efficient mechanisms to provide a rollback capability to your deployments.

This isn’t going to work for all systems though. First, you’ll have to have a small enough database that you can have two copies of it online on your database server at the same time. Most businesses and enterprises are not going to want to pay for two copies of a multi-terabyte database. You also have to have a very reliable mechanism to synchronize the data between the databases. You can’t use standard methods such as availability groups, replication or mirroring since the structures between these databases will be different at some point in the process. Instead, you’re going to have to build a data migration mechanism into the application itself, or as a secondary process. You’ll also have to maintain that migration mechanism over time.

I have seen Blue/Green deployments work with small datamarts and reporting systems where we would do the data load to the Blue server while the Green server was online, including structural and code changes. Then the switch entailed a bare minimum of downtime. This worked and worked well. I haven’t used it within a transactional environment, so I’m unsure of all the implications. If you can build a reliable data synchronization mechanism, this should work extremely well.

Roll Forward

Let’s face it, most of the time, deployments will go bad in one of two ways.

The first possibility is that you’re going to run the deployment and it’s going to fail spectacularly. If so , you would need to restore the database or undo the snapshot before you go back to development and testing to determine what went wrong and why.

The other possibility is that you’re going to run the deployment, it’s going to go swimmingly and you’re going to turn on access to the database only to find out, a day or a week later, that you have major issues. At which point, you don’t go for any of the rollback mechanisms. Instead, you prepare another deployment, maybe over a couple of days or maybe over a couple of minutes, but instead of rolling back the breaking changes, you roll forward more changes in order to fix the problems your last deployment introduced.

For many organizations this is the default approach because they’re willing to live with the pain. For some organizations, especially if you’ve adopted an aggressive continuous delivery or continuous deployment mechanism, this is the preferred approach, because it reduces and mitigates pain. Rollback scripts, branch by abstraction and blue/green deployments each require a lot of additional work. If you’ve already spent considerable effort getting an efficient continuous deployment process in place, simply preparing another deployment and getting it out the door may be a much less painful operation.

The real issue with the roll forward approach is that you could be introducing more and more problematic data to your system while you prepare a new deployment. This could exacerbate the problems you’re already experiencing or even lead to new issues. You may even be looking at having your system off-line until you can get that new deployment in place.

This really is a pretty standard solution to the problems of failed or broken deployments. While it can seem like an ad hoc and slipshod approach, if you take the time to marry it into the rest of your process, this approach just embraces the fact that sometimes things are going to go wrong and you’ll need to fix them in place.

Conclusion

Deploying databases has the inherent problem of retaining the data after the deployment. If we could just replace a database the way code gets replaced, deployments would be much easier. But, because this isn’t possible, you will have to have a way of protecting the data in case the deployment fails. But, you really shouldn’t be picking and choosing between these different mechanisms. Instead, you should be looking to combine them. Backups are your most fundamental protection for your database, so they should be included regardless of the other types of processes you set up. Then, you can put together rollback scripts where appropriate, or set up branching by abstraction in the places where it’s going to work for you and even combine both these with a blue/green deployment on systems that support it. All this is done with the knowledge that under some circumstances, you’re just going to have to do another deployment by rolling-forward. Combining these approaches gives you more flexibility and agility while still providing you with good, solid, protection to your businesses data.

DevOps, Continuous Delivery & Database Lifecycle Management
Go to the Simple Talk library to find more articles, or visit www.red-gate.com/solutions for more information on the benefits of extending DevOps practices to SQL Server databases.