
11 September 2019

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.


Handling Tricky Data Migrations during State-based Database Deployments

This article demonstrates how to use a 'state' approach to database source control, when the nature of the database changes cause you to hit difficulties with migrating existing data. These difficulties happen when the differences are such that it is impossible for any automated script to make the changes whilst preserving existing data.


In the classic state-based approach to database development, the source control directory contains, in one or more files, the SQL DDL statements that can build the database, if they are executed in the correct order. These DDL files can be created in several ways, but the current version of the source code must contain everything required to build the database.

Frequently, you don’t want to do a build; instead, you need to update all the existing development databases to reflect the current build version, or the relevant development branch of it. To do this, we need a ‘database synchronization’ tool such as Redgate’s SQL Compare or SQL Change Automation (SCA), or Microsoft’s data-tier application (DACPAC) tooling. Such a tool works by creating a ‘model’ object of the database from the source and comparing it to the target database. From the resulting map of all the differences in metadata between the source code and the target database, the tool generates a script that will, if executed, alter the target so that its metadata conforms with the source.

This saves a lot of time, because even apparently minor changes can take real work in SQL. You’ll recognize the dreaded message from the SSMS table-design tool:

‘Saving changes is not permitted. The changes you have made require the following tables to be dropped and re-created. You have either made changes to a table that can’t be re-created or enabled the option Prevent saving changes that require the table to be recreated.’

There are a host of changes that a good synchronization tool can make without having to recreate the table, such as explicit changes to datatypes, or applying the default value to existing NULL values when you make a column NOT NULL. It can generally do all this under the covers while preserving the data. If you ever do it by hand, you soon realize the time and tedious effort that a good sync tool saves.
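As a sketch of the kind of script such a tool generates for the NOT NULL case (the column and the default value here are purely illustrative, not from the article):

```sql
-- Widen the column first (an explicit datatype change)...
ALTER TABLE dbo.Titles
  ALTER COLUMN notes NVARCHAR(400) NULL;

-- ...then back-fill existing NULLs with the intended default...
UPDATE dbo.Titles
   SET notes = N'none'
 WHERE notes IS NULL;

-- ...and only then apply the NOT NULL constraint safely.
ALTER TABLE dbo.Titles
  ALTER COLUMN notes NVARCHAR(400) NOT NULL;
```

Three short steps, but multiply that by every changed column and the tedium of hand-writing them becomes obvious.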

Where problems lie

In my experience, things rarely go wrong with a database build or update done using synchronization. When they do, it is generally sorted out by tweaking the synchronization options. The difficulties come if the target contains data and there is more than one possible way of migrating the existing data from an old table design to a new one. What goes where?

Extensive renaming of columns can sometimes cause any sync tool to shrug and give up, when there is no single obvious way of working out what happened. A table split is renowned for causing problems for a deployment that updates an existing database, because it is difficult for the tool to divine where all the data should move. It won’t necessarily cause an error, unless there is a danger that you’ll lose data in the process; otherwise, the migration just won’t get done. Similarly, a conversion of the data from one datatype to another that isn’t implicit, or some other large-scale re-engineering of the tables, can cause grief.

At some point, the changes are such that you can no longer update the target database purely by automatic synchronization or by using table ALTER statements. In fact, automatic synchronization that preserves the existing data is never fully guaranteed, and the less similar the source and target, the less successful it is likely to be. Boyce, Codd and Date would surely shake their heads sadly at the idea.

Instead, you need to ‘flag’ those schema changes that might affect existing data in unpredictable ways, and for those cases provide the automated build system with some help, in the form of a manually written script to deal with the complexities of determining where and how the existing data should go in the new arrangement. These scripts will need to be executed before running the synchronization. You can include in your source an entire narrative of changes to describe how to migrate the data, but if you then simply let the synchronization tool compare that source to the target and generate a script, any extra executable code in the source that isn’t part of an object definition won’t be included.

Solving the problem

Let’s take an example. We’ve created a version of pubs with a respectable amount of data. Full instructions on doing so are here: Generating Data for Database Tests.

We now decide to do some re-engineering. We’ll start by converting the archaic datatypes into types that reflect current best practice. We’ll also try to remove some nullable columns. However, we then spot that at some point, an unnamed and ill-advised developer decided to introduce a rudimentary ‘tag’ system, by adding a comma-delimited list of publication types as an extra column, TypeList, in the Titles table. The full details are here: Database Migration Scripts: Getting from place A to place B.

Searches on this column are taking far too long and it is doomed, so we decide to split the tags out into a many-to-many arrangement. We then stock the two new tables, TagTitle and TagName, appropriately with the data from the TypeList column, before tidying up by removing the TypeList column.

We need to write a migration script to create the two new tables and migrate the data into them from the TypeList column. A subsequent synchronization will apply all other required changes (and can now safely drop the TypeList column, which no longer exists in the source).

The migration script will need to use guard clauses to ensure that it can be run more than once without causing damage. Before making any change, each guard clause first checks to make sure it is necessary and does nothing otherwise. This technique, where we don’t assume anything about the state of the target and instead check with guard clauses before executing any DDL statement, is explained in Idempotent DDL Scripts That Always Achieve The Same Result – Making Changes Only Once: Stairway to Exploring Database Metadata Level. It is necessary because we must assume that the script will somehow get executed in all sorts of inappropriate times and places.
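A minimal sketch of the guard-clause style looks like this; the column definitions are my guesses at the schema, not the article’s actual script:

```sql
-- Create the TagName table only if it doesn't already exist,
-- so the script can be re-run harmlessly on any target.
IF NOT EXISTS (SELECT * FROM sys.tables
                WHERE name = 'TagName'
                  AND schema_id = SCHEMA_ID('dbo'))
  CREATE TABLE dbo.TagName
    (TagName_ID INT IDENTITY PRIMARY KEY,
     Tag VARCHAR(20) NOT NULL UNIQUE);

-- Migrate the data only if the old column is still there
-- and the new table hasn't already been stocked.
IF COL_LENGTH('dbo.Titles', 'TypeList') IS NOT NULL
   AND NOT EXISTS (SELECT * FROM dbo.TagName)
  BEGIN
    -- ...shred TypeList and fill TagName and TagTitle here...
    PRINT 'migrating TypeList data';
  END;
```

Each DDL or DML statement is wrapped in a check of the metadata or the data, so running the script twice does nothing the second time.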

Here is our code. The XML is used purely to shred the TypeList into a relational table. The String_Split() function, available from SQL Server 2016 onwards, now does the same thing and makes the code a lot less impressive, but I kept the XML version so that everyone could play along.
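The article’s full script isn’t reproduced here, but the XML-shredding trick at its heart looks something like this sketch (the tag length, and the assumption that tags contain no XML-special characters, are mine):

```sql
-- Shred the comma-delimited TypeList into one row per tag by turning
-- 'a,b,c' into '<t>a</t><t>b</t><t>c</t>' and querying the XML nodes.
SELECT t.title_id,
       LTRIM(x.tag.value('.', 'VARCHAR(20)')) AS Tag
  FROM dbo.Titles AS t
  CROSS APPLY
       (SELECT CAST('<t>' + REPLACE(t.TypeList, ',', '</t><t>')
                    + '</t>' AS XML)) AS shredded(taglist)
  CROSS APPLY shredded.taglist.nodes('/t') AS x(tag)
 WHERE t.TypeList IS NOT NULL;
```

On SQL Server 2016 or later, `CROSS APPLY STRING_SPLIT(t.TypeList, ',')` does the same job in a single, far less impressive, line.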

Automating synchronization-based deployments that require data migration

We need to incorporate this sort of migration script into an automated deployment process, which uses synchronization. SQL Compare always does deployment using synchronization, and SQL Change Automation does too in cases such as this, where the source is simply a set of object-level build scripts (or even a single build script).

Some migration scripts can be run after the synchronization, if the synchronization runs error-free. This is easy to manage because we simply place the migration script in a folder within the scripts directory called \Custom Scripts\Post-Deployment. SQL Compare or SCA will simply attach it to the end of the synchronization script automatically.

In this case, though, we need the migration script to take care of the awkward preliminaries before the synchronization process runs. We can’t do that by running it as a pre-deployment script either, by putting it in a scripts directory folder called \Custom Scripts\Pre-Deployment. If we do, Compare or SCA will still do the comparison first, and then simply prepend the migration script to the generated synchronization script.

Instead, we need to execute it on the target, as a separate step, before running the synchronization. This prepares the target so that the subsequent comparison and synchronization will work without a hitch. It doesn’t subvert the work of the synchronization at all.

If your guard clauses work well, the script can stay there over subsequent deployments, since none of its statements will be executed. It could cause havoc, though, if you wanted to resurrect a previous build, so you might need a check on the version number. I usually specify, in the filename of the migration script, the versions for which it is appropriate.

…With SQL Compare

If you are using SQL Compare, all you need to do is:

  1. Execute the migration script on the target.
  2. Run a SQL Compare synchronization, with the source code files as the source and the database as the target, as usual.

…With SQL Change Automation

Here, you’ll want the whole process scripted. Again, don’t try using the migration script as a pre- or post-deployment script, as it won’t work, and don’t put it in your scripts directory either, because SCA will then attempt to include it in the build, with unpleasant results. Storing it in a parallel directory will be fine.

Doing a rollback

What if something goes wrong and we need to roll back to the original database version; in other words, to deploy with the original version (where the TypeList column contains a list of tags) as the source? We’ll need to run a ‘reverse migration’ to safely migrate the data back into TypeList, and we must run it before any synchronization, because it relies on the data in the tables that the synchronization will remove.

In restoring the list of types, or tags, in the Titles table, we will end up with duplicates removed and the tags in alphabetic order, so the data can’t be said to have been exactly restored (the original data generation wasn’t perfect, in that it produced duplicates and the tags weren’t in alphabetic order).
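The reverse migration can be sketched like this; it assumes SQL Server 2017 or later for STRING_AGG (on earlier versions, the FOR XML PATH trick does the same job), and the column names are my guesses at the schema:

```sql
-- Rebuild the comma-delimited TypeList from the tag tables.
-- Tags come back de-duplicated (TagName holds each tag once)
-- and in alphabetic order, as noted above.
IF COL_LENGTH('dbo.Titles', 'TypeList') IS NOT NULL
  UPDATE t
     SET TypeList = tags.list
    FROM dbo.Titles AS t
    JOIN (SELECT tt.title_id,
                 STRING_AGG(tn.Tag, ',')
                   WITHIN GROUP (ORDER BY tn.Tag) AS list
            FROM dbo.TagTitle AS tt
            JOIN dbo.TagName  AS tn
              ON tn.TagName_ID = tt.TagName_ID
           GROUP BY tt.title_id) AS tags
      ON tags.title_id = t.title_id;
```

The same guard clause on TypeList makes this script, too, safe to re-run: it only does anything while the old column exists.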

I would keep this rollback script in source control but well away from the source code. I assume you are using a database versioning mechanism. If you are using SCA, there will be a version number stored in the extended property of the database with the name ‘DeploymentManager Deployed Package Version’. It stores the PackageVersion that you assign to the source, when you use it with the New-DatabaseBuildArtifact cmdlet, to create the build artefact.
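Reading that version back from the target is a one-line metadata query (the property name is as given above; the cast length is an assumption):

```sql
-- Read the package version that SCA recorded on the target database.
SELECT CAST(value AS NVARCHAR(128)) AS DeployedVersion
  FROM sys.extended_properties
 WHERE class = 0   -- a database-level extended property
   AND name = 'DeploymentManager Deployed Package Version';
```

With that value in hand, a deployment script can decide whether a migration or reverse migration is appropriate before it touches anything.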

In PowerShell, you can implement the logic: ‘execute this particular reverse migration before the synchronization, only when the source version is lower than the version at which the table redesign happened, and the target is at the same or a higher version’. As you may well have a different way of managing versioning, I won’t spell out the code.

Conclusion

The problem of data migration only becomes apparent when database deployments need to preserve the existing data in the target databases. If you have a good synchronization tool that can cope with all the common problems that have only a single possible data migration path, you will rarely need an extra migration script. If you do, then it is crucially important to develop and test it while the DDL code changes that cause the problem are being made and checked in.

The migration of data is the most important aspect of releasing a new version of an existing database to production. Therefore, to avoid unpleasant surprises at release time, the development team needs to get into practice, by regularly doing automated synchronization from source to a target with plenty of data in it, to iron out any such problems as early as possible.

You might think that complex logic would be required to know when to apply the migration script, but in fact a script can follow the logic easily, if the versions of the source and target database are known. The migration is only required when the source database is at the same or a higher version than the table redesign, and the target isn’t. The ‘reverse migration’, for a safe rollback, is required only when the source version is lower than the version at which the table redesign happened, and the target is at the same or a higher version.

When the deployment is scripted, the logic is easy to insert. Because the migration script can be re-run without effect, one can afford to be relaxed about re-running it unnecessarily, but one wouldn’t want to execute the wrong one!
