Solving the Provisioning Problem in Database Development using Clones
When database development is described, the details often get vague when the data gets beyond spreadsheet-size. There is 'hand-waving' talk of providing databases for each developer, but little detail of how you would provision all the databases that would be needed, at the correct version and with the correct development data, and then keep them all in sync with the source code, as developers commit changes. This article explains the requirements, and how SQL Clone can meet them.
If the database provisioning problem isn’t tackled firmly, database development can become chaotic, and testing can wither in the face of it. Teams can be forced to adapt and compromise their database development methodologies, such as by limiting their use of test-driven development, just to work around the unsolved provisioning difficulties. Demands for disk storage capacity and server hardware will balloon. Developers can also hit the problem of uncertainty about what is, and what is not, in the next release, which leads to the wrong version being tested and bugs reaching production. Database developers have a term for this situation that cannot be repeated in print.
Surely all that developers need is a copy of the database?
Although there are occasional times when a shared development database will suffice, database developers often need to take plenty of liberties with copies of the current database. Any database development task ends up as a series of tests, each of which requires a set-up, to create the database at the new version, including the data, and then, after the tests complete, a tear-down that puts the database back to its original state, both metadata and data. You do this to prove that your code works properly and correctly according to requirements, even when confronted with bad data. By ‘properly’, we are also referring to performance and scale.
If you are extraordinarily lucky, this requires no automation because you’ll only need to run the tests once, to prove that the code and data are correct and proper. For the rest of us, it means a series of failures, corrections, reruns, until at last the code passes all conceivable tests. I recently counted the number of times I’d rebuilt a database, metadata and data, from scratch in the past hour. It was twenty-eight times. Admittedly, it was a series of pesky bugs and performance issues, but this is the reality of getting things done quickly and reliably.
Why be so fussy about testing in development?
After all, the code you are working on and testing probably isn’t even ‘in the build’ yet, but in your development branch. Once it is in ‘main’, any problems will, surely, be picked up by the automated integration tests for the current version? Not necessarily; they may not be picked up until the more expansive testing on a release candidate, with specialist testers involved, and that means that the release candidate will be rejected.
Nothing slows down delivery as much as a failed test in a release. You must play the game as if there were no systematic testing, which is sometimes a scary reality. Also, the testers may not be able to do the more specialized tests at all: for example, I’ve never met anyone other than a professional database developer who understands how to test a table properly, including checking all the required constraints and ensuring that no bad data can get into the table.
Typically, before merging new functionality into ‘main’ to add a new feature to a database, you, as a developer, will probably start by unit-testing the individual components, such as tables, views or functions, and then increase your reach to bring in other components in order to test processes and their integration. The tests you devise for new work are then added to the routine automated tests that are run on the build.
To do all this requires a set-up. You need a copy of the current build, including data, and a script that changes it to the current state of the branch that you need to test. This ensures that the change script is tested, and that the database is at the correct state. Then, you’ll apply your test harness, to allow timing, warnings and reporting, and run the tests. Upon completion, you’ll tear down the tests so that the database is back at the state it was in before the test ran. All this must be quick to do and use as few resources as possible, because you’ll sometimes be repeating tests every couple of minutes.
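To make the cycle concrete, here is a minimal PowerShell sketch of that set-up, test and tear-down loop, assuming the SqlServer module for Invoke-Sqlcmd. The server, database and script paths are invented for the example, and your own harness will obviously be more elaborate.

```powershell
# A minimal sketch of the set-up / test / tear-down cycle.
# All server, database and file names here are hypothetical.
Import-Module SqlServer

$server       = 'MyDevServer'                               # development instance
$database     = 'Accounts_Test'                             # throwaway copy for this test run
$buildBackup  = 'C:\Builds\Accounts_CurrentBuild.bak'       # backup of the current build, with data
$branchScript = 'C:\Source\Migrations\CurrentBranch.sql'    # change script for the branch under test
$testScript   = 'C:\Tests\BranchTests.sql'                  # the tests themselves

# Set-up: restore the current build and bring it to the branch version
# (WITH MOVE clauses omitted for brevity; add them if the file paths differ)
Invoke-Sqlcmd -ServerInstance $server -Query "RESTORE DATABASE [$database] FROM DISK = N'$buildBackup' WITH REPLACE;"
Invoke-Sqlcmd -ServerInstance $server -Database $database -InputFile $branchScript

# Run the tests, with a crude timing harness
$elapsed = Measure-Command {
    Invoke-Sqlcmd -ServerInstance $server -Database $database -InputFile $testScript -Verbose
}
Write-Host "Tests completed in $($elapsed.TotalSeconds) seconds"

# Tear-down: drop the copy so that the next run starts from a known state
Invoke-Sqlcmd -ServerInstance $server -Query "ALTER DATABASE [$database] SET SINGLE_USER WITH ROLLBACK IMMEDIATE; DROP DATABASE [$database];"
```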
Towards a rational database development environment
The system adopted for the development of a database mustn’t be tedious or restrictive. On the contrary, it should provide advantages for the developer and cut out repetitive or tiresome rituals. The most important advantage is to provide each developer with access to copies of the following databases:
- Current build – the current version of the database as represented by the main branch of source control
- Development version – their “work in progress” version
- Released version – what’s in production (and staging)
The released version is the development copy of the version that was delivered to production, hopefully via staging. It is unlikely to have real data but, for the purposes of testing, it is important that the data is ‘controlled’ and any masking or data generation is consistent between releases. This allows integration tests to have a standard data set.
Production could easily have different access control, or additional features such as replication or encryption, but this isn’t part of the business of database development and therefore shouldn’t be in the development copy of the release.
Versioning a database
In versioning my database projects, I like to use an adaptation of Tom Preston-Werner’s ‘Semantic Versioning’ specification. In database terms, this boils down to:
<revision>.<release>.<version><-branch>.
There seems to be little consensus amongst the developers I know, and one would want to fit in with the standards of the organization; I use this merely as an example. Version numbers should also be applied to the database in such a way that a migration script will only work between the versions for which it was designed.
In The Database Development Stage, I show one way to apply a version to a database, via a Database_Info extended property.
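By way of illustration, here is a rough sketch of the idea: the build stamps the database with a JSON Database_Info extended property, and each migration script checks it before doing anything. The server name, version numbers and JSON layout are invented for the example.

```powershell
# A sketch of stamping a database with its version via a 'Database_Info' extended
# property, and of a migration script that refuses to run against the wrong version.
# Server, database, version numbers and JSON layout are all hypothetical.
Import-Module SqlServer
$server   = 'MyDevServer'
$database = 'Accounts'

# Applied by the build: record the version that the build script produced
Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @'
IF EXISTS (SELECT * FROM sys.extended_properties WHERE class = 0 AND name = N'Database_Info')
  EXEC sys.sp_updateextendedproperty @name = N'Database_Info',
       @value = N'{"Version":"1.4.0","Branch":"main"}';
ELSE
  EXEC sys.sp_addextendedproperty @name = N'Database_Info',
       @value = N'{"Version":"1.4.0","Branch":"main"}';
'@

# At the head of a migration script: only run between the versions it was designed for
Invoke-Sqlcmd -ServerInstance $server -Database $database -Query @'
DECLARE @info NVARCHAR(3750) =
  (SELECT CONVERT(NVARCHAR(3750), value)
     FROM sys.extended_properties WHERE class = 0 AND name = N'Database_Info');
IF ISNULL(JSON_VALUE(@info, '$.Version'), '') <> '1.4.0'
  RAISERROR ('This migration script expects version 1.4.0 of the database', 16, 1);
'@
```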
The Build
The best way of ensuring that the version of the database you are working on represents the code that has been checked in is to build it regularly from the source code, and stock it with data by a bulk process, preferably as an entirely scripted process. There are several ways of doing this, depending on how you do your scripting: the easiest to manage is a single build script. A simple database can also be created from the checked-in migration scripts, applied in the correct sequence, but the only advantage of this is to avoid the bulk import of data.
If you store scripts as object-level files, you must get the order of the scripts right when running a build. To deal with this type of problem, you can use SQL Compare to compare the current source code with an existing development copy of the database and generate a one-way synchronization, or deployment, script to bring the development copy of the database to the correct version.
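For example, with the command-line version of SQL Compare, a build step of this kind might look like the following sketch. The paths, server and database names are invented, and the switch names are as I recall them, so check them against the documentation for your version of the tool.

```powershell
# A sketch of a build step that uses command-line SQL Compare to generate, and run,
# a deployment script bringing a development copy up to the version in source control.
# Paths, servers and databases are hypothetical; verify the switches for your version.
$sqlCompare = 'C:\Program Files (x86)\Red Gate\SQL Compare 14\sqlcompare.exe'

$arguments = @(
    '/Scripts1:C:\Source\Accounts\Schema'           # the object-level source directory ('source')
    '/Server2:MyDevServer'                          # the development copy to update ('target')
    '/Database2:Accounts_Build'
    '/ScriptFile:C:\Builds\Accounts_Deploy.sql'     # save the generated deployment script
    '/Synchronize'                                  # ...and run it against the target
)
& $sqlCompare @arguments

# Exit code 0 is success; 63 is reported when no differences are found (as I recall)
if ($LASTEXITCODE -ne 0 -and $LASTEXITCODE -ne 63) {
    throw "SQL Compare finished with exit code $LASTEXITCODE"
}
```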
Supporting Branches
This current build will be used for development until the branches that together constitute the changes in the next version are merged into ‘main’ and the next build is created. As discussed earlier, each developer needs access to the current build, the current release, and the database representing the current branch on which he or she is working. These three databases allow the generation of all the necessary migration scripts. By comparing the current build with the branch, all differences can be shown, and a migration script saved that takes the database from vCurrentBuild to vCurrentBuild-tag (where tag refers to the Git tag that denotes the branch).
By comparing the current build (as ‘source’) with the current release (as ‘target’), a migration script can be generated that will take the current release to the metadata of the current build.
The practicalities of provisioning
Having delivered a new build, filled with the required data, the team can then create the databases for any branches simply by copying it. The standard way to do this is to restore a backup. You can also detach and attach (unmount and mount) a database. You can use a container image for your build and produce copies of that or use a VM.
However, I use SQL Clone for doing this, because it requires the least scripting, effort, and resources. SQL Clone was the invention of a Redgate developer who, while working on a tool that had to be tested on several different databases at different SQL Server versions, was simply frustrated by the awkwardness of the traditional approaches to copying databases.
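As a rough sketch of what this looks like with the SQL Clone PowerShell cmdlets, the following creates an image of the new build and then a clone from it. The server URL, machine, instance and database names are all invented, and the parameter names are as I remember them, so check them against the SQL Clone documentation.

```powershell
# A sketch: one image per build, made from the freshly built, data-filled database,
# and then a clone from that image. All URLs, machines and names are hypothetical.
Connect-SqlClone -ServerUrl 'http://sqlclone.example.com:14145'

$buildInstance = Get-SqlCloneSqlServerInstance -MachineName 'BuildServer' -InstanceName 'SQL2019'
$devInstance   = Get-SqlCloneSqlServerInstance -MachineName 'DevServer'   -InstanceName 'SQL2019'
$imageStore    = Get-SqlCloneImageLocation -Path '\\fileserver\SQLCloneImages'

# Create the image of the current build
$imageOperation = New-SqlCloneImage -Name 'Accounts_v1.4.0' -SqlServerInstance $buildInstance `
    -DatabaseName 'Accounts_Build' -Destination $imageStore
Wait-SqlCloneOperation -Operation $imageOperation

# Any number of clones, each only a few tens of MB, can now come from that one image
$image = Get-SqlCloneImage -Name 'Accounts_v1.4.0'
New-SqlClone -Name 'Accounts_Dev_Phil' -Location $devInstance -Image $image |
    Wait-SqlCloneOperation
```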
The potential provisioning nightmare
The number of databases required for development can balloon quickly. Take a realistic case: we have a team of six developers working on an accounting application with a strong database component. Although much of the metadata work can be done on a shared database, this becomes difficult for any integration testing because data will change. Unless the work is carefully allocated, there is also the risk of two developers working on the same objects, thereby making subsequent merges tricky. Each developer, in this case, needs to work on a branch, and so this team needs a copy of the current version of the database for each active branch.
As an added complication, the software will be used internationally in five different legislative areas. Suddenly, this means we need not six but 30 copies of the database. Also, every integration test must be performed on each of the five variants. Under demands like this, even container technology can creak, but with SQL Clone there is still only one image. Your network traffic might go up when you run automated integration tests on all the variants, but not the storage requirement.
Self-service with the SQL Clone canteen system
When SQL Clone is set up properly, it allows self-service. Developers can decide when they want a clone, and SQL Clone can apply any migration scripts required to bring the cloned database to the state they need.
Once the build of a new version succeeds, you can produce clones for the automated integration tests, making them easy to run in parallel. If these tests pass, you create the development clones for each branch.
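For instance, creating a set of clones for parallel test agents is just a loop over the one image; here is a sketch, with hypothetical machine and image names.

```powershell
# A sketch: one clone of the new build per test agent, so the automated integration
# tests can run in parallel. Machine, instance and image names are hypothetical.
Connect-SqlClone -ServerUrl 'http://sqlclone.example.com:14145'
$image = Get-SqlCloneImage -Name 'Accounts_v1.4.0'

foreach ($agent in 'TestAgent01', 'TestAgent02', 'TestAgent03')
{
    $instance = Get-SqlCloneSqlServerInstance -MachineName $agent -InstanceName 'SQL2019'
    New-SqlClone -Name "Accounts_IntegrationTest_$agent" -Location $instance -Image $image |
        Wait-SqlCloneOperation
}
```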
A developer working on a branch will make changes to their local clone, commit changes to that branch in source control, such as by using SQL Compare, and save the migration script to a script folder. To start a test on a branch, each developer needs a clone of the current build of the database, with the branch modifications added to it, including all the new objects, modifications to existing objects, and any necessary data migrations or additions for that branch.
With small databases, you can run a full build of the current version, plus the bulk data load, every time, since it takes only a few seconds; you then apply your migration script to bring it to the branch version and run the tests.
For bigger databases, doing this, or restoring a backup of the current build, will start to take too long, so instead you can simply reset the clone to the current build and run the required migration script, either using a PowerShell script or through the UI. If you have team members who are less confident with SQL scripts, they can create clones with the modifications already applied to them, using a SQL Clone template.
If the template is set up to run a file at a specific location, usually the migration script within the source control directory, this can all happen automatically, at the click of a mouse.
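A sketch of that ‘reset and re-migrate’ cycle in PowerShell, with invented clone, server and file names, and cmdlet parameters as I recall them:

```powershell
# A sketch: reset a branch clone back to the current build (the state of its image),
# then apply the branch's checked-in migration script. All names are hypothetical.
Connect-SqlClone -ServerUrl 'http://sqlclone.example.com:14145'

$clone = Get-SqlClone -Name 'Accounts_Dev_Phil'
Reset-SqlClone -Clone $clone | Wait-SqlCloneOperation      # back to the current build, in seconds

# Bring the clone to the branch version with the checked-in migration script
Invoke-Sqlcmd -ServerInstance 'DevServer\SQL2019' -Database 'Accounts_Dev_Phil' `
    -InputFile 'C:\Source\Migrations\CurrentBranch.sql'
```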
Tackling database variants and branches with SQL Clone
This system can be extended easily to accommodate cases where any version of the database has several variants depending, for example, on the tax regime and legislative area. This would typically require different manifestations of the routines (procedures or functions) to provide the different calculations or manipulations needed in each variant. An EU-based accounting system would have to comply with EU law, for example.
Each variant is represented merely by a migration script that is applied to the current version. This allows a clone to be quickly created for each variant, from the current build, or allows existing clones to be updated. This would require the migration scripts to be updated and checked in, whenever necessary, before the build.
SQL Clone can do all this via the user interface, but when PowerShell scripting is added, then much of the configuration work can be automated, and the creation of all the variants triggered by the build.
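A sketch of that kind of automation, triggered at the end of a successful build, with invented variant names and paths:

```powershell
# A sketch: one clone per legislative variant, all from the single build image, each
# brought to its variant by a checked-in migration script. Names and paths are hypothetical.
Connect-SqlClone -ServerUrl 'http://sqlclone.example.com:14145'
$image    = Get-SqlCloneImage -Name 'Accounts_v1.4.0'
$instance = Get-SqlCloneSqlServerInstance -MachineName 'DevServer' -InstanceName 'SQL2019'

foreach ($variant in 'UK', 'EU', 'US', 'AU', 'CA')
{
    $cloneName = "Accounts_v1.4.0_$variant"
    New-SqlClone -Name $cloneName -Location $instance -Image $image | Wait-SqlCloneOperation
    Invoke-Sqlcmd -ServerInstance 'DevServer\SQL2019' -Database $cloneName `
        -InputFile "C:\Source\Variants\$variant.sql"
}
```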
Using clones to bring branch databases to their current state requires the developer to make sure that the migration script is kept up to date, but this is usually no more than saving the relevant query window. It is even possible to use command-line SQL Compare to create the migration scripts for the branches, as part of an automated process for checking changes to the branch into source control, so that the differences between the branch database and the current version of the database are preserved continuously as change scripts.
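A sketch of that automated step, again with invented names and with the switch names as I recall them: the branch clone is the ‘source’, the current build is the ‘target’, and the saved script takes the build to the branch state.

```powershell
# A sketch: preserve the branch's changes as a migration script with command-line
# SQL Compare. Servers, databases and paths are hypothetical; verify the switches.
$sqlCompare = 'C:\Program Files (x86)\Red Gate\SQL Compare 14\sqlcompare.exe'

$arguments = @(
    '/Server1:DevServer\SQL2019'
    '/Database1:Accounts_Dev_Phil'                          # the branch clone, with the new work
    '/Server2:BuildServer\SQL2019'
    '/Database2:Accounts_Build'                             # the current build
    '/ScriptFile:C:\Source\Migrations\CurrentBranch.sql'    # the migration script for the branch
    '/Force'                                                # overwrite the previous version of it
)
& $sqlCompare @arguments
```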
Conclusions
Database development techniques represent the art of the possible rather than the science of resilience. I’ve yet to meet a database developer who is confident about the resilience of what they create, because it is so difficult to run all the required tests. The problem lies in spinning up a database quickly, at the right version, with the right data in it. A hardened database tester would weep with joy if given a provisioning system that provided a rapid, reliable, and efficient set-up phase for a test.
I’m old enough to have tried every method of doing so, and I’m happy to advise that, for small databases, it is most practical to use a single build script with a subsequent bulk-insert script. However, for any database that takes more than thirty seconds to build this way, I’d use SQL Clone. The ‘killer’ reason is that it scales: any version of a database needs only one image for each dataset with which you develop. This means that you can place databases liberally on servers with only moderate local storage requirements.