Reverse-Engineering the Production Database into Source Control
Starting out a new database development with source control is relatively easy. To introduce source control into an existing database application can be more challenging. Phil Factor explains some of the fundamental steps.
Imagine that you are the bright new development team leader, and you need to get one or more databases into source control. You need to move development practices from their present state to a more rational and systematic one.
When you take over a lead development role, you may be drawn to one side and told that it isn’t possible to create the production system from the SQL source code in source control. You’ll be told that it doesn’t really matter because you are only amending the database when you deploy a release. When you hear that, just give a cynical smile. You can be pretty clear what your first priority is: getting the database under control.
Everything in a SQL Server database can be scripted out so you are almost certain to be able to generate the source code of the databases by reverse engineering it from what is running in production. It will be without most of the comments, and it may be written for the better understanding of a machine rather than a human, but it is there, if necessary. You will still have work to do because a real database will have a whole lot of other scripting that needs curation if we can get to the goal of pain-free rapid releasing.
Why would you want database Source Control?
There are several reasons for needing to source control a database, in the sense of the term that includes everything needed to build the production system.
It is an asset. If you are working for an organization, your database code is a corporate asset and needs to be looked after. All code must be clearly visible, and observable, to anyone in a responsible role within the organization, such as an auditor, security expert, administrator, or compliance officer must be able to check the code.
It is essential for co-workers: This may be necessary for debugging, forensic work, or for checking compliance with the relevant legislation. It is important for organizational safeguards.
It is a legal requirement. If anything goes seriously wrong, and causes injury, damage or financial loss, there will be an investigation. It should be possible, for example, to re-create a software product at any released version. If legal action is initiated, and code is produced as evidence, its provenance should be unquestionable: When it was created and by whom, for example.
It makes rapid releases possible: It is a prerequisite for efficient, error-free and rapid releasing of a database system to production because it eliminates any doubt about what is included in a release throughout testing, acceptance, training and staging.
It makes team processes easier: Code reviews, bug tracking, integration testing, and many other team processes are made easier if everyone shares a common understanding of what is in a development project at any point, and who did what. It allows easier branches and variants.
You’re likely to hit a few problems, such as…
The source code is patchy
Sometimes the source code is very patchy, for various common reasons:
- Developer Dave, who squirrels away his source code in an arcane directory structure that nobody else understands, on some network fileserver.
- Developer Ben who was trained to develop databases from a single build script and amended individual tables or routines by highlighting them and hitting ‘F5’. A couple of times, he forgot to highlight the code he wanted to modify, and the development database suddenly lost its development data and reverted to a previous version. Because he uses an entire build script, he still has not figured out how to get his changes saved in a team-based source control system.
- The whiz-kid consultant who drifted in for a few days and laughed at our code and went on to fix several indexes. Heaven only knows where the source code for them went, once he discovered that we had no source control in place
- The Agency staff who seldom bother with source code at all
Automation scripts are inconsistent, incomplete, or missing
Various attempts at automation of the various processes such as unit testing, build, integration testing or deployment have been attempted over the years. They exist as DOS scripts, some PowerShell scripts and there were some Python scripts that were once attempted but subsequently mostly abandoned when the Python hotshot left the company. They are in several directories somewhere in the network where these scripts are stored, but heaven only knows where.
The production system has a lot that is not part of the development. Leaving to one side all the SQL Agent scheduled processes, there is a replication system in place. If you try making the replication code part of the development, then it is likely to break the replication during the deployment of the next release. There is, of course, the access-control system with all the permissions. There are the extra distribution statistics added to try to improve performance. There is a SQL Audit system in place too, and a rather arcane system using server triggers and Service Broker for auditing purposes. Then there are procedures for a monitoring system that Ops use for all network devices. Then there is all the encryption stuff that the security team insisted was necessary to comply with GDPR, as well as the dynamic data masking.
The production system has any number of scripts, mostly in DOS batch language, that do much of the routine tasks such as backups and maintenance. There is an intimidating undergrowth of SSIS tasks that are used for exporting reports and schedule ETL jobs.
What should be included in Source Control?
In our fictitious example, there seems to be no clear shared understanding of where responsibilities lie. Even if your organization is committed to DevOps practices, you need this clarity, because everything in the production server should be include in source control.
However, some of the contents are the responsibility of the developers and should be in development source control, and some will be the responsibility of DBAs and Ops and stored in their Configuration Management Systems (CMS). There may also be some source code that is created separately by a third-party agency such as an application provider
All the SQL DDL statements for all the databases used by the application should be in the development source as should all the scripts (e.g. PowerShell, DOS) used to automate any development, test, or deployment process. You’ll also need all the data required for the database to function, such as enumerations. I store the SQL Data Generator file that was used to generate the development and test data.
In source control too, but probably separately, in a CMS, should be all the DCL statements that make up the access-control system and all the scheduled jobs that are required for the database. The CMS will also contain all the scripting for the production-only code mentioned above.
What shouldn’t go into Source control?
I like to store generated development datasets separately as these are bulky. I reckon that reports that are generated from the database, such as the reports of SQL Doc, or generated from PowerShell scripts, don’t need to be there, but should be stored separately. No personal data should, of course, ever be stored in source control.
Where to start?
The first thing to do is to attempt to get access to the database that would have been delivered to staging. That would, in an ideal world, consist of just the development database, before the production configuration scripts such as encryption, replication or audit were added to it. Having this will save you a lot of work, but don’t count on it being available.
If you can’t get a clean copy of the development database, it just takes longer. As a developer, you’ll need to ‘de-productionize’ the production server so that you have a source that has just the database objects for which you have responsibility, and which are in the correct state for delivery into staging. In staging, each release is then either loaded with the current production data and modified with the supplementary features that make up the production environment, or there is a scripted migration or ‘one-way synchronization’ of the differences between the database with the existing version. Whatever approach you use, the objective is to effectively combine the two whilst preserving the existing data.
Once you have a clean working script of the current release with all the database objects that you are responsible for, and we go into the details of how to do this in the next section, you can introduce source control retrospectively into database development work.
‘De-productionizing’ a production system for development work using SQL Compare
SQL Compare can be used to create a ‘scripts folder’, which contains the object-level source. First, you can use the Github application to create the base directory for the project and then use SQL Compare to create the scripts folder for you, but if you create it outside SQL Compare, you must ensure that there isn’t anything else in it. You are comparing your production as ‘source’ and the scripts folder as the ‘target’.
You will need to become very familiar with SQL Compare’s options. SQL Compare has a rich undergrowth of options that enable you to ignore things. Things? Well I’ve documented them all in this glorious article ‘Exploring the SQL Compare Options‘. It is things that look very much as if they should be in a CMS rather than part of development: Things like encryption, partition schemes and functions, event notifications, Dynamic Data masking, data compression, authorization, sensitivity classifications, role memberships and permissions. No database developer wants to be involved in this and by having them in source control it can cut across what the production DBA wants to do. If, for example, the DBA changes the sensitivity classification and you’ve saved the Extended properties that contain them, they will, at the next release, over-write all the meticulous changes that the DBA has done in the meantime.
When you ‘ignore’ a SQL Server construct, and then request SQL Compare to generate an object-level source directory, it will not include any scripting for that construct. It is as simple as that: you’ve filtered out authorization, partition scheme or whatever you choose. If, however, you are using SQL Compare to do a deployment, all that ‘ignoring’ does is to exclude that class of construct from the comparison so that they aren’t updated on the target.
I wish I could say that the process of extracting a clean development image is easy. It must be checked, and it is possible to misunderstand what an option does. It is best to allocate plenty of time for team discussion and experimentation. You will be curiously familiar with SQL Compare and the more arcane options by the time you’ve finished that well-trodden path. Once it is done, however, you will have an object-level source that can be placed safely in source-control. Be certain to save all the SQL Compare options to source control!
The great bonus of having done this work is that you can create a source for all the extras that are the responsibility of operations. If you compare the result of your endeavors with the backup of the production system, ignoring nothing at all with the options, you will have the source of everything you’ve excluded. You’ve clarified the boundaries, even possibly at the expense of flushed faces and muffled cries amongst the DevOps team. However, future deployments will go better once everyone gets the idea of mapping responsibilities. More to the point, you will now know for certain about everything in the production system.
Recreating a chronology of changes
At this stage, you will have a set of scripts that will re-create the development version of the current release without any production bits. The next step is to substitute in the human-generated code for the code generated by SQL Server, wherever possible.
Humanizing the current release code
Why this tedious extra stage, you might ask. There is always a temptation to abandon the attempt and start afresh, but original scripts are worth preserving wherever possible. If you’ve ever read the source code of an experienced database developer, you’ll know that it is more than fifty percent block comments or end-of-line comments. It is nicely formatted and presented in a logical order. By contrast, a generated script from SSMS or SQL Compare won’t have this useful added information. The original source of tables isn’t stored by the database at all, and only the SQL within a routine is stored. Also, it won’t be formatted to your team-standards. Tables and their associated columns, constraints, comments, and indexes will be in an order that is convenient for execution rather than understanding. For this reason, it is worth preserving where possible any well-constructed source code for any table or routine that exists.
To do this, you can compare the human-generated, and machine-generated code using SQL Compare to check that they produce identical results, and then swap the human-produced version for the machine-produced version, where possible.
Putting any missing history into Source Control
If you haven’t got a good source control record of changes, you will have to do as much as you can to create it from whatever exists. This starts with an exercise of retrieving as much as possible of the existing back history archive of scripts that are sufficient to be rebuilt into a reasonable source control record.
Once you have this back history of scripts in ‘master’, you will need to ‘rebase’ it against the object-level source code of the latest production release that you have just created, stored in a branch. Each source control system has a different way of doing this, and the technique will vary according to how you are doing your source control. The main reason for saving the creation of each object in its own file is that it reduces the chance of collisions and subsequent merges, and it is easier to attribute changes to specific developers. To match this object-level source code with what has gone before, you would want the commit log to look like a linear history, so that it appears that all the work happened in series, even when it didn’t. Hopefully, it will look like a logical progression where files were re-arranged and slightly altered to represent the current release.
The hardest part of putting a database, or set of databases, into source control is ensuring that you have a complete record of the source of all the components of the production system, stored in repositories as appropriate. The acid test is to build the production system in its entirety from the source. This isn’t done to in order that you can do a release that way, because you’ll want to do it via a deployment in order to save time and preserve the existing data. It is done only to prove it is possible. It is reckless to attempt a reliable deployment without the confidence of knowing that the source is complete and correct. You build a database to build confidence: to prove that it is possible to recreate a release. Once you know what is in a release, then a migration script that takes a database from one version to another, once tested, will reliably change a database from that one version to the other. You only need one well-tested migration script for each release.
Was this article helpful?
Tools in this post
SQL Source Control
Connect your databases to your source control system