Dealing with Database Data and Metadata in Flyway Developments
Before you get very far with database development you need to be clear about your strategy for handling data. In this article I'll explain some of these issues in general terms, and then demonstrate how you can navigate these problems easily with Flyway.
The general issues with development data
When you are developing a database, you need lashings of data, fake data. When using Flyway, you’ll probably insert this data via one or more migration scripts. Why not, after all you’ll need this data for bugfixes and continuing development and maintenance.
However, having created a version that’s ready for release (a release candidate), you’ll then need to produce a ‘variant’ of it that’s at the same version but retains only the reference data, meaning the static data or ‘enumeration’ data, that is required to run the database. It must contain no development data. You can then bulk import test datasets into it for load testing and so on. These will be standard datasets where you already know the correct result of processing of the data.
For a greenfield development, this release-variant of the database will also form the basis of the initial production database. When you release a new version of an existing database you have the choice of either applying a migration script that does the modifications whilst being certain of retaining all the existing data, or the ‘retro’ approach of taking the database offline and copying the existing production data into the new version before putting the new version online.
In general, a new version of a database is built with DDL scripts; and whatever dataset is required is imported subsequently. A database is altered with migration scripts that will include only DML that ensures that any existing data is retained in the target database.
One must mention one last issue with development data: it has sometimes been the practice in the past to use actual production data to develop databases. It is now considered a bad practice and any organization that allows personal or financial data to be used in this way risks being fined. In some cases, it may be possible to use carefully obfuscated or ‘masked’ production data. However, I’ve generally found it safer to generate fake data, and often more effective because it allows us to use data that tests the edge conditions, rather than use just the data that you already know works with your database. This can include having people with the surnames Wolfeschlegelsteinhausenbergerdorff, Ng, Tester, St. James, Null or O’Brien, names that strike fear into any Application Developer.
Flyway makes no great distinction between DDL and DML activities in migrations. This is great if the DML involves preserving data while making changes to tables. The problems arise if your migrations involve a lot of data insertions, because such development data is of no use for a deployment, and they clutter up the migration scripts. Data changes, unlike metadata changes, don’t affect a database version, so you just need to keep any migrations that merely import data separate from the migrations that do DDL (i.e., create or alter database objects such as tables).
By doing this, you can apply just the latter scripts to create or alter a database without data. To create a database without data, you merely rebuild a new database, copying the DDL migrations but leaving out any migrations that import the development data. The resulting database will have the same metadata and the same version, even though the Flyway schema history table will be different.
The data operations
Most database developers will have various scripts for doing each of the required data operations, such as:
- Clearing out an existing dataset by deleting all data from tables
- Saving a dataset to file
- Maintaining a file-based dataset
- Inserting a dataset from file
- copying the data from source database to target.
Deleting all data from tables in the target database
Before you load data into a table, you usually need to delete the existing data. There are standard SQL ways of deleting the data in a table, but they can be slow for large datasets. Generally, they are ‘transactional’ in that they can be rolled back.
Each RDBMS has its own, efficient way of ‘bulk removing’ data. RDBMSs such as Oracle, PostgreSQL, MySQL and SQL Server, have a
TRUNCATE command, which clears out the entire table (you can’t apply a filter) and works much more quickly. Only in some cases is the
TRUNCATE command transactional: with SQL Server and PostreSQL it is, in Oracle it isn’t, and in MySQL it can be (innoDB).
Unless you replace it in the import operation, you must never delete the reference data that is needed for the system to run. Reference data might include geographical information, or other information that is unchanging. It is rare to come across data that is genuinely unchanging, and the less you have the easier the database is to manage. I use read-only views to store reference data, so they don’t have to be explicitly rendered ‘read-only’ and are classed as changes in the metadata (you must change the creation script for the view to change the data).
The biggest problem that you are likely to face is that you can’t delete data that is the target of a foreign key constraint. There are two approaches to this: either you can empty the tables in the correct dependency order, or you can disable foreign keys and triggers for all the tables then you can empty them in any order. Here is a simple routine that deletes all the data in a SQL Server database.
The commonest way of approaching this whole problem is to shrug, and instead create a build script from the release database. You then use this to build a fresh database without data. Once you have this it can be distributed for testing or used for deployment. I show in other articles how to produce a build script on every migration run (see Flyway Branching Walkthrough or Creating Database Build Artifacts when Running Flyway Migrations).
Saving a dataset to file
Some database systems are shipped with tools that allow you to save the data rapidly to the file system of the server hosting the database. Data can be kept in file-based datasets, but it is also possible to copy data directly from one database to another. Usually, when saving to the file system, each table would have its own file.
The format for the file data is traditionally CSV even though the format has never had a formal definition. Nowadays, a good device-independent format is JSON, especially now that it supports JSON Schema. If you need to deal with one format, it is worth using, where possible, the file format supported by the database’s ‘native’ bulk loading tool (such as BCP, for SQL Server). I say “where possible”, but if a table changes its format the corresponding dataset has to be edited and JSON is easily manipulated and changed.
Maintaining a file-based dataset
Now that there are so many databases that are based on JSON, it has become surprisingly easy to edit data, insert or remove columns, split tables and so on. Some of these databases convert their JSON data to a binary format, but you can export these in JSON format. It is also easy to read these datasets into a .NET application or script and do the data conversion there.
Inserting a dataset from file
I’ve shown, in another article, how to bulk load the data separately using SQL Server. This is my preferred approach as it means you only need to create a dataset once, and it can be used as many times as you need, and archive in case you need it later to fix an obscure bug.
Most RDBMSs have a bulk data-load utility, but there is no standard cross-RDBMS way of doing it. Any bulk load operation is likely to assume that the existing migrations haven’t put data into the database.
If you’ve used surrogate keys that are generated by the system, you must ensure that foreign key references are correct, either by retaining all these original keys with your dataset, or by making sure that the system generates the same keys. This isn’t always easy, but if you’ve used an incrementing number such as SQL Server’s
IDENTITY key you need to make sure that they start with the correct initial value.
If you delete existing data and then bulk load data, then you will need to rebuild indexes. Depending on the RDBMS, you may also need to recreate foreign keys. If you’ve disabled foreign keys and triggers, you need to remember to re-enable them after you’ve finished.
Copying the data from a source database to a target
This is just included for completeness. I show an example of this in Database Kill and Fill. It is sometimes easier to circumvent the need for intermediate file-based data in adding data to a target database. SQL Data Compare allows you to do this, if you don’t mind waiting, but even this requires an intermediate file-based SQL script with the
INSERT statements. It certainly works, and I’ve used it in the past, but you lose the advantage of being able to repeat the operation for test runs because you will always need to connect to the source database to get the data, and you lose the certainty that the data is the same between test runs.
Creating a variant with Flyway: a database without data
I’ve already described, in Flyway Teams and the Problem of Database Variants, how to use the shouldExecute config option in Flyway Teams to produce variants of the same version of a database, such as to support to different access control regime or messaging system or version of SQL Server.
In this case, we simply want to create a variant that has no data, and we’ll take a simpler approach that uses my FlywayTeamwork framework and which, in this case, requires only Flyway Community. I’ve added all the required scripts to create this variant to the FlywayTeamwork-Pubs project, so you can play along. I explain how to use this framework to support a Flyway branching workflow in an earlier article, Flyway Branching Walkthrough.
We create a new database in SQL Server called PubsWithoutData. This isn’t a branch or a branch version. It is a branch variant. Each variant is at the same version as the branch, and shares the migrations used for the branch, but will have minor variations. In our case, it has no data.
We are working in the develop branch, so we create a variant of that branch called NoData. We can create as many of these variants as we need.
In the Migrations subfolder of our NoData variant, we copy all our migrations except for the three that just import data.
In the NoData directory, we have our PowerShell initialization routine, the flyway.conf file and a JSON file that describes the directory names we assign for the various locations (some folks don’t like my naming)
The flyway.conf file has just this:
flyway.placeholders.projectDescription=A sample team-based Flyway project
So, let’s create our data-less variant. It just requires a small slab of PowerShell!
# we now set the password. This can be done as an environment variable. but that isn't quite as saecure #/
Processing the $($dbDetails.variant) variant of $($dbDetails.branch) branch of the $($dbDetails.project) project using $($dbDetails.database) database on $($dbDetails.server) server with user $($dbDetails.installedBy)"
Flyway $pword info
Flyway $pword migrate
If we then run this it gives the following (I’ve left out the reams of info messages):
Processing the NoData variant of Develop branch of the Pubs project using PubsWithoutData database on Philf01 server with user Phil Factor" Schema version: << Empty Schema >> +-----------+---------+-----------------------------+------+--------------+---------+----------+ | Category | Version | Description | Type | Installed On | State | Undoable | +-----------+---------+-----------------------------+------+--------------+---------+----------+ | Versioned | 1.1.1 | Initial Build | SQL | | Pending | Yes | | Versioned | 1.1.3 | UseNVarcharetc | SQL | | Pending | Yes | | Versioned | 1.1.4 | RenameConstraintsAdd tables | SQL | | Pending | Yes | | Versioned | 1.1.6 | Add Tags | SQL | | Pending | Yes | | Versioned | 1.1.7 | Add Indexes | SQL | | Pending | Yes | | Versioned | 1.1.8 | AddEditions | SQL | | Pending | Yes | | Versioned | 1.1.9 | AddconditionalVersion | SQL | | Ignored | Yes | | Versioned | 1.1.10 | AddAddressesPhonesEtc | SQL | | Pending | Yes | | Versioned | 1.1.11 | AddProcedureWithTest | SQL | | Pending | Yes | +-----------+---------+-----------------------------+------+--------------+---------+----------+ Successfully applied 8 migrations to schema [dbo], now at version v1.1.11 (execution time 00:06.843s) Executing SQL callback: afterMigrate - Add Version EP WARNING: DB: Recording the database's version number - 1.1.11 (SQL State: S0001 - Error Code: 0) Executing SQL callback: afterMigrate - ApplyTableDescriptions WARNING: DB: Adding the descriptions for all tables and columns (SQL State: S0001 - Error Code: 0)
Hmmm. Let’s inspect the data in our Nodata variant using some trusty SQL:
-- Get Table names, row counts, column counts and compression status for clustered index or heap
SELECT Object_Schema_Name (o.object_id) + '.' + Object_Name (o.object_id) AS [Table],
Sum (p.rows) AS [RowCount], Columns.Columncount,
p.data_compression_desc AS [Compression Type]
sys.partitions AS p
INNER JOIN sys.tables AS o
ON p.object_id = o.object_id
(SELECT object_id, Count (*) AS Columncount
GROUP BY object_id) Columns
ON Columns.object_id = o.object_id
WHERE index_id < 2 --ignore the partitions from the non-clustered index if any
o.object_id, data_compression_desc, Columncount
ORDER BY Sum (p.rows) DESC;
Sure enough, apart from the Flyway schema history table, and some tables with reference data in them, the rest of the tables are empty:
…whereas our develop branch database definitely has data:
The final step is simply to prove that the databases are identical, which is easy since I have a copy of SQL Compare, courtesy of the ‘Friends of Redgate’ munificence. Without such a tool, we could do the schema comparison, albeit a lot more crudely, by comparing the JSON models for the develop branch and its variant.
Now we compare the two and, yes, the database is identical. The only difference is the extended property documenting the variant.
So, we can confirm that, provided we keep data import operations in separate migrations to the metadata changes, we can create a variant of the database at any version we need, but empty of data so we can now import any dataset that we need for testing or staging.
Flyway is well-placed for a team-based development because it can deliver a database without data ready for the test cycle, staging and deployment and, at the same time have a variant full of development data that is far more convenient for rapid work. I hope that I’ve convinced you that it saves a lot of ad-hoc work.
Was this article helpful?