Product articles Flyway Database Testing and Quality
Bulk Loading Data via a PowerShell…

Bulk Loading Data via a PowerShell Script in Flyway

The test data management strategy for any RDBMS needs to include a fast, automated way of allowing developers to "build-and-fill" multiple copies of any version of the database, for ad-hoc or automated testing. The technique presented in this article uses a baseline migration script to create the empty database at the required version, which then triggers a PowerShell callback script that bulk loads in the right version of the data.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.


This article is part of a series on managing test data in a Flyway development:

Also relevant are the series of related articles on Database testing with Flyway.


Why would a database developer need anything other than SQL INSERT statements to generate or load test data? Well, there are at least two points in any database build or migration process where you might need a better approach.

The first is when you need to load a high volume of data at a reasonable speed, such as for performance testing. Ideally, the quantity of data will be in the same order of magnitude as the actual or expected production data. If you attempt to load high volumes of data with individual INSERT statements, it will slow down the build dramatically. Even with the latest iteration of the Pubs practice database, I can reduce the build time from twenty minutes (INSERT scripts) to under a minute (bulk copy).

Each database system has its own way of doing bulk data import and export; it is a general requirement. SQL Server has BCP and I demonstrated how to use it to bulk load test data for a Flyway build process in Getting Data In and Out of SQL Server Flyway Builds. Likewise, PostgreSQL has its COPY command, , MySQL its LOAD DATA INFILE or mysqlimport utility, Oracle has its BULK INSERT, BULK COLLECT and FORALL, and so on. Data files won’t necessarily be stored on the machine hosting the database, and so a script executed on a workstation is much better placed for the task than a SQL batch.

The second task that usually needs to be scripted is the generation of randomized data for development work. Again, you need plenty of it, because every bit of SQL you ever write looks great until it faces a tsunami of data. Although it’s possible to generate data using SQL, the process is much easier in a scripting language because you are likely to need to keep many different data banks on file to use for different test cases. I tend to use a great deal of typical text as a resource, which I reduce to Markov tables to create fake but convincing data. All this data needs to be imported rapidly into the database.

Any system for handling bulk data boils down to getting data in and pulling data out. These two activities are best contrived as two matching processes, import and export, because most bulk loading systems have a lot of options for the formatting of data, and infinite possibilities in the naming of data files.

Importing the right version of the data

In any relational database, data is likely to be version-specific: when you are importing data at speed, it will only squeeze into the tables for which it was designed. When you are doing automated testing, the data must be correct for the test run because you will be checking the result to ensure it is what is expected.

There isn’t necessarily a one-to-one ratio of data edition to version; you may need more than one dataset for each version. You generally only hit problems with the data if you change the base tables in any way. The more mature a database becomes, the less inevitable these table changes become, and the more likely that one data set will last for several versions. However, there will inevitably be mapping of database version and data version at some point in the database lifecycle. To accommodate this, I’ll introduce a method that is reasonably flexible.

Creating the test data

There are many ways to create test data. You can, for example, start with a fresh build without data, anonymize real production data or top up existing development data. Data generators are often scripted or come as tools. Whatever you use, they come with a ‘plan’ of how to generate data for each column, and how the data interrelates. I’ve written several articles on how to do this, and I often use SQL Data Generator for the job and then transfer the data to whatever RDBMS I’m currently using. Whichever way you do it, you then must create the file-based storage for the table data. Although BCP can insert CSV files, SQL Server’s fastest way loading of data, for example, is using native BCP, meaning that the data is in the binary format required. PostgreSQL’s COPY command is similar in that respect.

As discussed earlier, we need to get data in and out of a Flyway-managed database in such a way that we know the database version to which it relates. If we can do this, then it means that we can create test data for automated post-migration integration tests. This may seem a paradox, but the integration build of the database will be done after the unit testing in development, so the test data is likely to already exist that is needed for the type of automated testing done during deployment.

I’ll explain how to deal with some of these complications by illustrating it with a development scenario

Scenario: integration testing after a merge

Let’s imagine that work on a feature branch is complete, and the team have devised a single migration script to merge the changes into the parent branch, to create the new version. However, before committing the script to the parent, Develop, say (or to main if this is an external release), they want to perform sufficient integration testing to be confident that the migration script will work as expected, preserving all existing data, and that the resulting new version works to requirements.

One possible workflow for this will start by creating a private, empty ‘test build’ of the current database version (in Develop). They can do this using a ‘repeatable’ build in Flyway community but a better way is to use a baseline migration (a.k.a. state) script in Flyway Teams.

The team bulk load it with test data, and then apply their migration script containing the feature changes, which will also need to migrate existing data and add any new data as required. This enables them to run enough integration tests to be sufficiently confident to update the Develop branch with the migration script. This triggers the build of the new version and, assuming all is well, they can then export and save the data for the new version, as well as create a baseline migration for the new version. Automated and ad-hoc testing is now easy because the existing databases can be upgraded as usual, with a migration script, and newly created databases, whether they are on new docker images or traditional database servers, can be built rapidly at the correct version.

This is just one example, and your own workflow will vary according to the type of database and its importance to the organization. Staging will have rather different requirements. However, one thing is certain: to do testing you need to get data in and out rapidly.

For our demo of this workflow, we set up our branch project. Here is the Scripts subdirectory:

Flyway baseline migration script and PowerShell callback

It contains a Baseline Migration script, which just builds the schema but inserts no data. It’s rather like doing a traditional build except that Flyway gives the resulting database is given the correct version number and maintains a version history, so that we never lose track of what is where. Therefore, when you create this “B” script, make sure you don’t include the Flyway history table as that would confuse Flyway. Also, remove any extended properties to do with the database.

The second file, a PowerShell callback script, is called once the migration has been done. It determines the version reached, and then looks to see if there is an associated data directory with the bulk data in it. We’ve chosen to store this data within the project. Fortunately, we already have it so it can be imported:

Maintaining a Data folder for each database version

You’ll notice BCP files in different files for different versions, each in their own directory. If a data set for the version exists, it fills the database with it. Otherwise, it looks for next lowest version that does have a data folder, assuming that the new data is saved at the version where the structure of the data last changed. The script imports the data into the correct tables.

BCP files for each table

When we do the Flyway migration, we get the following output:

Successfully validated 1 migration (execution time 00:00.158s)
Creating schema [classic]…
Creating schema [people…
Creating Schema History table [PubsSix].[dbo].[flyway_schema_history] ...
Current version of schema [dbo]: null
Migrating schema [dbo] to version "1.1.12 - BuildSchema"
Executing script callback: afterVersioned - ImportData
Executing powershell S:\work\Github\PubsAndFlyway\PubsFlywayTeamsBuild\Scripts\afterVersioned__ImportData.ps1
scriptblocks and cmdlet loaded. V1.2.40
Executed FetchAnyRequiredPasswords
Executed GetCurrentVersion
Executed BulkCopyIn

Successfully applied 1 migration to schema [dbo], now at version v1.1.12 (execution time 00:50.149s)

We then do an info action

----------------------------------------
Flyway Teams Edition (10 schemas) 8.0.1 by Redgate licensed to red-gate.com (license ID c6c1a4e9-632f-4cd0-8f57-c27e360e0341) until 20
24-08-24
----------------------------------------
Schema version: 1.1.12
+-----------+---------+------------------------------+--------------+----------+----------+
| Category  | Version | Description                  | Type         | State    | Undoable |
+-----------+---------+------------------------------+--------------+----------+----------+
|           |         | << Flyway Schema Creation >> | SCHEMA       | Success  |          |
| Versioned | 1.1.12  | BuildSchema                  | SQL_BASELINE | Baseline | No       |
+-----------+---------+------------------------------+--------------+----------+----------+

Seems OK. We take the database for a quick spin just to make sure that the data is in there OK.

Querying a view

We can review some details about or books and their publishers:

checking books and publishers

And the bitter complaints of our less than satisfied customers…

Customer complaints

So, all you now need is a directory with all the connection information for all the databases that need to be provisioned at that version, and away you go. In terms of the build, it is remarkably quick, so can be used for provisioning as many databases as you need as quickly as using Docker images.

Running the Flyway Build

We’ll start with the script that executes Flyway Teams to do the ‘variant build’ process, using our baseline migration script. This is very similar to any other code for executing Flyway. You can use your preferred method such as using environment variables.

The PowerShell Callback to insert the data

This is the afterVersioned_ImportData.ps1 callback script and so Flyway will execute it after every versioned migration script that runs successfully during a Migrate. In our example, it will run after the B1.1.12 build script completes.

The first part of the script gathers the data we need, passed in from environment variables and various placeholders, and then the second part uses this data to execute some tasks from our DatabaseBuildAndMigrateTasks.ps1 library of functions, including getting the current version number ($GetCurrentVerison task) and then bulk copying in the data for that version ($BulkCopyIn task).

By keeping all the routines in a library file, it means all my callback scripts look very similar. It also makes it easy to add extra tasks into the callback rather than have more than one callback on the same event. See Scripting with Flyway Teams and PowerShell for more details about how this sort of callback script works.

I’ll present the code here, but by the time you read this, a more up-to-date version of this script is likely to be in the GitHub directory.

To output the BCP data

In our earlier scenario, the developers would now apply the migration script for the new feature, and then run all their tests to check out the new version. Once all this is complete, they can then can BCP it out to a directory so that it’s ready to use for other developers who need a quick build of the new version.

This is even simpler because it doesn’t have to call Flyway. To create a BCP Data directory you just execute this code. It checks the version of the database before it saves the BCP Data. Because we use the same logic to save the file as we do to read the file, we can be sure that the right table gets the data.

Conclusions

When databases grow up, the insertion of data can become a drag because it takes so long. Database builds soon become overnight builds. By instead inserting the data separately in a single process, you can increase the speed of database builds by orders of magnitude. This requires a slightly different development culture, but can bring benefits, especially in standardizing the data for each version, but allowing different datasets for testing a particular version.

The great advantage of the migration approach is that each version script migrates both data and metadata together. By the time the script is required to update a production server, it will be well-tested. The same isn’t true of development that is done purely with static database scripts like the code illustrated here, because this approach would require an additional script to update a production system from its current version to be identical to the new release version whilst preserving its production data. No database methodology can avoid using a migration for the update of the production database.

The Baseline script in this article takes the database from zero to 1.1.12. In reality, these can only be used for doing just that, as required for provisioning development or test databases from zero rapidly up to that version. You will always still need to rely on migration scripts for updating existing versions of the database.

Static scripts generally don’t include data insertions except possibly for enumerations that never change. Metadata is clearly distinguished from data. Migration scripts take a more liberal approach to this, in that they allow more freedom, and makes it easier to fit the tool to team practices, the requirements of the test team, the scale of the database, the legal framework in place for the data, and the security requirements. To move to an approach of bulk-loading data with Flyway, it requires more discipline in the way that data is loaded and saved, and we’ll go into this a bit more in a later article.

 

Tools in this post

Flyway

DevOps for the Database

Find out more

Flyway Teams

Ideal for organizations looking to improve collaboration and fine tune their processes during development and the deployment of database changes.

Find out more