Build and fill a database using JSON and SQL Change Automation

Phil Factor demonstrates how to export data from a database, as JSON files, validate it using JSON Schema, then build a fresh development copy of the database using SQL Change Automation, and import all the test data from the JSON files.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

When you are developing a database, and doing regular builds from Source Control, you must produce a working database, of course, but it also needs data. The data you use will vary according to the type of tests you need to run. Integration test runs generally use one or more standard datasets so that, so that you can test each process, such as an end-of-day reconciliation, with a set of standard inputs and check that for a given input, your output is always the same. When you are doing performance testing you need lashings of data, of the same distribution of the real data. If you are doing user-acceptance testing, the data needs to look real, and so on. You could end up needing several sets of data. Loading the data manually can be an awful chore so it is best to automate the process.

There are several ways of maintaining this data. In my experience, it is usually kept in files, which generally come from a database, or series of databases. The first task is to synchronize the source database with the current build, in case table structures have changed. We can then export the data into a file directory (one table per file), and the imported into the development copy of the database during the build process.

BCP native format is fastest way of importing or exporting data, of course, but can’t be used if the target database changes its table schemas. We can use CSV, but SQL Server is puzzlingly bad at CSV, and it is easy to get problems. As I now tend to use MongoDB for building data sets, I’m increasingly using JSON import. If it is used with JSON schema, then the data can be checked and validated well before the build, so that success at the stage of data import is well guaranteed.

The step-by-step approach

To demonstrate the principles, I’ll do it step by step. We’ll start with a copy of the PUBS database, on the same server as the test database, containing all the data we need. I also have a utils database containing the various utility procedures we need to do the leg-work of the data export-validate-import process. As we just use stored procedures, we can merely use temporary stored procedures to do the same thing.

I’ll export the data from the Pubs database, as JSON files, where each file represents the intended contents of a table, stored in a standard ‘object-within-array’ format. I’ll validate the data using JSON Schema, build the development copy of the database using SCA, and then import all the data from JSON files.

Exporting data to JSON

We need to start with a directory containing the source of our data. We’ll use the venerable Pubs database (in fact, a modified version, with plenty of data in it) just to illustrate. I’ve provided the source in the accompanying Pubs.zip file).

Additionally, we have the contents of PUBS, greatly augmented from the original, stored as JSON files.

In my example, I exported the data via MongoExport with optional array brackets and commas. As an alternative, we can easily use a SQL Server with the data to provide the JSON data files. I usually do this from PowerShell purely because of its superior file handling, but let’s see how to do it in SQL. The @path directory must already exist on the target server (so, in this example, ‘C:\data\RawData\PubsData’).

I provide the source of these procedures here. They need to be compiled in on the same connection, which means in the same SSMS query window, or on the same connection in PowerShell.

Saving this data should be a trivial matter but there is a complication in that some of the CLR datatypes can cause an error, so the code is slightly more complicated than you might want. We need to output this in UTF8.

Building the database with SQL Change Automation

We can build the database easily enough on our SQL Server. First, we create the database.

Then, with SQL Change Automation (SCA), we can very quickly build the database. In this case, I’m using SQL Server authentication, but Windows Authentication is even easier.

Checking the Data

If you are feeling lucky, you can skip this stage, which is to check the JSON import files to make sure that they have good data. Sometimes, if you are masking data, or generating new sets of data, or pulling data in from public sources, you’ll find that things can go wrong.

We can generate a JSON schema for the JSON files from our new database, because any SQL Server database knows what it can ‘eat’. We can easily get that information, and we can add additional information such as the SQL datatypes of each column and the column order of our tables. We want this JSON Schema because we can use it to check the data files to make sure that the data is all there for the columns we need, and generally of a type that can be coerced into a SQL Server datatype.

We can do this in SQL if we can use xp_cmdshell on the built database. Otherwise we can use PowerShell or any .NET scripting. I usually use PowerShell scripting because SQL Server is notoriously unreliable with producing or consuming the standard UTF-8 format that is standard for JSON files. However, we’ll show this in SQL. Again, the target server directory must exist.

I give the source of these routines here. We can now use PowerShell to run the checks. Note that if you are using MongoDB as your source, it adds its own additional primary key, of a particular BSON datatype. We won’t want it, so we simply ignore that data.

We use NewtonSoft’s JSON Schema Validator, which will flush out any problems if we specify the checks in the JSON file.

I’ve arranged the data in two directories, schema and data, and the files have the same name in each (see Pubs.zip). The path to this is defined by the variable $OurDirectory.

Execute this PowerShell on your server:

After a bit of sorting out of issues with the data, we get

The data of dbo-authors.json is valid against the schema
The data of dbo-discounts.json is valid against the schema
The data of dbo-employee.json is valid against the schema
The data of dbo-jobs.json is valid against the schema
The data of dbo-publishers.json is valid against the schema
The data of dbo-pub_info.json is valid against the schema
The data of dbo-roysched.json is valid against the schema
The data of dbo-sales.json is valid against the schema
The data of dbo-stores.json is valid against the schema
The data of dbo-TagName.json is valid against the schema
The data of dbo-TagTitle.json is valid against the schema
The data of dbo-titleauthor.json is valid against the schema
The data of dbo-titles.json is valid against the schema

Importing the data

So now we have the data and we have the database. What can go wrong?

  • We need a bit of fiddling because the OPENJSON function can’t have certain deprecated datatypes in its ‘explicit schema’ syntax,
  • OpenJSON cannot use some of Microsoft’s own CLR datatypes.
  • We must cope properly with identity fields.
  • We need to temporarily disable all constraints before we start and turn them back on when we finish.

I generally use PowerShell for this: We can import this via SQL but here I’m doing so purely to illustrate that it can be done, more or less. The worst problem I found was that it is impossible to read in a UTF8 text file (the standard for a JSON file) and preserve the characters! You can do it in BCP but not OPENROWSET for some reason. This is said to work but doesn’t:

Fortunately, this database is from the old world of ASCII, so we can do the best we can in SQL. We execute the following SQL

The source of the procedure that does the real work is attached to the article and stored here.

A scripted approach

Here we have a script that creates as many databases as you need on as many servers as you want, using whichever data sets you choose. For each database, you can provide a source code directory, a JSON data directory (JSONData), and a JSON schema directory (JSONSchema). The JSONData directory must correspond to the database, in terms of the tables and their metadata.

In this example script, I’ve given the information for the databases, servers and login in a PowerShell data structure. Usually, I turn this into a PowerShell cmdlet that takes the config as a filename. However, this script form allows much easier debugging and change. The current data structure is to give a minimal example that you can build on.

The script works in a similar way to the manual approach, but I create a temporary stored procedure to do the hard work, and use SMO for the slow dignified bits. This is a more satisfactory approach, if only because one can read in the JSON files within PowerShell and they can be anywhere visible to your workstation on the network. By using just temporary stored procedures for the heavy lifting, we keep the database and server clean without a lasting trace of our activities.

I have not embedded the routine that tests the JSON data, because I usually do that separately. Also, the generation of the JSON schema and the JSON data is normally a separate process. As you can imagine, the generation of the schema must be done by doing a database build from source control. It need have no data in it at all. The schema can then be generated and then tested against the data. Normally, one would wish to add checks to the schema, such as Regexes. (wonderful for the checking of dates, email addresses and the like).

And after it runs we can easily check that all is OK with SQL Data Compare!

Conclusions

Storing your data for development work as JSON confers several advantages. You can exchange data with any application or service that can consume JSON data, you can edit the data easily in a JSON editor, use lightning fast data masking that doesn’t leave a trace in the database, and you can very easily validate the data before you insert it. These validation checks include some that can’t be done in SQL Server, such as RegEx checks. JSON can store tabular data in several formats, some of which are almost as economical as CSV. I haven’t noticed data import to be any slower than CSV.

JSON seems to have suddenly grown up.

Tools in this post

SQL Change Automation

Automate your database changes with CI and automated deployment

Find out more

SQL Data Compare

Compare and synchronize SQL Server database contents

Find out more

SQL Toolbelt

The industry-standard tools for SQL Server development and deployment.

Find out more