17 November 2016
17 November 2016

How to start producing realistic test data with SQL Data Generator

Many organizations invest heavily in their User Acceptance Testing (UAT) environments, setting up infrastructures that mimic the production environment, often using replicated production data. This is important because it means they can perform acceptance testing as well as security testing, performance testing, error-handling testing, and stress/load testing in production-like conditions, and so be confident they’ll see behavior comparable to what will be seen in production.

The downside is that it means that a lot of essential database testing is often delayed until ‘the last stop before production’, and subsequently this can uncover a lot of nasty surprises late in the deployment pipeline.

One way we can ‘shift left’ some of these database testing processes, so that they are performed as early as possible in the pipeline, is to give developers and testers access to data that reflects as accurately as possible the volume and nature of the real data. However, it is often not possible, either logistically or legally, to replicate vast volumes of production data in development and test servers.

So, what’s the answer? This article takes a first look at SQL Data Generator, a tool that I believe will give developers and testers a simple way to generate realistic test data in volumes comparable to future production requirements, and therefore to complete competent and diligent stress and load testing on that data, during the development and testing phases.

Unpacking SQL Data Generator

We’re going to take a look at how SQL Data Generator (SDG) goes about generating realistic test data for a simple ‘Customers’ database, shown in Figure 1. We’ll also take a first look at the options available to customize the default data generation mechanisms that the tool uses, to suit our own data requirements.

sql-dg-1

Figure 1

First, download SDG. It comes bundled into SQL Toolbelt Essentials and during the install process you simply select only the tools from the toolbelt that you need.

Next, create the Customers database and the sample tables. The script is available to download here.

sql-dg-2

Figure 2

Finally, start up SQL Data Generator, and on the Project Configuration screen, enter the instance name, the authentication mode, and select the Customers database.

sql-dg-3

Figure 3

In the Scripts tab, we can define in scripts any tasks we need to perform pre- and post-data generation, and in the Options tab, for data generation, we can define whether or not we wish to fire INSERT triggers, enforce CHECK constraints, and define the batch size in which SDG generates rows. We can also opt to enable DELETE triggers when we regenerate data (SDG will automatically delete all existing data first).

We’re just going to connect to the database and when we hit OK, SDG creates a new project; it analyzes the column names and data types of all the tables, as well as all keys and other constraints, in each of the tables in the Tables to Populate list.

sql-dg-4

Figure 4

For each table in the Tables to Populate list, the top half of the right hand side of the screen shows the source of the data and number of rows to generate for each table, and the bottom half shows a preview of the first 100 rows of data generated; SDG has automatically chosen what it thinks is the most appropriate built-in data generator for each column, and generated some sample data.

sql-dg-5

Figure 5

Data source and number of rows

Generally, we’ll populate each column using one of the built-in generators, and so use the default Generate data option. However, especially for lookup data, we may want to simply import the data for that table and so select use existing data source, such as a table or a CSV file.

The next question is: how many rows do we need to generate for each table? For example, is it adequate to run tests with 1,000 customers (the default) in the Customers.Person table?

Maybe, but it depends. Consider that a company like Amazon adds about 30 million customers per year, which is an average daily load size of 82,191.78 rows. This means that depending on the type of testing, you’ll want to test against a minimum of 100K rows.

By default, SDG will delete all existing data and regenerate it each time.

The built-in data generators

As noted above, SDG auto-chooses what it thinks is the best generator for each column, based on the column name, data types and taking into account any applied keys and constraints.

By default, SDG uses a different seed value for each generated column of data. Each unique seed value produces a different sequence of values. The seed for the first generated column of the first table in the Tables to populate list is 0 and this value increments by 1 for each subsequent column. We can have a maximum of 1,024 columns in a normal table, so the seed for the first column of the second table in the list will be 1,024. If we assign the same seed value to two columns, and all other properties of that column are the same, then SDG will generate identical data for each of those columns.

There are a large number of generic built-in generators, plus the ability to write our own generators. I won’t cover custom-built generators at all in this article, and I am not going to attempt to summarize all or even most of the built-in generators – they are all listed in the tool documentation. However, I will cover the basics of data generation for key columns, plus review some of the common generators for date and text columns in our Customers database, and also how we might start to tweak them to better mimic our real production data. You will also want to take some time to explore how to configure data generation for other data types, such as int, decimal, and bit.

Simply click on any column in a table, in the Tables to Populate list, or on the column header in the data preview pane, to reveal the selected built-in generator for that column.

Key columns

Examining the very first table, Customer.Abode, SDG has recognized that Abode_ID is an IDENTITY column and will use ‘server assigned’ values. It has also recognized each FOREIGN KEY and uses for each an automatic Foreign Key (automatic) generator. There is also a Foreign Key (manual) option if you’re unlucky enough to be in a situation where the keys are defined outside the database.

sql-dg-6

Figure 6

All of the FKs in this tables will, of course, be generated to match values found in their respective PK columns. Since Person_ID and Address_ID have identical characteristics, and the same seed value, they will contain the same values in each row. To change this, you can increment the seed value for one of the columns to 1.

Notice that based on our knowledge of the distribution of data in these key columns in the production data, we can tweak the distribution in the generated data in terms of how many NULL values appear in the data, how often column values repeat in the child table, and so on.

Date columns

All of the date columns in our Customers database have the datetime datatype. In each case, taking the start_date and end_date columns in the Abode table as examples, we have the opportunity to influence the range and distribution of the data in the table.

sql-dg-7

Figure 7

SDG has selected a datetime generator for this column, which of course is sensible! Since start_date is the fifth column in the first table, the data generated for this column has a seed value of 4, and the end_date column will have a seed of 5. Before making any changes, reduce the seed for end_date to 4 and you’ll see that it now contains identical data to start_date.

Since this column was defined to allow NULLs, the Allow null values check box is checked. We can influence the proportion of NULL values in the column and the minimum and maximum date range. Tweak the values as you wish and then click on the end_date column. Instead of using a Min/Max date range, you might want to use an offset from the start_date column. I’d also expect the proportion of NULL values to be higher (indicating a current address).

sql-dg-8

Figure 8

Alternatively, you could choose to generate data for end_date using the cross column Data Add generator:

sql-dg-9

Figure 9

By default, this will add exactly 100 days to the start_date, but behind this is a python script, and by customizing that code, you can exert more granular control over the relationship between the dates in these columns.

String columns

For textual columns, SDG offers an array of generic, personal, geographical and other generators:

sql-dg-10

Figure 10

As you can see, we can generate data by importing from a list, using a regular expression, using a SQL statement, using a Python script and more. Here, I’m only going to cover two of the most common: regexes and import from list. Many of the built-in generators for geographical, personal and other types of data use these techniques.

Regexes

SDG often uses its core Regular Expressions generator to define the string data generation. There are some good examples of this in the Customer.Address table, shown in Figure 9 above (before generating that figure, I swapped out the chosen generators for some columns in order to generate data that consistently used US-only addresses). The regex used for the ZIP Code generator is very simple, as shown in Figure 10:

sql-dg-11

Figure 11

The expression, [0-9]{5}, simply means “pick a random number between 0 and 9, five times“. The regular expression used for this column is stored in an XML file in C:\Program Files (x86)\Red Gate\SQL Data Generator 3\Config (in this case called ZIP Code.xml).

We have other options on this screen:

  • The Add Snippet… button offers various common regex ‘phases’ as built-in snippets and can help speed up building these regexes
  • The Insert File List… button allows us to import data from a file, then use it as part of the regex
  • The Insert Table Column… button allows us to include in the regex data from another column

The regexes for the address columns are more complex, but still reasonably straightforward once you know your way around them. Phil Factor’s article does a great job of explaining these regexes and showing how to customize them.

Try out what’s possible. For example, if you want buildings named after colors rather than presidents, in AddressLine1, just swap the (Lincoln|Washington|Jefferson…) Building part of the regex with an insert from a file list: (($”Color.txt”) Building).

Once you’ve customized the regex exactly as you want, you can choose Save as, to save the custom form of this instance of the generator, so that you can reuse it in other projects:

sql-dg-12

Figure 12

This saves the XML file to a specific location (see Tools | Application Options to find out where) and, when you open a new project, you’ll see your custom version is available in the User Defined section:

sql-dg-13

Figure 13

Import from list

Another very common way to populate a text column, such as names of people, cities, countries and so on, is simply to import values from a list:

sql-dg-14

Figure 14

If you take a look in C:\Program Files (x86)\Red Gate\SQL Data Generator 3\Config, you’ll find all the built-in lists, but you can easily add your own custom lists of data.

Sometimes SDG gets it wrong

As you examine the various columns, you’ll spot some cases where SDG gets it wrong. In the Customers database, for example:

  • TypeofAddress column in Customer.AddressType – SDG has chosen an AddressLine generator, but we probably just want to import appropriate values, such as ‘work’, ‘home’ or ‘delivery’, from a list
  • TypeOfPhone column in Customer.PhoneType – same problem as above. SDG has chosen a Phone Number generator, whereas we’re after values such as ‘Mobile Work’, ‘Office landline’ and so on
  • DiallingNumber in Customer.Phone – this should just be a phone number

If you don’t find what you need in one of the alternative built-in generators, you may need to write your own, or grab and customize an existing custom generator – a topic for another article!

Generating the data

Once you’ve defined how each column should generate data, you can save the project and run it to generate data for the entire table. As discussed earlier, we need to decide how many rows are appropriate for each table, maybe 50K rows for Address and Person tables, 75K for Abode and Phone (to allow for people moving home, changing phone number) and as appropriate for the other tables.

Figure 12 shows a screenshot for a couple of the tables, with sample rows generated for the Abode and Address tables:

sql-dg-15

Figure 15

There are quite a few other considerations to take into account, which we won’t have time to explore in detail here, but for example:

  • Often, especially when loading a data warehouse, you’ll want to load some tables with data and not others. In such cases, you may need to disable the foreign key constraints and re-enable them after loading the test data.
  • It’s also common to disable and re-enable indexes before and after data loads
  • If you use DELETE triggers and you choose to have the tool delete rows when a truncate cannot be done, beware that the delete triggers will fire on every delete

Summary

If you want to run database tests that extend beyond basic unit and integration tests, earlier in the development and testing lifecycle, but you can’t use real production data, then a tool like SQL Data Generator may provide a solution. It will allow you quickly generate data that reflects reasonably accurately the nature of your real production data, and in volumes sufficient to perform realistic data load and performance testing.

Many thanks to Ryan Casey for his contributions to this article. Check out his blog post, where he describes how he uses a tool like SDG as part of a broader enterprise strategy for data loading and database testing.

Tools in this post

SQL Data Generator

Create large volumes of realistic test data with a couple of clicks in SQL Server Management Studio.

Find out more

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

Related posts

Also in Hub

Tackle common database problems more effectively

While the database is sometimes seen as the blocker in the development process, it doesn’t have to be. With the right knowledge and tools, you can stop problems before they occur, help to speed up d...

Also in Product learning

Get a lot more value from your tools

From provisioning databases through to version control, automated deployments and monitoring, Redgate has a tool for every part of the database development process.

You probably have a favored tool b...

Also in SQL Data Generator

Basic data masking for development work using SQL Clone and SQL Data Generator

This article describes a lightweight copy-and-generate approach for making a sanitized version of a production database available to development teams with SQL Clone and SQL Data Generator.

We build ...

Also about SQL Data Generator

Automatically build-and-fill multiple development databases using PowerShell and SQL Data Generator

When you are working with a database, you always need data. This is why you need to stock the database with data after you build it. Sometimes, you just want a large number of made-up customer details...