How to Generate Fake Test Data for SQL Server
How to use a SQL random data generator to fill SQL Server tables with realistic test data, to shift left database unit testing, integration testing and performance testing so that it is performed during the early stages of database development.
Developers need a fake data generator that they can use to produce test data sets that match the characteristics and distribution of the real data as closely as possible, so that they can perform effective test-driven development, for databases, right from the start of the development cycle. For certain types of tests, such as stress and load testing, they will also need to be able to generate test data in volumes comparable to future production requirements. This article takes a first look at how to use Redgate’s SQL Data Generator as a random data generator to populate SQL server databases with fake but realistic data.
Shift left testing
Many organizations invest heavily in their User Acceptance Testing (UAT) environments, setting up infrastructure that mimics the production environment, and often using replicated production data. This is important because it means they can perform acceptance testing as well as security testing, performance testing, error-handling testing, and stress / load testing in production-like conditions, and so be confident they’ll see behavior comparable to what will be seen in production.
The downside is that it means that a lot of essential database testing is often delayed until “the last stop before production“, and subsequently this can uncover a lot of nasty surprises late in the deployment pipeline.
One way we can “shift left” database testing, so that it is performed as early as possible in the development cycle, is to give developers and testers access to data that reflects as accurately as possible the volume and nature of the real data. However, it is often not possible, either logistically or legally, to replicate vast volumes of production data in development and test servers. In such cases, the best solution is to use a test data generator, such as SQL Data Generator that can produce realistic but randomly-generated data, and is highly configurable to suit the demands of the data and the required tests.
Create the Customers database
We’re going to take a look at how SQL Data Generator goes about generating realistic test data for a simple “Customers” database, shown in Figure 1. We’ll also take a first look at the options available to customize the default data generation mechanisms that the tool uses, to suit our own data requirements.
To create the Customers database and the sample tables, the script is available to download at the end of this article.
Connect SQL Data Generator to the test database
First, download SDG. It comes bundled into “SQL Toolbelt Essentials” and during the install process you simply select only the tools from the toolbelt that you need.
Start up the tool, and on the Project Configuration screen, enter the SQL Server instance name, the authentication mode, and select the Customers database.
In the Scripts tab, we can define in scripts any tasks we need to perform pre- and post-data generation, and in the Options tab, for data generation, we can define whether or not we wish to fire INSERT
triggers, enforce CHECK
constraints, and define the batch size in which SDG generates rows. We can also opt to enable DELETE
triggers when we regenerate data (SDG will automatically delete all existing data first).
Instant fake data
We’re just going to connect to the database and when we hit OK, SDG creates a new project; it analyzes the column names and data types of all the tables, as well as all keys and other constraints, in each of the tables in the Tables to Populate list.
For each table in the Tables to Populate list, the top half of the right hand side of the screen shows the source of the data and number of rows to generate for each table, and the bottom half shows a preview of the first 100 rows of data generated; SDG has automatically chosen what it thinks is the most appropriate built-in data generator for each column, and generated some sample data.
Choosing a Data source and number of rows
Generally, we’ll populate each column using one of the built-in generators, and so use the default Generate data option. However, especially for lookup data, we may want to simply import the data for that table and so select use existing data source, such as a table or a CSV file.
The next question is: how many rows do we need to generate for each table? For example, is it adequate to run tests with 1000 customers (the default) in the Customers.Person
table?
Maybe, but it depends. Consider that a company like Amazon adds about 30 million customers per year, which is an average daily load size of 82,191.78 rows. This means that depending on the type of testing, you’ll want to test against a minimum of 100K rows.
By default, SDG will delete all existing data and regenerate it each time.
The built-in data generators
As noted above, SDG auto-chooses what it thinks is the best generator for each column, based on the column name, data types and taking into account any applied keys and constraints.
By default, SDG uses a different seed value for each generated column of data. Each unique seed value produces a different sequence of values. The seed for the first generated column of the first table in the Tables to populate list is 0 and this value increments by 1 for each subsequent column. We can have a maximum of 1024 columns in a normal table, so the seed for the first column of the second table in the list will be 1024. If we assign the same seed value to two columns, and all other properties of that column are the same, then SDG will generate identical data for each of those columns.
There are a large number of generic built-in generators, plus the ability to write our own generators. I won’t cover custom-built generators at all in this article, and I am not going to attempt to summarize all or even most of the built-in generators – they are all listed in the tool documentation. However, I will cover the basics of data generation for key columns, plus review some of the common generators for date and text columns, in our Customers
database, and also how we might start to tweak them to better mimic our real production data. You will also want to take some time to explore how to configure data generation for other data types, such as int
, decimal
, and bit
.
Simply click on any column in a table, in the Tables to Populate list, or on the column header in the data preview pane, to reveal the selected built-in generator for that column.
Populating key columns
Examining the very first table, Customer.Abode
, SDG has recognized that Abode_ID
is an IDENTITY
column and will use “server assigned” values. It has also recognized each FOREIGN KEY
and uses for each an automatic Foreign Key (automatic) generator. There is also a Foreign Key (manual) option if you’re unlucky enough to be in a situation where the keys are defined outside the database.
All of the FKs in this tables will, of course, be generated to match values found in their respective PK columns. Since Person_ID
and Address_ID
have identical characteristics, and the same seed value, so they will contain the same values in each row. To change this, you can increment the seed value for one of the columns to 1.
Notice that based on our knowledge of the distribution of data in these key columns in the production data, we can tweak the distribution in the generated data in terms of how many NULL
values appear in the data, how often column values repeat in the child table, and so on.
Generating Dates
All of the date columns in our Customers
database have the datetime
datatype. In each case, taking the start_date
and end_date
columns in the Abode
table as examples, we have the opportunity to influence the range and distribution of the data in the table.
SDG has selected a datetime
generator for this column, which of course is sensible! Since start_date
is the fifth column in the first table, the data generated for this column has a seed value of 4, and the end_date
column will have a seed of 5. Before making any changes, reduce the seed for end_date
to 4 and you’ll see that it now contains identical data to start_date
.
Since this column was defined to allow NULL
s, the Allow null values check box is checked. We can influence the proportion of NULL
values in the column and the minimum and maximum date range. Tweak the values as you wish and then click on the end_date
column. Instead of using a Min/Max date range, you might want to use an offset from the start_date
column. I’d also expect the proportion of NULL
values to be higher (indicating a current address).
Alternatively, you could choose to generate data for end_date
using the cross column Data Add generator.
By default, this will just add exactly 100 days to the start_date
, but behind this is a python script, and by customizing that code, you can exert more granular control over the relationship between the dates in these columns.
For a more detailed example on generating realistic text data, see Generating realistic dates using SQL Data Generator and Python.
Generating text (strings)
For textual columns, SDG offers an array of generic, personal, geographical and other generators.
As you can see, we can generate data by importing from a list, using a regular expression, using a SQL statements, using a Python script and more. Here, I’m only going to cover two of the most common: regexes and import from list. Many of the built-in generators for geographical, personal and other types of data use these techniques.
For more detailed examples on generating realistic text data, see:
- How to generate realistic text data using SQL Data Generator
- Generating test data with localized addresses using SQL Data Generator
- How to Complain Bitterly using SQL Data Generator
Generating random text using Regular Expressions (RegExs)
SDG often uses its core Regular Expressions generator to define the string data generation. There are some good examples of this in the Customer.Address
table, shown in Figure 9 above (before generating that figure, I swapped out the chosen generators for some columns in order to generate data that consistently used US-only addresses). The regex used for the ZIP Code generator is very simple, as shown in Figure 10.
The expression, [0-9]{5}
, simply means “pick a random number between 0 and 9, five times“. The regular expression used for this column is stored in an XML file in C:\Program Files (x86)\Red Gate\SQL Data Generator 3\Config (in this case called ZIP Code.xml).
We have other options on this screen:
- The Add Snippet… button offers various common regex “phases” as built-in snippets and can help speed up building these regexes
- The Insert File List… button allows us to import data from a file, then use it as part of the regex.
- The Insert Table Column… button allows us to include in the regex data from another column
The regexes for the address columns are more complex, but still reasonably straightforward once you know your way around them. Phil Factor’s article does a great job of explaining these regexes and showing how to customize them.
Try out what’s possible. For example, if you want buildings named after colors rather than presidents, in AddressLine1
, just swap the (Lincoln|Washington|Jefferson…) Building part of the regex with an insert from a file list: (($”Color.txt”) Building).
Once you’ve customized the regex exactly as you want, you can choose Save as, to save the custom form of this instance of the generator, so that you can reuse it in other projects.
It saves the XML file to a specific location (see Tools | Application Options to find out where). When you open a new project, you’ll see your custom version is available in the User Defined section.
Importing text data from a list
Another very common way to populate a text column, such as names of people, cities, countries and so on, is simply to import values from a list.
If you take a look in C:\Program Files (x86)\Red Gate\SQL Data Generator 4\Config, you’ll find all the built-in lists, but we can easily add our own custom lists of data.
Customizing data generation
As you examine the various columns, you’ll spot some cases where SDG “gets it wrong”. In the Customers
database, for example:
TypeofAddress
column inCustomer.AddressType
– SDG has chosen an AddressLine generator, but we probably just want to import appropriate values, such as ‘work’, ‘home’ or ‘delivery’, from a listTypeOfPhone
column inCustomer.PhoneType
– same problem as above. SDG has chosen a Phone Number generator, whereas we’re after values such as ‘Mobile Work’, ‘Office landline’ and so on.DiallingNumber
inCustomer.Phone
– this should just be a phone number.
If you don’t find what you need in one of the alternative built-in generators, you may need to write your own, or grab and customize an existing custom generator – a topic for another article!
Generating the fake but realistic data
Once we have defined how each column should generate data, we can generate data for the entire table. As discusses earlier, we need to decide how many rows are appropriate for each table, maybe 50K rows for Address
and Person
tables, 75K for Abode
and Phone
(to allow for people moving home, changing phone number) and as appropriate for the other tables.
Figure 12 shows a screenshot for a couple of the tables, with sample rows generated for the Abode
and Address
tables.
There are quite a few other considerations to take into account, which we won’t have time to explore in detail here, but for example:
Often, especially when loading a data warehouse, you’ll want to load some tables with data and not others. In such cases, you may need to disable the foreign key constraints and re-enable them after we finished loading the test data.
It’s also common to disable and re-enable indexes before and after data loads
If you use DELETE
triggers and you choose to have the tool delete rows when a truncate cannot be done, beware that the delete triggers will fire on every delete.
Summary
If you want to run database tests that extend beyond basic unit and integration tests, earlier in the development and testing lifecycle, but you can’t use real production data, then a tool like SQL Data Generator may provide a solution. It will allow you generate quickly data that reflects reasonably accurately the nature of your real production data, and in volumes sufficient to perform realistic data load and performance testing.
For more articles like this on SQL Data Generator, see: https://www.red-gate.com/hub/product-learning/sql-data-generator.
Originally published: 17/11/2016