
The Joy of Realistic Generated Fake Database Data

Have you ever had to demonstrate a database-driven application, and longed to have the real data to do so? To do what, precisely? Well, so you can then scroll through the customers who have used the system, demonstrate the accounting and audit, browse through the products, maybe even demo the customer tracking system with comments, complaints and so on. All this is possible, using realistic, fake data.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

What is fake data?

Fake data is sample data that looks as much as possible like real data but in fact has no relationship at all with the real data; it is generated entirely randomly. This type of development and test data is sometimes also called ‘spoof’ data, generated data, random data, or dummy data.

It isn’t just databases that need these sample data sets: most applications can only be delivered if the development team has access to realistic, but fake, data, and testers need sample test data too. The data must usually be reasonably consistent, in that zip or post codes relate to the correct areas, states, regions or cities, and it can be made to conform in size and distribution to the real data, especially in any indexed columns.

If you get it right, you can use this fake-but-realistic data in a variety of contexts. It is useful not just for creating a table of sample data, or inserting dummy data into a database, during development. In fact, not only development but also the vast majority of testing can use it, though there will always be a few tests that must be done on the live data, within the secure setting of the Staging environment. Other aspects of deployment, such as training, user-acceptance work and management presentations, can also use this fake data, if it is believable and consistent.

People don’t like simple-minded random gibberish, and are unsettled by data that isn’t realistic, such as a price that is obviously wrong or a product metric that seems unusual. Intellectually, they know the reason for it, but their gut instinct tells them it’s wrong. They generally hate sarcastic or silly data: for them, the proper use of data isn’t compatible with frivolity.

Anonymized or pseudonymized data

These alternative types of sample data are different in that they start from the real data and then either substitute, modify and scramble the data to the point that it is impossible to identify the data subject (anonymized), or simply replace only the directly identifying data elements with fake values (pseudonymized). Pseudonymized data comes with enough caveats that it’s not useful for most development work, but is essential for downstream reporting from PII data, and for providing research data. Anonymizing the data of a large or complex database can be a difficult task but, assuming the resulting data retains the general characteristics and overall distribution of the real data, can be useful for development and testing work. See: Masking Data in Practice for details.


Generally, it is a bad idea to develop applications using the real data, outside Staging, because it slows down development. I have, at times, had to work with highly sensitive data. It isn’t so easy. For a start, anyone working with highly sensitive information must be properly ‘security checked’. Once, I was told that an affable ex-cop private investigator had phoned my old colleagues, friends, even universities I’d attended. I was usually obliged to work within the data center in a concrete-walled office. I was once cheerfully told that, if fire ever broke out, it could be rapidly filled with CO2, lethal to all forms of life, even an earthworm. However, with entirely generated random data, one could work on the database at home, and leave the laptop on the train without worrying about a data breach.

Making data generation part of DevOps application development

Fake data is needed throughout the development process. In the early stages of development, web developers can, by using it, work independently on many of their tasks. Test engineers can develop their test processes and determine any special data requirements, for example for scalability tests (where the volume of data matters far more than its believability). It is best to have a scripted process for producing the fake data that grows as the data model for the application grows.

Rather than having different activities in a development project, such as testing and security, each devise their own fake or sample data, it generally makes sense to do it once, in a way that meets most requirements, because thorough and effective data generation takes time and effort. Whatever data platform you will ultimately use for your application, it is often useful to create the data set on SQL Server, or whatever platform already has the means to do it properly, and then port it to the intended platforms (see Realistic, simulated data for testing, development and prototypes).

By doing it once and doing it right, you can save a lot of delays and wasted time later during deployment and testing. It will help with user training for corporate applications as well. From SQL Server tables, you can export data in a rich variety of formats, for example as JSON documents for testing a document database such as MongoDB.
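To illustrate, here is a minimal sketch of such an export in Python; the connection string, table and column names are assumptions for the example, not anything prescribed by the article:

import json
import pyodbc  # assumes the pyodbc package and a SQL Server ODBC driver are installed

# Read a table and write each row out as a JSON document, ready for a
# document database such as MongoDB to import.
conn = pyodbc.connect("DSN=DevServer;Database=Sales;Trusted_Connection=yes")
cursor = conn.cursor()
cursor.execute("SELECT CustomerID, Name, City FROM dbo.Customers")
columns = [col[0] for col in cursor.description]
documents = [dict(zip(columns, row)) for row in cursor.fetchall()]
with open("customers.json", "w", encoding="utf-8") as f:
    json.dump(documents, f, default=str)  # default=str copes with dates and decimals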

Fake data for development and reporting

For development work, it’s best to have completely spurious fake data, generated randomly, even if only a small part of the real data, if available at all, is confidential or sensitive. Most of this fake data will be hosted in a database, though it is often possible to use other ODBC sources. Until the first release of an application, you don’t even have the alternative of using real data instead, even if it were a good idea.

You don’t necessarily need all the data from the start: data is often used to test just parts of a database. If you are, for example, developing a view or table-valued function, and the generated data doesn’t already exist, it makes sense to fill just the tables that the object depends on.

For developing reports for an existing application, it is sometimes sufficient to generate data for just those columns in a database that contain PII (see Pseudonymizing a Database with Realistic Data for Development Work). This will be the best way of providing data for downstream reporting systems such as OLAP, or for providing data for medical research.

This flavor of ‘pseudonymization’ is usually done when maintaining or enhancing an existing application, where it is important to keep the distribution of data the same as the live data; this is particularly true for performance testing. You leave the primary and foreign keys intact but change as much of the other data as possible, including all the sensitive and PII data. One must be careful to check that the result cannot be used for an inference attack. If, for example, you overwrite the name of a patient but leave the address, and it is a rare complaint, someone could use other public databases to narrow down the possibilities sufficiently to make an identification. Sometimes, you actually have a legal requirement to ‘pseudonymize’ data so that it can be used as evidence or for audit; here you need to obscure or change just part of the data but must leave intact whatever is required.
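As a sketch of the idea in Python (the table shape and the substitute name lists are illustrative, not from any real system):

import random

FIRST_NAMES = ["Alice", "James", "Priya", "Tomasz", "Mei"]
LAST_NAMES = ["Smith", "Okafor", "Nowak", "Garcia", "Chen"]

def pseudonymize(row):
    # Replace the directly identifying columns with random substitutes, but
    # leave the keys and the clinical data intact, so that joins and the
    # distribution of data still behave like the live database.
    masked = dict(row)
    masked["FirstName"] = random.choice(FIRST_NAMES)
    masked["LastName"] = random.choice(LAST_NAMES)
    return masked

patients = [{"PatientID": 1, "FirstName": "Mary", "LastName": "Jones", "DiagnosisCode": "M54.5"}]
print([pseudonymize(p) for p in patients])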

There are several techniques that can be adopted to protect data, when it is not possible, outside Staging, to use unmasked production data for running tests. The use of completely fake development data is just at one end of a spectrum of techniques that allows the development team to do as much testing as possible as early as possible in development.

How to go about generating data?

The main task of a SQL data generator is to generate realistic-looking data while maintaining the correct relationship between tables, as specified by the FOREIGN KEYs and PRIMARY KEYs. It needs to know the number of rows required for each table. For accuracy, you always need to specify the nature of the relationship between data entities such as customers and accounts, or publications and publishers; it can’t be divined from the metadata. Although it can guess at the nature of the data from the datatype and CHECK constraints, it needs help in getting the data right, especially the text data, such as names, addresses, or longer-form text such as notes or reviews, as well as other types of data such as dates, and so on.
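As a rough sketch of the principle, in Python (the table names, row counts and value ranges are invented for the example): generate the parent rows first, then have every child row pick its foreign key value from the parents that already exist, so that the constraint is satisfied by construction.

import random

def generate_customers(n):
    return [{"CustomerID": i, "Name": f"Customer {i}"} for i in range(1, n + 1)]

def generate_orders(n, customers):
    # Each order references an existing customer, so the FOREIGN KEY
    # relationship holds for every generated row.
    return [{"OrderID": i,
             "CustomerID": random.choice(customers)["CustomerID"],
             "Amount": round(random.uniform(5.0, 500.0), 2)}
            for i in range(1, n + 1)]

customers = generate_customers(100)        # parents first
orders = generate_orders(1000, customers)  # then the children that reference them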

Whatever tool you eventually use for generating data, you will need, for a start, lists for generating text-based data. It could be names of departments, typical first-line addresses, colors, first names, last names, names of fruit, countries, types of paint, monarchs of Britain, types of mammal, ad infinitum. It is always far simpler to allow the generator to pick a line from a file at random to fill a table column.
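In Python, picking a random line amounts to no more than this (the file name is an assumption; the list file holds one value per line):

import random

with open("FirstNames.txt", encoding="utf-8") as f:
    first_names = [line.strip() for line in f if line.strip()]

def random_first_name():
    return random.choice(first_names)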

If you are maintaining just one or two databases, getting lists of your domain data is unlikely to be a problem, because you will know a lot about your data domain: the AdventureWorks developers would, for example, get to know a lot about bikes, their brakes, and their pedals. If you are a seasoned database developer, you’ll have a large store of lists, or you’ll know where to get them. There are even books of lists from which you can source data.

That’s not enough, though. As well as creating valid credit card numbers or other IDs, you will probably need to be able to generate large blocks of text. The Waffle Generator is a typical example of this, and the approach was taken to SQL with the IT worker’s Excuse Generator. Far better than this is to allow the generating system to sample the real data and learn to mimic it realistically. To do this, it is possible to use Markov chains: I’ve described a way of using three-character trigraphs to generate spoof text, and in 2010 I showed how to generate text in a SQL-based application from large samples of text from books, using word trigraphs to produce parodies of Homer, Shakespeare and Mark Twain. Although I chose the examples because they amused me, the real objective was data generation: instead of Shakespeare, it might be reviews of wine, descriptions of clothing, or instructions for assembling wardrobes, or whatever is relevant to the application you’re developing.
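A minimal character-trigraph version of the Markov technique looks like this in Python (the training sample here is a stand-in; in practice you would feed it wine reviews, clothing descriptions, or whatever suits your application):

import random
from collections import defaultdict

def build_chain(sample):
    # Record which character tends to follow each two-character pair.
    chain = defaultdict(list)
    for i in range(len(sample) - 2):
        chain[sample[i:i + 2]].append(sample[i + 2])
    return chain

def generate(chain, length=80):
    out = random.choice(list(chain))  # start from a random two-character pair
    while len(out) < length and chain.get(out[-2:]):
        out += random.choice(chain[out[-2:]])
    return out

sample = "the quick brown fox jumps over the lazy dog while the slow red fox naps under the old oak"
print(generate(build_chain(sample)))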

These techniques can be used within SQL Data Generator, as I’ve shown in the article How to Complain Bitterly using SQL Data Generator. The tool allows you to do a lot of text generation by using a Reverse Regex interpreter that is surprisingly powerful.
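To show the flavor of reverse regex, here is a toy interpreter in Python for a tiny subset of the syntax. This is my own sketch, not Redgate’s implementation: it expands only literals, character classes and {n} repeat counts into random matching strings.

import random
import re

def expand_class(cls):
    # Expand ranges such as A-Z or 0-9 inside a character class.
    chars = []
    i = 0
    while i < len(cls):
        if i + 2 < len(cls) and cls[i + 1] == "-":
            chars.extend(chr(c) for c in range(ord(cls[i]), ord(cls[i + 2]) + 1))
            i += 3
        else:
            chars.append(cls[i])
            i += 1
    return chars

def reverse_regex(pattern):
    out = []
    for cls, count, lit in re.findall(r"\[([^\]]+)\](?:\{(\d+)\})?|(.)", pattern):
        if cls:
            chars = expand_class(cls)
            out.append("".join(random.choice(chars) for _ in range(int(count or 1))))
        else:
            out.append(lit)
    return "".join(out)

print(reverse_regex("[A-Z]{2}[0-9]{3}-[a-z]{4}"))  # e.g. 'QK472-mbfh'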

Consistency within a row

It is a mistake to try to tackle the problem of generating data without tying in all the associated data in the row. A female customer should not, for example, have a name in our data that is culturally associated only with men, such as ‘Nigel’. A price should be related to the product, and a date of death must come after the date of birth. An address must have a city in the state or region specified in that column, and the zip or post code must be right. Some relationships are subtle enough to need to be spelled out by a business analyst, whereas others are so obvious to any human that the rules are seldom written down.

To be able to start getting this right, you have to filter the type of content from which you make a random choice, and that filter will need to depend on the data already in another column. If, for example, the product code is for a suspension spring, the description should be for a suspension spring. To do this, the generator must at least be able to select an item from a list depending on a condition. In turn, this means that it must be possible to specify the order in which the columns are generated, so that the columns that determine the content of other columns are generated first. You might choose to select the country of an address first, for example, or the gender of the person. SQL Data Generator has a Python generator that allows you to reference other columns in the same row, and it looks after the problem of generating in the correct order.
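In Python, the dependency might look like this (the name lists are illustrative stand-ins for proper data files): the determining column is generated first, and the dependent column’s candidate list is filtered on it.

import random

NAMES = {"F": ["Margaret", "Aisha", "Sofia", "Hannah"],
         "M": ["Nigel", "Carlos", "Dmitri", "Kwame"]}

def generate_person():
    gender = random.choice(["F", "M"])         # the determining column first
    first_name = random.choice(NAMES[gender])  # the dependent column second
    return {"Gender": gender, "FirstName": first_name}

print([generate_person() for _ in range(3)])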

Verisimilitude in foreign keys

The foreign key references are the hardest to get right. Taking the previous example of the suspension spring, you’d be likely to have a foreign key reference to a current price, or a supplier. These must seem to be right too. I once demonstrated a website ‘shopping’ application for a user who spotted a wild price for a product. He never really recovered his equilibrium for the rest of the demonstration.

Sometimes the science of data generation goes over to the dark side, and is used for faking scientific experiments, generating spam, or giving spurious accuracy and completeness to hurried or botched research. It is this sort of inconsistency that finds the fakers out, and it illustrates that consistency is the hardest part of generating data. NHS hospitals, for example, must report on outcomes, and it is surprising how much gynecological intervention and surgery is apparently done to men. For a statistician, it is fascinating to see how fraud can be detected in generated data.
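One classic statistical check, to give a concrete example of my own (it is not from the articles referenced here): naively generated figures tend to be uniform in their leading digit, whereas many real-world quantities follow Benford’s law, under which small leading digits are far more common.

import math
import random
from collections import Counter

def leading_digit_distribution(values):
    # Tally the first significant digit of each value.
    digits = [int(str(abs(v)).lstrip("0.")[0]) for v in values if v]
    counts = Counter(digits)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}

benford = {d: math.log10(1 + 1 / d) for d in range(1, 10)}
fake = [random.uniform(1, 9999) for _ in range(10_000)]  # naive uniform data
observed = leading_digit_distribution(fake)
for d in range(1, 10):
    print(d, f"Benford {benford[d]:.3f}", f"observed {observed[d]:.3f}")

The naive generator gives each leading digit a roughly equal share, where Benford predicts about a 30% share for the digit 1; it is exactly this sort of discrepancy that gives fabricated figures away.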

Automating data generation

Although it pays to get a system in place at the start of development, the system that you use must be sufficiently flexible to allow you to keep the data in sync with changes to the metadata. As you make changes to tables, you need to change the data to match. This means that each version of a database should have its own data, or else you must work exclusively by migration scripts rather than build scripts.

I show how to automate SQL data generation in my article, Automatically filling your SQL Server test databases with data. To see how to do this for a whole range of development machines, see Automatically build-and-fill multiple development databases using PowerShell and SQL Data Generator. Basically, you use SQL Data Generator interactively whenever you need to change the generation settings, and then use a script to fill all the necessary databases, using the SQL Data Generator project file and changing the target as necessary. This way, the actual filling of a database can be done automatically after a build, as part of a CI process, even if the preparation of the new version is done outside office hours.

My article, How to create and refresh development and test databases automatically, using SQL Clone and SQL Toolbelt, shows how to use PowerShell to build a database from object-level source, stock it with data, document it, and then provision any number of test and development servers using SQL Clone.

Conclusions

Society at large is increasingly disinclined to allow personal data to be used for developing applications. The task of creating fake, random or sample data is tedious and might seem unproductive, but once it becomes easy to provide all developers with as much as they need of all the types of data in the application domain, development work can be far more rapid. From my own long experience of leading application development projects, I like to avoid any issues with compliance, data protection or security: timely development is hard enough without the extra meetings and paperwork. If you no longer need any production data for development work, then life becomes less complicated.

The same goes for the problems of providing data for reporting or analysis. If it can be successfully anonymized, it makes life easier for the people charged with data governance in the organization.

Tools in this post

SQL Data Generator

Generate realistic test data for SQL Server databases

SQL Toolbelt

The industry-standard tools for SQL Server development and deployment.
