Data generation is the science and art of providing data for database development work that is as realistic and controllable as possible. The skills of generating realistic data are an essential part of being a database developer. It is important: you need plenty of data of exactly the right type, size and verisimilitude in order to develop and test a database. It has got to look right. If you are launching a corporate, data-driven application, you’ll need it for training as well as testing and development.
In the most active phase of development of the database, before the initial release, you can’t take the “easy option” of using the live production data because there isn’t any. When upgrading a production system, you can’t always use live data either: it is likely to be too large to be useful, and it will probably contain information that you can’t legally use until it has been anonymized, because it is private or sensitive.
The generation of data can be scripted, but it isn’t easy. The process seems easy at first glance, but the reality is different. Several data generation tools exist to provide development data, but you need to be sure of your requirements before selecting a particular tool.
Why bother with data generation?
You might think that it is adequate to put random strings, numbers, binary objects and the like into your database. There are four good reasons why completely random data will generally be insufficient.
Firstly, a properly designed database will protect itself from being infected with bad data, such as invalid or impossible dates, prices out of range, invalid keys, nonsense addresses and improbable names. Whatever constraints are in place, the data you generate must conform to them. If, in apparent contradiction of this, all your nonsense data, or malicious data, can be inserted into the database without an error being triggered, your database hasn’t been designed properly. (One might think that FirstName: #onc=2ldLDe3icnad LastName: Smith would be perfectly acceptable if that is the name that poor 2L was saddled with. However, a database must be protected from a string like this if it is inserted maliciously or through data corruption, because it can cause problems or be part of an attempt at penetration. Also, there are inevitably hidden rules about data that require a constraint: a column such as a product code, for example, is likely to be constrained by all sorts of hidden rules, as the sketch after this list of reasons illustrates.)
Secondly, even if you were so ill-advised as to disable the check constraints, nonsense data still wouldn’t be much help for development work. You will run tests to make sure the SQL queries work, and you need sensible data to make sure that the results are correct. Spotting errors will be far simpler if your list of names and addresses, a current shopping basket or a parts list looks right.
Thirdly, if your data isn’t realistic, its distribution will be different from that of the real data. This means that the query plans used by the database will be entirely different from the ones used once the real data is in place. SQL that runs fast in development may crawl once it is released.
Fourthly, the users of a corporate database application will need training and, from experience, I know that it is important for them to have data that looks right. Users get very distracted by details that seem wrong, such as a malformed part-number, whereas a lay person wouldn’t notice.
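To make the point about hidden rules concrete, here is a minimal sketch, using an entirely hypothetical product table, of the sort of check constraint that generated data has to satisfy before it even reaches the table:

```sql
-- A hypothetical product table whose 'hidden rule' is that a product code
-- must be two letters, a hyphen and four digits (e.g. 'AB-1234').
CREATE TABLE #Products
  (
  ProductID INT IDENTITY PRIMARY KEY,
  ProductCode VARCHAR(10) NOT NULL
    CONSTRAINT ProductCode_Format
      CHECK (ProductCode LIKE '[A-Z][A-Z]-[0-9][0-9][0-9][0-9]'),
  ProductName NVARCHAR(50) NOT NULL
  );

INSERT INTO #Products (ProductCode, ProductName)
  VALUES ('AB-1234', N'Widget');        -- accepted
-- The next insert would be rejected by the check constraint
-- (underscore where the hyphen should be):
-- INSERT INTO #Products (ProductCode, ProductName)
--   VALUES ('XX_1234', N'Dongle');
```

Any data generator that doesn’t know about this rule will produce rows that simply bounce off the table.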
Why persist with data generation?
It always catches me by surprise when people want to use production data for database development work. I’ve heard developers argue that it is the only way to investigate certain bugs. Well, it is rare to require this, but when I’ve had to do it in a corporate setting, I got the required security clearance and did the work on a restored backup in the staging environment within the DMZ, working within the Ops team. All essential IT operations can be achieved with care and ingenuity.
You will have heard of clever techniques that allow you to use production data, such as subsetting and pseudonymization. Subsetting can work, but it is only reliable when heavily configured to provide the design details of the database that aren’t reflected in the metadata. Pseudonymization is prone to certain vulnerabilities, such as inference attacks, data leakage or pattern analysis. For this reason, it cannot be used without encryption at rest.
Another important reason for continuing with data generation is that it helps enormously with testing. In fact, I’ve never seen an effective database test cycle being done without the use of generated test data sets, to check on the scalability of the database design. There is a great pleasure in spinning up a development database with as much, or as little data as you need without a thought or concern about security. With entirely-generated data, you can even work on the database at home.
One benefit of data generation is that it provides for tests where there is unlikely to be any production data at all. Almost all database systems have data entry and data import, and these must be tested during development. This can involve simulating application-fed data or bulk entry from an upstream system. However it is fed in, the data needs to look right and successfully pass the validations that the database provides.
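As an illustration, here is a minimal sketch, with a hypothetical staging table and file path, of exercising the bulk-import route with a generated CSV file:

```sql
-- Load a file of generated customer rows into a staging table, just as the
-- upstream feed would. The table name and file path are hypothetical.
BULK INSERT dbo.Customer_Staging
  FROM 'C:\GeneratedData\Customers.csv'
  WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', TABLOCK);
```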
One aspect of data generation is often neglected: the generation of input data that is deliberately designed to test a database’s constraints. This type of dataset is part of the tester’s armoury. It tests for inputs that, by intention or carelessness, are likely to cause problems for a database. There are many lists, collected over the years, of strings that could cause chaos within the application, or even security breaches, if accepted into the database; the best is probably the ‘big list of naughty strings’. A simple harness for this sort of test is sketched below.
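Here is a minimal sketch of such a test, assuming a hypothetical dbo.Customer table as the target: each awkward string is offered to the table in turn, and the outcome is recorded rather than allowed to halt the run.

```sql
-- Try each 'hostile' string against the table's constraints and record
-- whether it was accepted or rejected. In practice the strings would come
-- from a list such as the 'big list of naughty strings'.
DECLARE @NaughtyStrings TABLE (TheString NVARCHAR(200));
INSERT INTO @NaughtyStrings (TheString)
  VALUES (N''), (N'NULL'),
         (N'Robert''); DROP TABLE Students;--'),
         (N'<script>alert(0)</script>'),
         (NCHAR(8238) + N'gnp.exe');        -- right-to-left override character

DECLARE @Results TABLE (TheString NVARCHAR(200), Outcome NVARCHAR(4000));
DECLARE @String NVARCHAR(200);
DECLARE NaughtyCursor CURSOR LOCAL FAST_FORWARD FOR
  SELECT TheString FROM @NaughtyStrings;
OPEN NaughtyCursor;
FETCH NEXT FROM NaughtyCursor INTO @String;
WHILE @@Fetch_Status = 0
  BEGIN
    BEGIN TRY
      INSERT INTO dbo.Customer (FirstName) VALUES (@String); -- hypothetical target
      INSERT INTO @Results VALUES (@String, N'accepted');
    END TRY
    BEGIN CATCH
      INSERT INTO @Results VALUES (@String, N'rejected: ' + Error_Message());
    END CATCH;
    FETCH NEXT FROM NaughtyCursor INTO @String;
  END;
CLOSE NaughtyCursor;
DEALLOCATE NaughtyCursor;
SELECT TheString, Outcome FROM @Results;
```

The interesting rows are the ones marked ‘accepted’: those are the strings your constraints let straight through.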
Data Generation challenges
Data can be entirely generated. There are complications, though.
The biggest complication, and one that makes entirely-automated data generation almost impossible in many databases, is that any properly-devised database will have column-level and table-level check constraints to ensure that there is as little bad data as possible in the database. This means that a tool such as SQL Data Generator must examine the check constraints to gauge the rules they specify. This is tricky enough when only SQL expressions can be used in check constraints, but some RDBMSs, such as SQL Server, allow functions to be included in these expressions. These must be reverse-engineered to understand the rules that must be followed for data to be generated successfully. Even then, it is quite possible that a trigger is used to enforce data rules as well. I’ve even known this to be done only within the application. However, a database with few constraints, usually because it relies on the applications and input feeds to enforce the rules of the data, is a danger to everybody using it: it is subject to security vulnerabilities, orphaned records, duplicate entries, and invalid relationships. A database without check constraints, in particular, cannot easily enforce business rules about its data.
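Here, for instance, is a minimal sketch of the sort of query that lets you, or a tool, see what rules have to be satisfied; it simply lists every check constraint in the database with its definition:

```sql
-- List every check constraint, the table and column it applies to, and the
-- SQL expression that generated data will have to satisfy.
SELECT Object_Schema_Name(cc.parent_object_id) + '.'
         + Object_Name(cc.parent_object_id) AS TheTable,
       Coalesce(Col_Name(cc.parent_object_id, cc.parent_column_id),
                '(table-level)') AS TheColumn,
       cc.name AS TheConstraint,
       cc.definition AS TheRule
  FROM sys.check_constraints AS cc
  ORDER BY TheTable, TheColumn;
```

Of course, reading the definitions is the easy part; working out how to generate data that satisfies them is where the effort goes.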
A second complication of generating data is that the data in different columns is often interrelated. A start date, for example, is invariably earlier than a termination date. Although things are more complicated nowadays, personal pronouns are generally related to first names in Name-and-Address (NAD) tables. Prices of products are related to the cost of manufacture or supply. Some types of products are invariably cheaper than others: hardback books are pricier than softcover books. Many of these rules require specialised knowledge of the domain, though they can be tested by table-level check constraints and triggers.
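To take the start-date and termination-date example, here is a minimal sketch of one way of keeping the two columns consistent while still randomising them: generate the start date first and derive the termination date from it.

```sql
-- Ten random start dates within the past twenty years, each with a
-- termination date guaranteed to be between 30 days and ten years later.
SELECT Convert(DATE, StartDate) AS StartDate,
       Convert(DATE, DateAdd(DAY, 30 + Abs(Checksum(NewId())) % 3650, StartDate))
         AS TerminationDate
  FROM (SELECT TOP (10)
               DateAdd(DAY, -(Abs(Checksum(NewId())) % 7300), GetDate())
          FROM sys.objects) AS Starts (StartDate);
```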
An interesting complication of data generation is the relative frequency of the various types of entity being recorded. Sometimes these different types aren’t even defined explicitly. A NAD database used in healthcare could, for example, easily have a predominance of one gender over the other without a column that defines this. When filling in the foreign key fields, this can cause problems.
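A minimal sketch of the usual trick, assuming an invented 70/30 split, is to use a weighted random choice rather than a uniform one:

```sql
-- Roughly 70% of the generated rows get 'F' and 30% get 'M', to mimic a
-- skewed population rather than an unrealistic 50/50 split.
SELECT TOP (1000)
       CASE WHEN Abs(Checksum(NewId())) % 100 < 70 THEN 'F' ELSE 'M' END
         AS AssumedGender
  FROM sys.all_columns;
```

The same weighting approach works when choosing which foreign key values to assign, so that the relative frequencies of the different entity types look plausible.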
Text generation
My main inspiration over the years for the generation of text has been the waffle generator, which was later surpassed by the Postmodernism generator and the ArtyBollocks generator (used for writing statements about art). I illustrate the approach in Getting your SQL Server Development Data in Three Easy Steps, where I provide a way of generating plausible text in PowerShell for use with SQL Data Generator. To do it, I generate banks of data using randomisation. There are also tools that do this using machine learning, built on frameworks such as TensorFlow, and OpenAI’s GPT models are becoming increasingly versatile and can be fine-tuned for specific tasks, including generating diverse and contextually relevant text for databases.
My GitHub project ‘LanguageSpoofing‘ shows how to generate any type of text using either SQL or PowerShell. The process is explained in this article on SQLServerCentral, Spoofing Data Convincingly: Text Data. This Markov technique can be used to create large amounts of text, maybe whole books, in any particular style, as I demonstrate in The Parodist: A SQL Server Application. The generation process can be pretty simple where there is little need for complex sentences, as I show in The Ultimate Excuse Database, written in SQL, and in the C# version in ‘The Waffle Generator’.
Data can also be generated by using real data as inspiration. The classic way of doing this is with Markov tables, as I describe in ‘Spoofing Data Convincingly: Altering Table Data’.
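As a minimal sketch of the first step of that technique, assuming a hypothetical dbo.Customer table as the source, you can reduce an existing column to a frequency table of three-character sequences, which can then be used to assemble plausible spoofed values:

```sql
-- Build a table of trigraphs (three-character sequences) and their
-- frequencies from an existing column of surnames.
WITH Numbers AS
  (SELECT TOP (100) Row_Number() OVER (ORDER BY (SELECT NULL)) AS n
     FROM sys.all_columns)
SELECT Substring(c.LastName, n.n, 3) AS Trigraph,
       Count(*) AS Frequency
  FROM dbo.Customer AS c
    INNER JOIN Numbers AS n
      ON n.n <= Len(c.LastName) - 2
  GROUP BY Substring(c.LastName, n.n, 3)
  ORDER BY Frequency DESC;
```

Strings built by chaining these trigraphs together, weighted by frequency, look and sound like the originals without being any of them.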
String generation
As far as I know, the Redgate SQL Data Generator (SDG) invented the idea of a Regex expression that generates, rather than matches, text. Obviously, one cannot use a standard Regex engine to do this; the one in SDG was written from scratch. Once you’ve made the mental leap to understand the principle, it is an easy way of generating short sentences, names or text. Its limitation at the moment is that it isn’t recursive, whereas real human language often is.
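You can get a feel for the principle with a minimal sketch in T-SQL: rather than a full reverse-Regex engine, it just assembles a string by picking random characters from the sets that a simple pattern such as [A-Z][A-Z]-[0-9][0-9][0-9] would describe.

```sql
-- Generate strings of the form 'two capital letters, a hyphen, three digits'
-- by choosing randomly from the character sets the pattern describes.
DECLARE @Letters CHAR(26) = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ';
SELECT TOP (10)
       Substring(@Letters, 1 + Abs(Checksum(NewId())) % 26, 1)
       + Substring(@Letters, 1 + Abs(Checksum(NewId())) % 26, 1)
       + '-'
       + Right('000' + Convert(VARCHAR(3), Abs(Checksum(NewId())) % 1000), 3)
         AS GeneratedCode
  FROM sys.objects;
```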
Generating numbers and dates
You can make random data conform to the distribution of real data by using a ‘pocket’ technique that I describe in Spoofing Data Convincingly: Masking Continuous Variables. Basically, the technique allows you to make the distribution of your generated data conform to the distribution of a sample. Sometimes, you find that data conforms to one of the common distributions. Numerical data usually isn’t uniformly random in its distribution but conforms to something like the Gaussian (Normal) or binomial distribution. There are plenty of others, such as the Laplace and Weibull distributions, and then there is the Pareto distribution, which corresponds to the way wealth tends to be distributed. If it is a normal distribution, this is easy to generate (see Getting normally-distributed random numbers in TSQL).
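As a minimal sketch of the normal case, here is the Box-Muller transform in T-SQL: two uniform random numbers become one value drawn from a Gaussian distribution with whatever mean and standard deviation you choose.

```sql
-- Ten values from an (approximately) normal distribution with mean 100 and
-- standard deviation 15, via the Box-Muller transform.
DECLARE @Mean FLOAT = 100, @StdDev FLOAT = 15;
SELECT TOP (10)
       @Mean + @StdDev
         * Sqrt(-2 * Log(1 - Rand(Checksum(NewId()))))   -- 1 - Rand() avoids Log(0)
         * Cos(2 * Pi() * Rand(Checksum(NewId()))) AS NormalValue
  FROM sys.objects;
```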
Generating Image data, structured documents and Blobs
Databases will often have data types that will stop even the keenest data generators dead in their tracks. JSON and XML documents in tables can spell trouble because they can signal the presence of semi-structured data that doesn’t fit easily into the relational model but must somehow be accommodated. Generating this sort of data is difficult unless it is accompanied by a schema. Spatial and geographical data will also give problems without a good understanding of the geocoding system, such as the coordinate system, that is being used. Each of these problems has a separate solution, but all these special data types require the data generation tool to be particularly versatile.
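Where the structure is known, a minimal sketch of one approach, using invented names and values, is to generate the relational values first and then compose the document from them, so that it always has the expected shape:

```sql
-- Compose a JSON document from generated values using FOR JSON, so that the
-- 'semi-structured' column always matches the structure the application expects.
SELECT (SELECT 'AB-' + Convert(VARCHAR(4), 1000 + Abs(Checksum(NewId())) % 9000)
                 AS productCode,
               Round(5 + Rand(Checksum(NewId())) * 95, 2) AS price,
               Convert(VARCHAR(10), GetDate(), 23) AS listedOn
          FOR JSON PATH, WITHOUT_ARRAY_WRAPPER) AS ProductDocument;
```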
Conclusions
The entire database development and testing cycle requires a series of datasets that can be used to run tests, to engage in test-driven development, and to do database development easily and quickly. These datasets must be as realistic as possible, even for the more esoteric data types. Nowadays, there are few alternatives to generating data, particularly before the first release of a database application.
How, then, can you generate data? A tool that generates data is fine to use, just as long as you are prepared for a lot of configuration work to provide the full domain language and to get around some of the difficulties I’ve listed in this article. I’ve never used an entirely homebrew solution. I use SQL Data Generator, but with the support of plenty of extra data files, expressions and scripts. These are generated mostly from scripts, but I’m an avid collector of lists of data that could come in useful, if only as source material for Markov chains. The tool will always be a blessing because it deals with the foreign key relationships and the actual bulk import, which are both otherwise tedious.