19 December 2016
19 December 2016

How to generate realistic text data using SQL Data Generator

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

SQL Data Generator (SDG) is very handy for making a database come alive with what looks something like real data, and, once you specify the empty database, it will do its level best to oblige. To get the best results though, you need to provide SDG with some hints on how the data ought to look.

Let’s say you have a column in a table that contains text, and you need to test out your database. For most requirements, you can get away with any text, but sometimes you may need to do something smarter to generate text that looks like genuine text your customers or end users would enter.

In this article, we’ll look at some of the easier ways of filling text, using SDG’s built-in data generators. We will use as an example the Customers database, essentially a Name and Address (NAD) database which has a table that holds notes that apply to a person:

SQL Data Generator 1

Figure 1

The SQL String (ASCII) generator

When you know nothing about the type of data that is to go in a string, you can always use the string (ASCII) generator, as shown in Figure 2:

SQL Data Generator 2

Figure 2

This at least will fill the table, but really one is left with the feeling that it is possible to achieve something far more like live data. In our case, we want words, and we want to check our search function, so the words will need to be meaningful.

‘Id eudis quo’ generator

One’s next reaction is to use Lorem Ipsum (sections 1.10.33 of Cicero’s ‘De finibus bonorum et malorum‘). After all, that is what it is for, isn’t it? SQL Data Generator obliges. In the ‘Shopping’ category, you’ll find the Description – Id eudis quo built-in generator.

In fact, this is the generator SDG chooses by default for this Note column. It takes a random chunk of text from whatever you paste in. It has Latin text built-in, ‘Id eudis quo linguens imaginator pars fecit. Et quad estis …‘ and so on, but I’ve pasted Cicero’s original text in over the top:

SQL Data Generator 3

Figure 3

We have achieved something that looks like words, but this actually doesn’t look much like the sort of contents of a note about a customer, does it? I know from experience this would distract people doing user-acceptance testing (UAT). You can, of course, use English text but it is not a lot better. Try it, by pasting samples of text into the text window to convince yourself.

The Text Shuffler

We have other alternatives, such as the Generic Text Shuffler generator. Paste in a block of text, and it just picks words out of it at random. This is surprisingly effective. Here, we take the first half of the King James Bible’s translation of the book of Jeremiah. The result is quite a good facsimile of the text:

SQL Data Generator 4

Figure 4

Naturally, we are not suggesting you use this text, but maybe a large pool of text from your production data, only using the common words. Again, it is useful to try out samples of text to see what happens.

The File List generator

We can collect a whole lot of samples into a text file and use that via the File List generator. To try this out, I took the text of ‘Frankenstein’, used a couple of nifty regex strings to put a sentence on each line and remove any line less than forty characters long, and then saved it in a text file called FrankensteinSentences.txt.

SQL Data Generator 5

Figure 5

If you’d like to try it for yourself, download the file, and place it in C:\Program Files (x86)\Red Gate\SQL Data Generator 3\Config, along with all the built-in files. Then simply select the File List generator and specify this file:

SQL Data Generator 6

Figure 6

Ah. This is looking a lot better. It has pulled in the file, and for each row of the column has inserted a sentence from Frankenstein. Naturally, we wouldn’t insist on Frankenstein, but sample anonymized data from your application would do fine.

The CSV generator

Of course, rather than using Frankenstein text, you’re more likely to achieve verisimilitudinous text by reusing some customer comments, which you’ve likely got stored in an Excel spreadsheet somewhere (relax, these ones aren’t real):

SQL Data Generator 7

Figure 7

So just save these in the same style as the attached CSV file and, in SDG, select the CSV generator and choose the file. The CSV File Data Import Settings form will then pop up:

SQL Data Generator 8

Figure 8

Make sure Skip first row is checked, so we don’t import the column headings, then configure as appropriate for the format of the data you want to import, and click Next. We only want to import the Comment column (Column4) so highlight it and click Finish:

SQL Data Generator 9

Figure 9

So now you will see that you are home and dry. All you need to do is to decide whether you want NULLs and whether you want to repeat the collection of testimonials to fill all the data, shuffling them as you do so:

SQL Data Generator 10

Figure 10

Summary

We have achieved text data that will even be good enough for user-acceptance testing. It isn’t perfect, of course, because in a large database, these phrases are repeated and we don’t necessarily want that.

It is also likely to produce data that is skewed from your production data. You don’t want that if you are doing performance testing because SQL Server is likely to produce a different execution plan.

Is it possible to do better? Oh yes. We can produce a Python expression, a SQL function or even a custom generator to do something much better. However, that will have to wait for another article.

To find out how you can generate realistic test data fast, download a free, fully-functional 14-day trial of SQL Generator.

Tools in this post

SQL Data Generator

Create large volumes of realistic test data with a couple of clicks in SQL Server Management Studio.

Find out more

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

Related posts

Also in Hub

Documenting your Database with SQL Change Automation

It would be wrong to portray SQL Change Automation (SCA) as being suitable only for epic project deployments, of the sort described in my previous article. It can do smaller tasks as well. To demonstr...

Also in SQL Data Generator

Pseudonymizing your data with SQL Data Generator

Imagine that you are the CIO of AdventureWorks. Out of a blue sky comes an order from Taxman that you supply details of all your sales, along with the tax charged to your customers. They want to be ab...

Also in Product learning

SQL Prompt Code Analysis: INSERT INTO a permanent table with ORDER BY (PE020)

The SQL query that is used to produce the result that is inserted into the permanent table has its order specified by an ORDER BY statement. Relational tables are not ordered, so the ORDER BY is meani...

Also about SQL Data Generator

Protecting production data in non-production environments

A traditional IT problem in many organizations is that the development and operations teams each work in their own silo, and each tends to regard the other as having different, and often conflicting, ...