How to generate realistic text data using SQL Data Generator
To get the best results, you need to provide SQL Data Generator with some hints on how the data ought to look. Phil Factor nudges it towards realistic text data.
SQL Data Generator (SDG) is very handy for making a database come alive with what looks something like real data, and, once you specify the empty database, it will do its level best to oblige. To get the best results though, you need to provide SDG with some hints on how the data ought to look.
Let’s say you have a column in a table that contains text, and you need to test out your database. For most requirements, you can get away with any text, but sometimes you may need to do something smarter to generate text that looks like genuine text your customers or end users would enter.
In this article, we’ll look at some of the easier ways of filling text, using SDG’s built-in data generators. We will use as an example the Customers
database, essentially a Name and Address (NAD) database which has a table that holds notes that apply to a person:
The SQL String (ASCII) generator
When you know nothing about the type of data that is to go in a string, you can always use the string (ASCII) generator, as shown in Figure 2:
This at least will fill the table, but really one is left with the feeling that it is possible to achieve something far more like live data. In our case, we want words, and we want to check our search function, so the words will need to be meaningful.
‘Id eudis quo’ generator
One’s next reaction is to use Lorem Ipsum (sections 1.10.33 of Cicero’s ‘De finibus bonorum et malorum‘). After all, that is what it is for, isn’t it? SQL Data Generator obliges. In the ‘Shopping’ category, you’ll find the Description – Id eudis quo built-in generator.
In fact, this is the generator SDG chooses by default for this Note column. It takes a random chunk of text from whatever you paste in. It has Latin text built-in, ‘Id eudis quo linguens imaginator pars fecit. Et quad estis …‘ and so on, but I’ve pasted Cicero’s original text in over the top:
We have achieved something that looks like words, but this actually doesn’t look much like the sort of contents of a note about a customer, does it? I know from experience this would distract people doing user-acceptance testing (UAT). You can, of course, use English text but it is not a lot better. Try it, by pasting samples of text into the text window to convince yourself.
The Text Shuffler
We have other alternatives, such as the Generic Text Shuffler generator. Paste in a block of text, and it just picks words out of it at random. This is surprisingly effective. Here, we take the first half of the King James Bible’s translation of the book of Jeremiah. The result is quite a good facsimile of the text:
Naturally, we are not suggesting you use this text, but maybe a large pool of text from your production data, only using the common words. Again, it is useful to try out samples of text to see what happens.
The File List generator
We can collect a whole lot of samples into a text file and use that via the File List generator. To try this out, I took the text of ‘Frankenstein’, used a couple of nifty regex strings to put a sentence on each line and remove any line less than forty characters long, and then saved it in a text file called FrankensteinSentences.txt.
If you’d like to try it for yourself, download the file, and place it in C:\Program Files (x86)\Red Gate\SQL Data Generator 3\Config
, along with all the built-in files. Then simply select the File List generator and specify this file:
Ah. This is looking a lot better. It has pulled in the file, and for each row of the column has inserted a sentence from Frankenstein. Naturally, we wouldn’t insist on Frankenstein, but sample anonymized data from your application would do fine.
The CSV generator
Of course, rather than using Frankenstein text, you’re more likely to achieve verisimilitudinous text by reusing some customer comments, which you’ve likely got stored in an Excel spreadsheet somewhere (relax, these ones aren’t real):
So just save these in the same style as the attached CSV file and, in SDG, select the CSV generator and choose the file. The CSV File Data Import Settings form will then pop up:
Make sure Skip first row is checked, so we don’t import the column headings, then configure as appropriate for the format of the data you want to import, and click Next. We only want to import the Comment column (Column4) so highlight it and click Finish:
So now you will see that you are home and dry. All you need to do is to decide whether you want NULLs and whether you want to repeat the collection of testimonials to fill all the data, shuffling them as you do so:
Summary
We have achieved text data that will even be good enough for user-acceptance testing. It isn’t perfect, of course, because in a large database, these phrases are repeated and we don’t necessarily want that.
It is also likely to produce data that is skewed from your production data. You don’t want that if you are doing performance testing because SQL Server is likely to produce a different execution plan.
Is it possible to do better? Oh yes. We can produce a Python expression, a SQL function or even a custom generator to do something much better. However, that will have to wait for another article.
To find out how you can generate realistic test data fast, download a free, fully-functional 14-day trial of SQL Generator.