Product articles SQL Data Generator Generating Realistic Test Data
Generating test data with localized…

Generating test data with localized addresses using SQL Data Generator

With basic grasp of regex expressions, a tool like SQL Data Generator can generate test data with addresses that are realistic for your part of the world.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

You want to run some database tests but you can’t use real production data because it contains personal data, including customer names, addresses, and so on. However, you do need to develop and test the database with data that reflects reasonably accurately the nature of your customer data. How do you generate data that gives you realistic-looking addresses for your part of the world?

With a little tweaking, and a basic grasp of ‘reverse regex’ expressions, a tool like SQL Data Generator (SDG) can do the job very well. As an example of what’s possible, in this article I’ll show how to generate realistic UK addresses.

SQL Data Generator out-of-the-box

For demonstrating the features of SDG in this article, all we need is a customer address table, which you can recreate using the following code:

Listing 1

We load SDG and see what it can attempt on its own. It has done quite well. It has seen the names and data types of each of the columns, and selected a built-in generator for each one:

Generating test data is a task that can be achieved in seconds

Figure 1

It hasn’t realized that you shouldn’t mix address types from different geographical areas; it’s mixed US-style address lines and cities, with UK counties and post codes. Some of this is easy to fix; just change the generator for Addressline2 to the built-in generator, UK Towns, which imports a list of UK towns, and likewise change the City generator from US City to UK City. If you’re somewhere else, it’s okay because you can load the addresses from a file. I’ll show you how in a later article.

Generating test data is made easier with customizable options

Figure 2

However, AddressLine1 needs a bit more work, so let’s see what we can do to improve it.

Introducing ‘reverse regex’ expressions

A regular expression is “a sequence of symbols and characters expressing a string or pattern” which we can use to search, or make replacements, within text strings. A reverse regex, therefore, is simply a string or pattern with which we can generate text. One of the main objectives of this article is to get you started with using ‘reverse regex’ expressions confidently, using the “generate UK addresses” use case as an example.

To see the ‘reverse regex’ that created the current column values, simply click on the AddressesLine1 column heading to highlight the column in light yellow:

Listing 2

Let’s tackle this reverse regex piece-by-piece, starting with the first expression, which generates the house number:

([1-9][0-9][0-9]?)

This means, phrase by phrase:

  • [1-9] means use a number between 1 and 9. Those square brackets are to denote a collection of characters and the ‘-‘ sign means ‘and all the ASCII characters between the two in the ASCII sequence’. This is similar to the LIKE expression in SQL Server.
  • [0-9] means use a number between 0 and 9
  • [0-9]? means use a number between 0 and 9 or nothing. The ? means ‘include or not’

Those brackets delimit the expression. You must do this for any regex expression. For example, (Yes|No) means “use either Yes or No”.

So the expressions as a whole, ([1-9][0-9][0-9]?), will generate a minimum or 2 digits, a maximum of 3, the first between 1 and 9, the second being between 0 and 9, and maybe a third digit between 0 and 9. If you scan the AddressLine1 column in Figure 2, example, you’ll see that is what we’ve got.

The next expression is:

(North |East |West |South |||||||||||)

This means “return the word North or East or West or South or nothing or nothing or … etc”. That | is an ‘or’ keyword. Why all the twelve ‘nothings’ (empty strings)? This is to skew the distribution so that there is only a 25 percent chance of having a word returned at all, and this has an equal chance of being North, East, West or South.

Following this expression is one that is very similar:

(Green |White |Rocky ||||)

Four chances in seven, you get nothing, otherwise it is ‘Green’, ‘White’ or ‘Rocky’.

The last two expressions are so similar that they can be described together. They are:

  • (Nobel|Fabien|Hague|Oak|Second|First|Cowley|Clarendon|New|Old|Milton)
  • (Way|Street|St.|Avenue|Road|Parkway|Freeway|Drive|Boulevard|Blvd.)

In each case, you get an equal chance of the listed alternatives. So that was what SQL Data Generator attempted. We can see if we can do better!

Customizing the first-line address using ‘reverse regex’ expressions

We can get the house number like this:

[1-9]{0,3}

This means “give me a numeric digit between 1 and 9, between none and three times”. This produces, on an example run:

634,,27,,473,5,754,132,282,,6,5,75,198,,56,35,537,75,49,22,141,693, 18, (etc.)

You can spot the problem. There is no zero, but if we put one in, there is a chance of a leading zero and this is, you’ll remember, a string.

If we put in a first character between 1 and 9, and maybe a second one and even maybe a third, between 0 and 9 that would give a better range:

[1-9][0-9]{0,2}

However, this means that every address would have a house number, whereas some houses have only names. We’ll fix that in a moment, but first, we ought perhaps to put in something extra for when a house is divided up into flats:

[1-9][0-9]{0,2}(a|b)?

This will append ‘a’ or ‘b’ to the end of the number 50% of the time, and nothing the other 50%.

These curly brackets can be very useful. For example:

  • [0-9]{5} means produce five digits between 0 and 9.
  • [0-9]{5,} means produce at least five digits, the upper limit being the width of the column
  • [0-9]{,5} means produce at the most five digits.
  • [0-9]{2,5} means produce at least two digits and at the most five.

Before we move on to tackle street names, we need to account for cases where we have only a house name, and also cases where we have named houses split into flats. The following expression will either produce a number between 1 and 9, possibly followed by ‘a’ or ‘b’, or nothing at all.

(([1-9](a|b)?) )?

Then we simply append some common house names. Easy!

((([1-9](a|b)?) )?(Orchard |Meadow |Rose |Holly |Oak |Willow |Sunnyside |Springfield |Highfield |Old School |Primrose |Mill |Yew Tree |Oaklands | Lilac |Honeysuckle |Hillside |Treetops |Woodside |Ivy |Woodlands |Red |White |Wayside |Granary |Lakeside |Stable | Haven |Fairview |Laurel |Thornfield |Hillcrest )(House |Cottage |Grange |Meadow |Orchard ||Barn |Mews )|([1-9][0-9]{0,2}(a|b)? ))

Now it’s time to add the street:

(North |East |West |South |Lower ||||)(Manor|Sebastobol|Wellington|High|Vicarage|Rectory|Cowley|Clarendon|New|Old|Milton) (Way|Street|St.|Avenue|Road|Parkway|Close|Drive|Lane|Park|Mews)

Here, we have the occasional qualifier of North, East, Lower, and so on. The |||| expressions are added to return empty strings of course, because we don’t always want these and the ? qualifier only reduces the chance to 50%.

Listing 3 shows the complete reverse regex expression that gives a reasonable attempt at a UK first-line address:

Listing 3

In SQL Data Generator, simply replace the existing expression for AddressLine1 with that in Listing 3. Figure 3 shows the improved data to be generated for the table:

sql-data-generator-3

Figure 3

Once you’re happy, don’t forget to save the SDG project, so you can reuse it later.

The reverse regex expression is long, but now we’ve got the hang of it, we can go on to improve it to read in data from a file instead of listing the data between ‘or’ symbols (|). This would enable us to choose from a wider range of values, or use some of the files that SQL Data generator provides.

The reverse regex syntax

So to sum up, here are the main features of reverse regex. There aren’t many, but what you get can be surprisingly versatile.

Syntax Example Generates
ordinary chars Tony Tony
[chars]character set [A-Z0-9] eg, 5 or G
individual chars [FM] F or M
initial ] in char set []] ]
[x-y] range [0-9] eg, 3 or 9
complement [^abc] eg, d or #
* zero or more abc* eg, abcccccc or ab
+ 1 or more abc+ eg, abcccc or abc
? Include or not abc? ab or abc
(regexp) grouping (abc)*d eg, abcabcd or d
{num} repeat a{4} aaaa
{min,max} repeat a{2,3} aa or aaa
{min,} at least min repeats a{3,} eg, aaa or aaaaaaaaaa
() empty string ()
| alternatives Yes|No Yes or No
Empty Alternative (|some-|often-)time eg, some-time or time
Escaped Character \\ \
Unicode character \u0021 !
Word Character \w Any character found in a word including numbers (equivalent to [A-Za-z_0-9])
Pick a line from a file $”NamesFirstFemale.txt” Daphne (imports the text file containing female first names.)
Pull a value at random from another column of the same table $[DateValidFrom] 12/3/2016

As you can see, SQL Data Generator is a fast, simple tool for generating test data. If you’ve not tried it before, take advantage of the 14-day, fully functional free trial.

Tools in this post

SQL Data Generator

Generate realistic test data for SQL Server databases

Find out more

SQL Toolbelt

The industry-standard tools for SQL Server development and deployment.

Find out more