Product articles SQL Data Generator Generating Realistic Test Data
How to Complain Bitterly using SQL Data…

13 December 2019

1 Comment

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

13 December 2019

1 Comment

How to Complain Bitterly using SQL Data Generator

How to use reverse regex expressions in SQL Data Generator to fill a text column in SQL Server with fake, but very realistic complaints.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

SQL Data Generator gets a lot of use in shops that need a lot of data but can’t, for legal reasons, use real customer data in development work. Until you’ve used the tool, it’s easy to dismiss it as only being useful for generating random dates and numbers, but with a bit of ingenuity, it can be used to do quite complicated tasks with text too.

My article, How to generate realistic text data using SQL Data Generator, shows various ways to produce text for a Note field in my Customers database, ranging from a basic Lorem Ipsum generator, to use of more versatile file list and CSV generators. They show the general principles, but what if you need something more realistic? Something, maybe that you could use for user training or for Acceptance Testing.

Here, I’m trying a different approach, generating the text using SQL Data Generator’s Regex Generator, with some reverse regex expressions. The result if often very useful for testing out index strategies for doing fast searches. I’ve previously explained how these expressions work, and how to use them to generate realistic addresses. Here I’m going to use them to emulate the typical, often quite bitter, complaints one reads on the various vendor websites, along with some suitably soothing responses.

You can start off by using literal strings in the reverse regex expressions, but this approach gets awkwardly long after a while, so I’ll show how to use reverse regex expressions that import lines of text from files.

If you want to play along, you can download the latest incarnation of my Customers build script at the bottom of the article, along with some example text files, from which we can generate some quite elaborate and varied complaints and responses.

Using literals with the Regex Generator

If the text doesn’t need to be entirely authentic, but good enough for user-acceptance tests and demos, then you can generate text by combining literal strings in imaginative ways, employing various elements of the reverse regex syntax.

Or (|)

You can make an expression that constructs simple sentences from one of several possible dependant clauses, just by putting them between brackets and putting the ‘or’ symbol (‘|’) between each one. For example, if you want to generate fake delivery promises to a customer, you’d start out by using:

I can assure you that it will be dispatched on (monday|tuesday|wednesday|thursday|friday)

To see the results, just choose the Regex Generator for the Customer.Note column, and pop this into the Regular Expression box:

It looks a bit mechanical, so we need to add a few flourishes. You can nest these expressions as well, like this:

I (can (assure|promise|guarantee to|warrant to) you that it will be| have (already|) (requested|mandated|arranged|asked) for it to be|have( already|) ensured that it is) dispatched on (monday|tuesday|wednesday|thursday|friday)

This results in the random choice of one of three basic sentence constructions (“I can xxxx you that it will be dispatched…” or “I have (already) yyyy for it to be dispatched…” or “I have (already) ensured that it is dispatched…

You can repeat these as many times as you like.

Curly braces

The {} syntax is more versatile. I can show this with a modification to the ‘nesting’ example:

I (can (assure|promise|guarantee to|warrant to) you( sincerely){0,1} that it will(, (unless a disaster happens|without fail|God willing),){0,1} be|have( already){0,1} (requested|mandated|arranged|asked) for it to be|have( already){0,1} ensured that it is) dispatched on (monday|tuesday|wednesday|thursday|friday)

The (sincerely){0,1} means ‘do the preceding bracketed expression between 0 and 1 times’, meaning in this case that ‘sincerely’ will appear half the time, roughly. There is a shorthand version of this using the question mark, (sincerely)?.

If you leave out the maximum value, for example (already){3,}, it will produce the bracketed expression at least three times, and the maximum will be the maximum width of the column. I haven’t dared to find out what happens if the column you are filling is a varchar(max), so be warned!

Asterisk (*)

There are just a few other bits of Regex syntax you might need. The asterisk means “zero or more”, up to what will fit in the datatype. For example:

Never will I do that! (No!)*

Plus (+)

We can use the plus symbol (+) to indicate one or more repetitions of the preceding expression, again up to the maximum width of the datatype. For example:

I like it: (Very, )+ much

Question Mark (?)

As mentioned earlier, the ‘?’ simply means “include or not”.

I will (definitely )?(not )?Vote for your (wonderful )?candidate

(regexp) grouping with | alternatives

For example:

(He is|She is|it is|they are|You are|We are) (despotic|dictatorial|autocratic|tyrannical|absolute|imperious|arbitrary|domineering|overbearing)

Bitter complaints, soothing replies

This reverse-regex syntax is very versatile once you get the hang of it, so let’s then make a stab at creating some angry customer complaints and their soothing replies.

We can do it just using string literals, as we have up to now, but the regex expression soon becomes rather bulky, and difficult to maintain, as you can see:

The expression is a bit intimidating isn’t it? At a certain point, we need to use another tactic.

Building Reverse regex expressions with text files

As an alternative, the reverse-regex expression can simply extract one random line from a list of lines saved in a file, in the order the files are listed. With the built-in files, you could try this:

($"UKPostTown.txt", $"UKCity.txt", $"UKCounty.txt")

This fills in an address field for a UK address, using standard text files shipped with SQL Data Generator. However, we can and should add our own custom files to this directory. There is a slight problem in that we lose some of the variability because we can’t embed regex expressions in these file-based lines of text. Although the number of alternative strings is enormously increased, you cannot put reverse regex expressions inside the lines in the file. The directory is in a Config subdirectory of the installation directory for SQL Data Generator. To determine where SQL Data Generator program is installed, the easiest thing to do is to select a File List generator for a column, then hit Browse to establish where it’s stored.

In this case, we will provide text files with lists of sentences taken from complaints. We have a lot of source material, more than enough for even the most exacting requirements, thanks to the sites that publish their customer comments.

Curiously, customer complaints are usually structured in three phases. They start out venting their anger. Then they explain their problem. Finally, they huff and puff. This means we only need three files from which we pick a line.

The same, interestingly, is true of the responses to a complaint. They start with a gush of sympathy. Then they respond to the complaint as well as they can, and finally pirouette off with an apology or well-chosen phrase implying that the customer should stick with the company. Again, three files. I include my example data but with a cautionary word that the more you collect of this stuff, the better.

Here is a relatively simple example:

From the customer: 
($"firstSentence.txt". ){1,2} $"MiddleSentence.txt".  $"FinalSentence.txt"{1,2}

$"NamesFirst.txt" [A-Z]? replied:
$"FirstSentenceReply.txt" $"SecondSentenceReply.txt" $"FinalSentenceReply.txt"

This will produce something like this:

You can adjust the complaint and response lines in the source text files to suite what you need. When you’re satisfied, hit Ctrl-G to generate the data, then in SSMS, we can extract a few sample complaints:

From the customer: 
Absolutely ridiculous service. Whats the point of picking a day for the item to be delivered if its going to be delivered on a different day?!!!.  Never again.  Disgusting service.

Jean replied:
Thanks for reaching out. I'm so sorry, I definitely understand your concern and I'm escalating your issue so that someone can take a closer look at what's going on right away. We're proud that you've chosen to buy from us in the past, and we would like the opportunity to keep you as a satisfied customer. Thank you for your email, although I was most concerned to read of the problems you've been having and offer my sincere apologies.

From the customer: 
Really I'm surprised that you risk your reputation like this Disgusting service. This has been a great disappointment.  I ordered my friend a very expensive gift three weeks ago, meant to be next day delivery. A week ago still hadn't arrived so I phoned your customer services who said it was probably LOST IN TRANSIT so ordered me another one. I waited and waited and it didn't arrive so I phoned and asked for a refund and to cancel the order which you said You would do. I've heard nothing since.  Definitely will NOT be coming back. Useless. Wish I went elsewhere.

Dan F replied:
Thank you for your email, although I was most concerned to read of the problems you've been having and offer my sincere apologies. We take our customers' problems seriously and are glad to hear from you. I'm so sorry for the confusion and frustration this has caused. As much as we rely on technology and training to provide customers with consistent, up-to-date information, we deeply regret when breakdowns like this occur.

From the customer: 
I really feel let down. This is the 4th time I've had to complain.  Order 3 items on 3 for 2 to be delivered to my local store. Received an email to say only 2 of my items were delivered and there was a delay on my item but no reason why. When I checked my account online, it said my item was dispatched to store.  Terrible customer service, assurances not fulfilled. I will never purchase from you again. In my opinion you are Lying , cheating , thieves , who don't care about the customer and break all data protection rules . NEVER EVER AGAIN, none off your staff can tell the truth, they all mislead you.

Bob HR replied:
Thank you for your email, although I was most concerned to read of the problems you've been having and offer my sincere apologies. We will immediately trace the shipment to see exactly what went wrong. Thanks for reaching out. I'm so sorry, I definitely understand your concern and I'm escalating your issue so that someone can take a closer look at what's going on right away.

You can, of course elaborate the generator, especially if you have more phrase-banks with which to work. You can have nested expressions containing references to files, so you can make things much more complicated.

You can leave a macro in there, for example, maybe something distinct that wouldn’t occur in normal text, such as |name|, which can then be replaced once the text is loaded into the database with SQL, by using the REPLACE function within an UPDATE statement. You can reference a value in another column. However, the regex technology itself is simple, with a bit of practice, and the generation process is very quick.

Conclusions

As the technology for generating fake data, or obfuscating real data, becomes smarter and smarter, so the requirement to work with real personal or sensitive data gets less and less justifiable. Sure, there will always be a need for checks that can be done under secure conditions in staging, but not for general use. Besides that, it is quite a fascinating task that can extend the skills of a developer and provide harmless amusement on a dull afternoon.

Downloads

  • Fred Bullmer

    Now I know how phishing e-mails get generated.

You may also like