31 August 2018
31 August 2018

Pseudonymizing a Database with Realistic Data for Development Work

How to use SQL Data Generator, and PowerShell to obfuscate personal data (names), while retaining the same distribution of data, so that the test database behaves like the original.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

For this demonstration, we will take AdventureWorks and produce a pseudonymized copy for development work. The aim is to surgically alter just the data that can identify individuals but leave everything else intact. I’ve shown how to do something similar before, for a subset of the data, in Pseudonymizing your data with SQL Data Generator. However, this time we want to work with the whole database, and we also need to ensure that the new data has the same overall data distribution as the old data. We want the data to look uncannily like the real thing.

For the demonstration, we will restrict ourselves to removing just the names in a single table, and replacing them with randomised names, generated only from the characteristics of the real names, to get the same distribution as the original. Hopefully, the result is suitable for development work as well as testing.

Although we will use SQL Data Generator to do this, the principle also works with Data Masker for SQL Server, and you will want to use that tool if dealing with any relationships. The spoof data is generated according to a Markov distribution, in order to give it extra verisimilitude, and my Spoofing Data series on Simple-Talk provides further details on using Markov chains in T-SQL, so I won’t repeat much of that detail here.

The Process

This process, once set up, is infinitely repeatable via PowerShell automation so that you can easily make it part of your daily build. It is designed to get over the problem that entirely generated data doesn’t ever conform to the same distribution of data as the original, or even to look much like the original data. Remember that were this real data, you would still need to keep the resulting data under the same security regime as the original, because it is pseudonymized rather than completely anonymized. Most of the data is left intact, so a determined hacker could identify a few individuals in pseudonymized data

Preparing a copy of the database with its data

In this demonstration, I’m using a database called Phasael, which is just copy of AdventureWorks2016, including the data. You can restore from a backup, attach a copy of the mdf file, or use the technique I show here, Deploying Multiple Databases from Source Control using SQL Change Automation, using SCA and BCP.

For the sake of this demonstration, we are only going to alter just one table, Person.Person. In reality, we’d need to find, and replace, all the personally identifiable information, such as the address information, and so on. We aren’t going to mess with any relationships, as defined by the FOREIGN KEY constraints. For that, you’d need Data Masker for SQL Server.

In SSMS, run Listing 1 on Phasael, to disable constraints. Don’t worry, this is a temporary thing you do when changing data and it is reversible, as I’ll eventually show.

Listing 1

Preparing to make lists of data

Now we create the new data in SQL, using the real data as inspiration. This is done only once when starting this process, and thereafter only occasionally, if the characteristics of the original data change significantly.

We create a new utility database or use an existing one on the same server as the database we are obfuscating. To start, we create some utility procedures that we’ll use to generate lists of column values that contain spoof values that conform as closely as possible to real values.

Surnames, for example, might look like this: Serez, Wara, Garson, Subran, Ree, Perris, Gomirez, Zhans, Lopez, Riversen, Murkera, Herry, Munoz, Natterrez, Shan, Raje, Zhe, Kumarro, Simmon, Johns, Lopez, Liu, Kin or Gomez.

A Markov table merely records the frequency of every set of adjacent three-letter combinations (trigraphs). Factor, for example will be composed of ‘Fac’ (position 1), ‘act’ (position 2), ‘cto’ (position 3), and ‘tor’ (position 4). Add a few thousand last names into the mix, and there will be several trigraphs that can follow ‘Fac’. If we then select ‘Fac’ by random numbers, according to its frequency in the distribution we would choose one of them to get the next letter. Obviously, they would all start with ‘ac’. You can do it randomly, or by frequency in the population. We would choose the latter, though the former provides funnier results, and more akin to ‘smudging’. You then just continue finding each trigraph, each of which contributes a letter to the final string. When you reach the point where a trigraph is a string-terminator, then you’ve got your value.

Listing 2

Making the lists of data as tables

Now we can generate lists of the randomised spoof data, from markov chains. There will be a table for each list, just to keep things simple. We will generate lists for the FirstName, MiddleName, LastName, Title and Suffix columns.

This will take some time but remember that you only need to do it once, until the nature of the original data changes significantly.

I’ve generated this script by cut ‘n paste from a single batch that did just one column, so this is less arduous than it looks. You can, of course, do this with a stored procedure, but even that isn’t entirely as elegant as you’d imagine.

Listing 3

Let’s just test out what we’ve done. Whereas the original had 56 people whose first name started with ‘Aar‘, all of which were Aaron, the Markov chains have introduced a rich variety of 49 names beginning with ‘Aar‘, because the string ar_ (where ‘_‘ means any character!) occurs with many different third characters in the data at the second position.

Listing 4

It yields the results in Figure 1.

Figure 1

We can see this variety by using this query:

Listing 5

Which gives:

Figure 2

Note that some data can’t be generated via Markov chains. Credit cards, for example, require a different technique, see Spoofing Data Convincingly: Credit Cards. Also, I show how to do dates in Spoofing Data Convincingly: Masking Continuous Variables.

The whole point is to make the spoofed data convincing enough to fool the casual glance, and pass obvious CHECK constraints, whilst keeping the distribution of the original.

Copying the tables to files

Now we have these lists as tables, it is the work of a moment to turn them into ASCII lists in your filesystem. We’ll use PowerShell for this. This script will do any lists of a column in a table in a database, as long as it is on the source server, defined by $SourceServerName.

Listing 6

If you run this, then you will find your lists written out as text files in your file system.

Figure 3

If you do a lot of database testing, you are likely to end up with many of these lists, and they are very handy for the purpose. It is possible to stitch these into tables using a technique I demonstrate in the blog ‘Spoofing Data Convincingly: Doing the Shuffle and Zip’.

Creating the SQL Data Generator project file

Now we are ready to create the SQL Data Generator project file (.sqlgen)that we can then use to automate this process.

Open SQL Data Generator (SDG), start a new project, and set it to the database we need to pseudonymize, Phasael. Then, exclude all the tables, as in the illustration, by clicking on Actions | Exclude All.

Figure 4

SDG now has nothing to do, so our next job is to select the Person.Person table:

Figure 5

You will see a warning to the effect the Person.Person is referenced by a FOREIGN KEY constraint, so SDG can’t delete the data in the table before filling it. However, it doesn’t know that we already disabled the CHECK constraints, so we can ignore these warnings.

Now we specify a generator for the entire table, which will be used for all the columns as appropriate. At the moment, SDG has only one table-level generator, but it is extremely powerful, because individual columns can be subsequently over-ridden.

Figure 6

Hit Browse… to select an existing data source and specify the server and database to use as the source of the table. By default, SDG will look for the source table in the Phasael database, but of course the source mustn’t be the same database and table as the one we are stocking with data, so specify the original copy of the database, AdventureWorks2016, instead.

Select the Person.Person table and hit Finish. We have now populated our Person.Person table in Phasael with the original data. SQL Data Generator is nervous at this point, because it hasn’t detected that we’ve disabled constraints. Fortunately, SQL Data Generator will always attempt an operation, despite its warnings and misgivings.

Figure 7

If we select any column in the preview pane, above, we can over-ride its generator. We want to therefore over-ride the generator for our individual name-related columns, while keeping the rest of the data intact.

Figure 8

We want the File List generator, so we click it and SDG will choose one of the pre-supplied file lists from its Config folder. Hit Browse, and point it instead to our FirstName.txt file, and lo…the spoof data from our SQL and PowerShell machinations is now in the preview pane.

Figure 9

Now we just repeat this process for the four other lists, and we get this:

Figure 10

So, now, the names bear no resemblance to the originals. The final step is to see if our generator will work despite SQL Data Generator’s nervous warnings, so hit Actions | Generate Data and SDG will present you with this action plan:

The data generation actions are listed in execution order.
Delete

And when you hit the Generate Data button, it does it. Once the data generation process completes, you now have the Phasael database, with obfuscated name data.

Make sure that you save the .sqlgen project file, which is an XML file with all your settings. With this, you can repeat the process as many times you need. If you plan to run this pseudonymization process often, it is worth nailing down all your settings, such as the location and name of lists, and the servers you use for the process. These SQLGEN projects are easily modified, but they are laborious.

The last thing you need to do is pop back over to SSMS and re-enable constraints on the our partially-obfuscated Phasael database.

Listing 7

Tidying up

This is, of course, only a demonstration, but if you are doing this for real, you need to be careful not to leave artefacts. We deleted the old rows from the database, but you have to be sure that you haven’t left the old data in the transaction log, or the data pages, when doing this for real with sensitive data. The best way is to BCP all the data across, from the database to a freshly-built copy, but this is only practical with small databases.

Automating the process

I show how to automate a SQL Data generation in my article, Automatically filling your SQL Server test databases with data. To see how to do this for a whole range of development machines, see Automatically build-and-fill multiple development databases using PowerShell and SQL Data Generator.

Conclusions

The worst part of preparing development databases with data that has the characteristics of the real data, but without any real personal data, is the time it takes to do it. Plenty of solutions work with small amounts of data. With this solution, once the SQLGEN file is prepared, the only slow parts are the preparation of the spoof data and the actual deletion of the existing data.

The lists of spoofed data don’t have to be of the entire set of data, since SQL Data Generator will cheerfully repeat its contents randomly for as long as you wish, and you only create the lists once. The deletion process won’t worry you much if it is automated and can run overnight.

I’d like to emphasize that this process can be done without SQL Data Generator, but only with a lot of hard work, testing and maintenance. There is no excuse nowadays for developing databases with production data that contains sensitive or personal data.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

You may also like