2 May 2018
2 May 2018

Pseudonymizing your data with SQL Data Generator

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Imagine that you are the CIO of AdventureWorks. Out of a blue sky comes an order from Taxman that you supply details of all your sales, along with the tax charged to your customers. They want to be able to check that the total tax you pass on to them is correct, and is the same as the tax charged to the customers. They need the details of the purchases, without needing the details of the customers.

There was a time when you might have shrugged and just sent them a copy of the relevant data straight from your database, or even sent them the entire database. Nowadays you need to be more careful. After all, imagine the lurid stories if AdventureWorks became one of those databases that was available ‘everywhere’, after being copied onto the Dark Web? No, before distributing the data, you must obfuscate all the identifiable details of your customers, whilst still giving the required information to people who have a right to inspect your books.

After a difficult conversation with the team leader of your ops team, you decide that because of the complications of identifying and masking out all the sensitive data, it is impractical to deliver a ‘pseudonymized’ copy of the relevant tables from the database. Instead, you want to generate a data-masked version of your standard report, which the Tax Men can pore over.

How? Your developers say they could modify the existing reports slightly but obfuscating various fields, using SQL, would take a long time. Dust out that copy of SQL Data Generator (SDG) and upgrade it, if it tells you there are later versions, because with a little creativity it can do all the hard work for us.

Create the Reporting view

We’ll create a view containing all the data we need for the report to the tax man. Listing 1 just creates the single view we need, but you can create as many of these datasets as you want or need, within a single SDG project.

Listing 1

This view will be our data source, in the SDG project, but as also need to create an empty ‘target’ table, which SDG can use to create the masked report data. Listing 2 will build the empty MaskedReport table.

Listing 2

Creative use of the Existing Data Source in SQL Data Generator

Now we start SDG, create a new project, and specify AdventureWorks as our database.

Figure 1

You’ll notice our empty MaskedReport table is listed. Click on it and you’ll see, in the lower pane, that SDG is straining at the leash, and has figured out strategies for filling in this table with fake data. This might be OK for user training on the AdventureWorks applications, but it will irritate the people in the Tax office more than a little. No, we must first introduce the real data, and then fill the corresponding columns before we then mask the columns that reveal personal data.

Therefore, we need to deselect every table and view (use the button at the bottom), except our masked table.

Figure 2

Instead of generating the data, we want SDG to take the data from our ReportForTaxMan view, so set the table generation settings to use an existing data source.

Figure 3

Hit the Browse button, specify the connection to the same Adventureworks database and select the view.

Figure 4

Hit the Finish button, and SDG will populate the preview pane, for MaskedReport table, with the original data from the view.

Figure 5

Masking the personal or sensitive data

Now we can overwrite the real information with the fake data, where necessary, column by column. I’ll just show how to mask one column, PhoneNumber, to make the point, since the tax people don’t require these customer details.

Click on the column and the Column generation settings pane appears, above the data preview pane. Click on the dropdown list that specifies the current generator (which you’ll see is our view), expand the Personal category of generators and choose the Phone Number generator.

Figure 6

We can use the Phone Number generator without modification, but in my previous article I’ve offered plenty of examples of how to customize SDG’s built-in generators, as required.

Figure 7

We can then go on to do as much data masking as we need. Having obfuscated all personal or sensitive data, save the project and then hit the Generate Data button to run the generator, and populate the MaskedReport table. Any existing data in the table will be cleared out, and replaced with a mixture of masked and real data.

Finally, inspect the contents of the MaskedReport table in SSMS to make sure it all worked correctly.

Figure 8

This dataset is now ready for export, via BCP or by doing a JSON export! If you have several views, each view will insert into separate equivalent tables, of course. I’ve just shown the simplest example of how we can use SDG not only to generate fake data, but to fill tables with a mix of real and generated data.

Build an obfuscated version of the entire database

We can extend this technique to obfuscate the personal or sensitive data in an entire database, although this technique doesn’t scale well to large databases, and you’d be better off with a tool such as Data Masker for SQL Server, which also offers a range of techniques to scramble, shuffle and mask existing data, rather than just replace it with fake data.

However, in SDG, to do this we create a new version of the database, and start up SDG as if we were filling it with faked data. Select a table, and specify that SDG should populate it by mapping to another SQL table or view, as demonstrated in Figure 2 to 4 above.

For each subsequent table, it defaults to that same connection and pre-selects the table with the same name, which is a lot quicker. When you click the Finish button on the wizard, SQL Data Generator matches columns in the two tables based on data type and column name. You can then dal with any columns that need masking.

This works fine for small databases but doesn’t scale well because setting up the project initially requires every table to be specified individually, as using an existing data source, with a default connection. There is no button that allows you to match, at a stroke, all the tables of the empty target database with their counterparts in the source database.

Building just the part of the database that you want

Instead of selecting all the tables, you select just the related tables that provide the data that has been requested. You will need all the tables that are referenced by your FOREIGN KEY constraints: they will need theirs too, and so on. In our case, we can stop with Person.BusinessEntity and Person.Person. This enables us to fill just a small subset of related tables. We then specify that we populate an entire table, or several columns in a table, by mapping to another SQL table or view, and then over-ride individual columns that you wish to mask,

Sometimes it fails because you’ve forgotten a dependent table. If so, concentrate on fixing the first error because these will cascade. If SDG fails with a table it presses on regardless, without the table, and you get errors compounded.

Summary

SQL Data Generator was, as its name suggests, intended to generate the data for databases, as required for testing, development work and training. However, it is perfectly at home with using the original data as the default source of the new data, and then substituting in fake data for any column that you want to mask. This is fine for a wide range of masking purposes, though it is slow and burdensome if you need to obfuscate an entire database. For this you’d need a tool such as Data Masker.

However, many requests for data extracts involve only a handful of tables, such as a customer list, and this is very easy to do in SQL Data Generator.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

You may also like

  • Article

    How to start producing realistic test data with SQL Data Generator

    Many organizations invest heavily in their User Acceptance Testing (UAT) environments, setting up infrastructures that mimic the production environment, often using replicated production data. This is important because it means they can perform acceptance testing as well as security testing, performance testing, error-handling testing, and stress/load testing in production-like conditions, and so be confident they’ll

  • Article

    How to generate realistic text data using SQL Data Generator

    SQL Data Generator (SDG) is very handy for making a database come alive with what looks something like real data, and, once you specify the empty database, it will do its level best to oblige. To get the best results though, you need to provide SDG with some hints on how the data ought to

  • Article

    Realistic, simulated data for testing, development and prototypes

    Generating realistic test data, which reflects accurately the nature and distribution of the data it is emulating, is a challenging task. The task is made more complex if you need to generate that data in different formats, for the different database technologies in use within your organization. This article proposes a scheme for using Redgate

  • Article

    Automatically build-and-fill multiple development databases using PowerShell and SQL Data Generator

    When you are working with a database, you always need data. This is why you need to stock the database with data after you build it. Sometimes, you just want a large number of made-up customer details, sales figures or the like. You also will need columnar data, sometimes known as ‘static data’ or ‘enumerations’,

  • Article

    Automatically filling your SQL Server test databases with data

    This article explains how you can use SQL Data Generator (SDG) to automate data provisioning for test databases, during the database development cycle. Having first created an SDG project file for a database, to define the data generation strategy, we can write a command line batch script, or PowerShell script, which uses a build script

  • Forums

    SQL Data Generator Forum

    Test data generator for SQL Server databases