Spoofing Data Convincingly: Altering Table Data

When you are developing an existing database, or demonstrating it, you nowadays need pseudonymised data, or even better, completely anonymized data. This data has to look right at first glance, and it needs to have the same distribution as the real data.

Although we are yet to tackle continuous variables with complicated distributions such as sales figures or dates, we can now generate words. This article is about using a finite-state Markov chain with stationary transition possibilities. You can use the technique to generate words, sentences, articles or entire books.

In our first attempt to do a markov-style obfuscation of words, we used an algorithm that was very iterative and was more useful for generating file-based lists of words than for SQL based pseudonymization. File-based lists are excellent for use with data masking, or data-generation, tools. However, if we are using SQL we have the additional resource of the markov table that we can use whenever we want. The advantage of this is that we can change the algorithm to apply to the markov table in order to generate different degrees of data obfuscation.

Before we go on to obfuscate other datatypes than words, we’ll take a bit of time to show how one might use Markov tables in order to pseudonymise tables in-place.

Let’s take, for example, the AdventureWorks person.person table: We want to be able to obfuscate it easily by means of :

(No, I copied the database first!)

Before you get over-excited, it took twenty-eight seconds to do the 20,000 rows of the task. You’d do much faster with prepared tables of obfuscated words, or word collections in files used by a third-party tool, but it is a start.

Here was the same data before we pseudonymized it.

We check the distributions of the data before and after we make the changes, to make sure we haven’t altered it. Let’s first check on the distributions of customers’ lastnames before and after running our process.

And we can do the same with firstnames just to make sure we are matching the existing distributions

We are taking a slow-but-steady approach. We rewrite our code from the previous blog post that assembles the string; it now uses a view to get its random numbers, and we’ll speed it up slightly by putting a bit more intelligence into the markov table. We then put it in a slow User-defined Scalar function. We  want a scalar function that isn’t schema verified and is not considered to be deterministic. The reason for this is that it has to be executed every row despite having the same parameter.

There are many ways to store the information permanently in a Markov table but we’ll be using Table-valued parameters for our function. I’ll show how they are generated from the original information in AdventureWorks, but they could be so easily fetched from a table of markov entries with each markov set identified by a name. This could be delivered to you by the production DBA so that you wouldn’t need any access to the production server.

Now we can try it out, using AdventureWorks.

Although this is fun, we need to move on to do more. You will want to try out what we’ve done on that modification date in AdventureWorks. If you do, you will see that it will produce dates that are legal, but our algorithm won’t produce the correct distribution. Even if we change the dbo.FakedStringFrom function to choose all the characters according to their statistical distribution it will fail to work convincingly with small samples. It needs a rather different technique. We’ll talk about that in the next blog in this series.