Spoofing Data Convincingly for Database Development. An Introduction

Many times I’ve been told, by developers who are using live data to develop a database, that it is impossible to anonymise or pseudonymize their data to comply with privacy legislation. One recurring explanation is that one can’t duplicate the distribution of data by faking it. In a sense, this is true, because if you change data you always upset some part of it, such as the number of people called ‘Nigel’, living in Montana. However, this is likely to be a minor inconvenience compared with the risks of exposing personal, financial or business data. Other than this, it is relatively easy to have as much data as you want that looks convincing and conforms to the distribution of the real data. Not only that, but to do it, you need never have access to the real data. All you need to know is the distribution of data.

This article sets the scene for a whole ‘spoofing data…‘ series:

Generating fake country names

Let’s take a simple example. Imagine that you need to have a number of fictitious countries for your database. You feed in the actual countries and you get out fictitious, randomized, countries. Actually, you get fifteen percent of real countries and the rest are fictitious unless you filter out the real ones as we will be doing. This is a tweakable algorithm.

Let’s do it in a simple batch and see what happens. I’ll explain later.

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

/* we start off with a list of strings. In this case, it is the names of countries.

this can come from a file or from a table column. It could come from anywhere */

DECLARE @RawValues TABLE (theString nVARCHAR(30))

INSERT INTO @RawValues(theString)

SELECT country FROM (values('Afghanistan'), ('Albania'), ('Algeria'), ('American Samoa'),

('Andorra'), ('Angola'), ('Anguilla'), ('Antarctica'), ('Argentina'), ('Armenia'), ('Aruba'),

('Australia'), ('Austria'), ('Azerbaijan'), ('Bahamas'), ('Bahrain'), ('Bangladesh'), ('Barbados'),

('Belarus'), ('Belgium'), ('Belize'), ('Benin'), ('Bermuda'), ('Bhutan'), ('Bolivia'),

('Botswana'), ('Brazil'), ('Bulgaria'), ('Burkina Faso'), ('Burundi'), ('Cambodia'), ('Cameroon'),

('Canada'), ('Cape Verde'), ('Chile'), ('China'), ('Colombia'), ('Comoros'), ('Congo'),

('Cook Islands'),('Costa Rica'), ('Côte d''Ivoire'), ('Croatia'), ('Cuba'), ('Cyprus'),

('Czech Republic'), ('Denmark'), ('Djibouti'), ('Dominica'), ('Ecuador'), ('Egypt'), ('Eire'),

('El Salvador'), ('Eritrea'), ('Estonia'), ('Ethiopia'), ('Falklands'), ('Fiji'), ('Finland'),

('France'), ('French Guiana'), ('Gabon'), ('Gambia'), ('Georgia'), ('Germany'), ('Ghana'),

('Gibraltar'), ('Greece'), ('Greenland'), ('Grenada'), ('Guadeloupe'), ('Guam'), ('Guatemala'),

('Guernsey'), ('Guinea'), ('Guinea-Bissau'), ('Guyana'), ('Haiti'), ('Honduras'), ('Hong Kong'),

('Hungary'), ('Iceland'), ('India'), ('Indonesia'), ('Iran'), ('Iraq'), ('Isle of Man'),

('Israel'), ('Italy'), ('Jamaica'), ('Japan'), ('Jersey'), ('Jordan'), ('Kazakhstan'), ('Kenya'),

('Kiribati'), ('Kuwait'), ('Kyrgyzstan'), ('Latvia'), ('Lebanon'), ('Lesotho'), ('Liberia'),

('Libya'), ('Liechtenstein'), ('Lithuania'), ('Luxembourg'), ('Macao'), ('Macedonia'),

('Madagascar'), ('Malawi'), ('Malaysia'), ('Mali'), ('Malta'), ('Martinique'),

('Mauritania'), ('Mauritius'), ('Mayotte'), ('Mexico'), ('Micronesia'), ('Moldova'), ('Monaco'),

('Mongolia'), ('Montenegro'), ('Montserrat'), ('Morocco'), ('Mozambique'), ('Myanmar'),

('Namibia'), ('Nauru'), ('Nepal'), ('Netherlands'), ('New Caledonia'), ('New Zealand'),

('Nicaragua'), ('Niger'), ('Nigeria'), ('Niue'), ('North Korea'), ('Norway'), ('Oman'),

('Pakistan'), ('Palau'), ('Panama'), ('Paraguay'), ('Peru'), ('Philippines'), ('Pitcairn'),

('Poland'), ('Portugal'), ('Puerto Rico'), ('Qatar'), ('Réunion'), ('Romania'), ('Russia'),

('Rwanda'), ('Saint Helena'), ('Saint Lucia'), ('Samoa'), ('San Marino'), ('Saudi Arabia'),

('Senegal'), ('Serbia'), ('Seychelles'), ('Sierra Leone'), ('Singapore'), ('Slovakia'),

('Slovenia'), ('Somalia'), ('South Africa'), ('South Georgia'), ('South Korea'), ('Spain'),

('Sri Lanka'), ('Sudan'), ('Suriname'), ('Swaziland'), ('Sweden'), ('Switzerland'), ('Syria'),

('Taiwan'), ('Tajikistan'), ('Tanzania'), ('Thailand'), ('Timor-Leste'), ('Togo'), ('Tonga'),

('Trinidad'), ('Tunisia'), ('Turkey'), ('Tuvalu'), ('Uganda'), ('Ukraine'), ('United Kingdom'),

('United States'), ('Uruguay'), ('Uzbekistan'), ('Vanuatu'), ('Vatican City'), ('Venezuela'),

('Vietnam'), ('Virgin Islands'), ('Western Sahara'), ('Yemen'), ('Zambia'), ('Zimbabwe')

)f(country)

--Drop our Markov table if is already exists

IF OBJECT_ID('tempdb..#markov') IS NOT NULL DROP TABLE #markov

CREATE TABLE #Markov

(

trigraph NCHAR(3) NOT NULL,

iteration TINYINT NOT NULL,

frequency INT NOT NULL,

runningTotal INT NOT NULL

);

/* in our table, we will take every three-character trigraph in the word sequence and calculate

the frequency with which it appears in the collection passed to the routine. This is

simple to do with a GROUP BY clause.

The quickest way I've found of doing this is by a cross-join with a number

table (I've used a VALUE table here to reduce dependencies). The result is then

fed to a window expression that calculates the running total of the

frequencies. You'll need this to generate the right distribution of

three-character triads within the list of sample words

INSERT INTO #Markov(trigraph,iteration , frequency , runningTotal )

SELECT Frequencies.trigraph, Frequencies.Iteration, Frequencies.frequency,

Sum(Frequencies.frequency) OVER

(PARTITION BY Frequencies.Iteration ORDER BY Frequencies.frequency,trigraph)

AS RunningTotal

FROM

(

SELECT Substring(' ' + Coalesce(theString, '') + '|', f.iteration, 3) AS trigraph,

f.iteration, Count(*) AS frequency

FROM @RawValues AS p

CROSS JOIN

(VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

(31),(32),(33),(34),(35),(36),(37),(38),(39),(40)

) f(iteration)

WHERE Substring(' ' + Coalesce(theString, '') + '|', f.iteration, 3) <> ''

GROUP BY Substring(' ' + Coalesce(theString, '') + '|', f.iteration, 3),

f.iteration

) AS Frequencies(trigraph, Iteration, frequency)

ORDER BY Frequencies.Iteration;

/* we create a primary key on iteration and trigraph.

ALTER TABLE #Markov

ADD CONSTRAINT PK_Markov PRIMARY KEY NONCLUSTERED (iteration,trigraph);

DECLARE @GeneratedWord TABLE (name VARCHAR(30))

DECLARE @RowsWanted INT = (SELECT Count(*) FROM @RawValues)

DECLARE @jj INT = 1,@RowCount INT, @ii INT, @NewWord VARCHAR(50) = ' ';

DECLARE @EntireRange INT --for matching the sample distribution

= (SELECT Sum(frequency) FROM #Markov WHERE iteration=3)

WHILE (@jj <= @RowsWanted) --for as many data items needed

BEGIN

--insert the first trigraph randomly, randomly choosing tied ranks

DECLARE @random float=Rand();

SELECT @NewWord = trigraph

FROM #Markov

WHERE iteration=3

AND (@random * @EntireRange) BETWEEN (runningTotal-frequency) AND runningtotal

DECLARE @MaxLength INT = 50;

--now we can use markov chains to assemble the word

SELECT @ii = 4, @RowCount = 1;

WHILE ((@ii < @MaxLength) AND (@RowCount > 0))

BEGIN--we can make our choice based on its distribution but

--it tends to reproduce more of the real data and less made up

SELECT TOP 1 @NewWord = @NewWord + Right(M.trigraph, 1)

FROM #Markov AS M

WHERE M.iteration = @ii

AND M.trigraph LIKE Right(@NewWord, 2) + '_'

ORDER BY NewId();

SELECT @RowCount = @@RowCount;

SELECT @ii = @ii + 1;

END;

INSERT INTO @GeneratedWord SELECT Replace(LTrim(@NewWord), '| ', '');

SELECT @jj = @jj + 1;

END;

--we will just select the made-up names. Otherwise just select all

SELECT [@GeneratedWord].name

FROM @GeneratedWord LEFT OUTER join @rawvalues

ON name= TheString

WHERE TheString IS NULL

The results are like this but different every time, and all in a different order.

_{Palaysia, Jape of Man, Sermark, Monduras, Ukragaledone, Tanuatu, Sudandi, Bulginique, Belgary, Ghainia, Viribati, Soman, Czerbati, Huninica, Myan, Vataragua, Zamor-Ledor, Palta Rino, Freen, Huninica, Zamal, Tuvalia, Mauru, Amerra Leone, Madagal, Namboutica, Montserlands, South Arabia, Montserrabia, Zamanica, Costragua, Bulgary, Irazilla, Grence, Switria, Souta Republic, Kuwaica, Andia, Zamibatica, Barbabwe, Ukrael, Camanon, Andia, Sinlan, Swanada, Tontemala, Antarus, Puerra Leonia, Nepanon, Alba, Netherla, Icelanistates, Costerlands, Aussia, Aussia, Phinida, El Saltar, Virgyzstan, Siech Republic, Bolarus, Lestralia, New Cal, Andor-Ledone, Pordan, Marbia, Mayotho, Mexicana, Guinadad, Vieritania, Tuvalia, Myanda, Mayottenste, Honted Stan, Sriti, Eston, Indon, Thaila, Perbia, Armer, Camor-Lestein, Moli, Namen, Afghand, Branma, Virgyzstan, Berbia, Gerbabwe, Persey, Alba, Naurit, Fraq, Cubanica, Germuda, Souta, Braq, Arubabwe, Polawi, Nicronegro, Bolomboupe, Camorra, Côte of Man, Puerral, Guint Luciands, Spaila, Alba, Kuwaicania, Azech Korea, Siech Republic, Gibya, Kuwaitan Samoa, Unitzerrat, Vire, Domin, Saitenesia, Palivia, Eston, Romaican, Soutan, Mozakhstan, Indon, Green, Swanadad, Cambabwe, Latvinistan, Parba, Naudi Arat, Malvia, Saitrea, Armer, Réunidad, Kyrgium, Haitern City, Maudi Arabia, Montenistates, Colawi, Switina Faso, Rustanique, Philippino, Hainea, Domorras, Rominame, Phina, Nethuand, Andonesia, Rwazilan, Mauru, Leba, Serba, Sinlan, Giberia, Nortugala, Azech Korea, Timbodia, Alger, Senegalesh, Costreal, Yemeria, Hongapore, Amerto Rico, Denisia, South Arabia, Guina, Bermany, Sinlanistates, Camain Islands, Kire, Haiwanique, Azech Arabia and Mauru}

How did we do this? We started with a table that took each word, added two spaces at the beginning and a |, followed by two subsequent spaces, at the end. This allowed us to map the frequency of each three-letter combination in a collection of words. Any language is made up of common combinations of characters with a few wild exceptions. For words to look right, they must follow this distribution. This distribution will change in various parts of a word, so you need all this information.

So what would happen if, instead of feeding the name of countries into the batch, we do the names of people?

We can pull them from a file, one name per line, like this (changing the path to the file, of course)

DECLARE @RawValues TABLE (theString nVARCHAR(30))

INSERT INTO @RawValues(theString)

SELECT Value

FROM OpenJson(

(SELECT '["'+Replace(String_Escape

(Replace(Convert(VARCHAR(MAX),BulkColumn),Char(13)+Char(10),'|')

,'json')

,'|','","')

+'"]' FROM

OPENROWSET(Bulk 'C:\data\RawData\NamesFirstMale.txt', SINGLE_BLOB) [rowsetresults]))

And we get …

_{Clydent, Abeldon, Joelby, Coycent, Dary, Nath, Clex, Lorey, Alaig, Ellan, Coristine, Demuel, Jossel, Jimonso, Dono, Tronzo, Bricence, Wile, Bredd, Jefferly, Sebarow, Brahallus, Donal, Camun, Jaynelin, Dantiso, Ezrared, Jodrifer, Willy, Anto, Amosevel, Jarwin, Reynallus, Vantiago, Taug, Ivoryl, Linstanie, Bra, Tormelan, Sim, Robingo, Ahman, Blanett, Seattin, Pablory, Rahsance, Jametrius, Sterrey, Arien, Alfone, Noraine, Loycent, Ran, Mackiell, Brycey, Bobene, Stant, Samun, Richolph, Tonelo, Aubricey, Ezrac, Cyrust, Tomad, Judsonso, Dunist, Criente, Gernonder, Eltoin, Fede, Cortiney, Jaymund, Kevil, Marcence, Thadfore, Rictos, Noey, Abraelas, Demastian, Darik, Sans, Dewelio, Sheod, Kelsel, Bobb, Albenzo, Emark, Allipe, Mituro, Ceddy, Kenionso, Cormain, Carredo, Jam, Waysses, Stevin, Bary, Emmelo, Adandore, Briastian, Kenver, Steryl, Colbenzo, Eduari, Erne, Edgare, Chasevel, Barcolm, Daval, Gonovani, Wyatty, Simonso, Efraelas, Lucarton, Traviell, Traiah, Lenjam, Huguston, Leidro, Chelby, Brew, Emmald, Garyl, Angenzo, Jeanles, Jac, Abraine, Chuce, Zacques, Cesuelo, Nicke, Benell, Masence, Jeral, Eddiert, Gracy, Edgart, Jendony, Lormaing, Micenzo, Vituron, Roeltobal, Migen, Clayne, Tobiandro, Miturow, Edmothan, Adandre, Derlon, Kylero, Stanco, Carrayne, Collacey, Stevine, Veric, Nelis, Judsony, Delvis, Lanaron, Cruck, Guishall, Angerick, Ellerow, Zack, Clishan, Chanuel, Neaunce, Colert, Ronneline and Donelio}

…and so on. The more data we use for our Markov table, the more convincing the generated strings. The drawback to this sort of data generation comes for the person who has a very uncommon name with unusual combinations of characters. It might be possible to detect from the generated names that this person was on the original list. This is something we have to deal with later on by detecting and eliminating outliers , but I’ll explain that later in this series.

Mimicking Discrete Distributions

Rand(), like most random number generators, will give you random numbers between 0 and 1. It is a flat continuous distribution. If you’ve read my blogs, you’ll know how to get continuous normally distributed random numbers from this. You need far more flexibility than even this. You need to mimic the distribution of your data.

We will do this with discrete rather than continuous data. Imagine you have a column in a questionnaire. You allow “Yes”, “No”, “Occasionally”, “Maybe” and “Don’t know”. 85 people said ‘yes’, 54 said ‘no’, 21 said ‘occasionally’, 14 said ‘Maybe’ and 37 said “don’t know’

So how do you choose randomly in a way that matches this distribution?

You multiply RAND() by 211 in order to give you a number between 0 and 211 You then pick the answer whose running total is greater but nearest to your number

Here is the algorithm sketched out in SQL

DECLARE @QuestionnaireData TABLE (Question VARCHAR(15), Responses INT, RunningTotal INT);

INSERT INTO @QuestionnaireData (Question, Responses, RunningTotal)

SELECT results.Question, results.Responses, results.RunningTotal

FROM

(

VALUES ('Yes', 85, 85), ('No', 54, 139), ('Occasionally', 21, 160),

('Maybe', 14, 174), ('Don’t know', 37, 211)

) AS results (Question, Responses, RunningTotal);

DECLARE @result TABLE (answer VARCHAR(15) NOT null);

DECLARE @ii INT = 1, @iiMax INT = 211;

WHILE (@ii <= @iiMax)

BEGIN

INSERT INTO @result (answer)

SELECT TOP 1 [@QuestionnaireData].Question

FROM @QuestionnaireData

WHERE Rand() * 211 < [@QuestionnaireData].RunningTotal

ORDER BY [@QuestionnaireData].RunningTotal;

SELECT @ii = @ii + 1;

END;

SELECT answer, Count(*) AS responses FROM @result GROUP BY answer;

This will give you a different result every time you try it, but you’ll see that the results stay close to the original distribution of answers. This is just a sketch of the algortithm, and you’ll notice that there is an easier and quicker way of doing the join, but we’ll demonstrate that later on in the article, once we go set-based.

Using Markov Data

We will use this principle to choose the first three characters for our words, which for empty strings will be spaces and an end-stop, which are subsequently removed. For subsequent characters we can use the Markov data. What characters will be likely to follow the preceding two at a particular position in a word? We can even adjust the choice according to the probability, though this is likely to generate more of the original strings.

Here, as an example is the distribution data for the fourteenth position in the string.

The second column gives the character position in the string. The first column gives the three-character trigraph. At this stage, whatever the last two characters in your string may be, you have no choice: it is a simple lookup. If they are ar, then the next character must be ‘a’, if au, then you can only have l, and so on. If there had been two or more trigraphs starting ar or au then you would be able to roll the dice to decide. The column on the right after the frequency of the trigraph is the running total. We need this to select the appropriate character. The design of the Markov table I use has more information in it than is strictly necessary for the algorithm I’m using, but it allows you to try out different techniques according to the requirements you have. I select the third position in the string to start and select the entire trigraph randomly based on its frequency, because that will give a good approximation to the distribution of the original data. Then I select each character in turn entirely randomly so as to get a livelier divergence.

Going Set-based

We don’t much like iteration in SQL Server, in fact we don’t like it at all. However, any process that relies on random number presents difficulties, and we need to resort to modest trickery. We use a view that is as precious to keen data-spoofers like myself as a number table or calendar is to other SQL Server developers. It is a simple thing.

CREATE VIEW dbo.random

SELECT number from (

VALUES

(RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()),

(RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND())

) AS random (number)

…except I have much larger tables than this. With a 4000 row random table (SQL Server 2017), I do this….

SELECT answer, (count(*)/4000.00)*211.00 FROM

(

VALUES ('Yes', 85, 85,1), ('No', 54, 139,2), ('Occasionally', 21, 160,3),

('Maybe', 14, 174,4), ('Don’t know', 37, 211,5)

) AS results (answer, Responses, RunningTotal,theOrder)

INNER JOIN random ON (number*211) BETWEEN (runningTotal-responses) AND runningTotal

GROUP BY answer

This gives the result …

answer proportions

------------ ---------------------------------------

Yes 85.402250000

No 53.224750000

Occasionally 19.886750000

Maybe 14.506250000

Don’t know 37.980000000

…which tells us that our distribution almost exactly matches the one we originally specified. We can now fake a questionnaire very closely!

One huge advantage we now have is that we are allowed to put this in a table-valued function. We can’t do this with anything that has the rand() function because there is a risk to having functions that don’t return the same outputs for the same input. They restrict optimisation. By cheating we run a risk because we now have a function that has to be executed afresh every time. This, however, is what we want.

So with, health warnings, here is the table-valued function. Here we are working to percentages rather than actual figures for the distribution.

IF Object_Id('RandomAnswers') IS NOT NULL

DROP function RandomAnswers

CREATE FUNCTION RandomAnswers

/**

Summary: >

gives responses randomly, given a distribution of

Yes', 51%

'No', 22%

'Occasionally', 16%

'Maybe', 7%

'Don’t know', 4%

Author: Phil Factor

Date: 12/07/2018

Examples:

- Select * from RandomAnswers(50)

Returns: >

table with a simple list of random answers

**/

(

@RowsToReturn int

)

RETURNS TABLE

RETURN

(

SELECT TOP (@RowsToReturn) results.answer

FROM

(

VALUES ('Yes', 51, 51), ('No', 22, 73),

('Occasionally', 16, 89), ('Maybe', 7, 96),

('Don’t know', 4, 100)

) AS results (answer, Responses, RunningTotal)

INNER JOIN

(

SELECT random.number, Row_Number() OVER (ORDER BY (SELECT 1)) AS theorder

FROM dbo.random

) AS f(number, theOrder)

ON (f.number * 211)

BETWEEN (results.RunningTotal - results.Responses) AND results.RunningTotal

ORDER BY f.theOrder

)

Onwards

So, we now seem to have an effective way of generating all sorts of data, and, in particular, the names of things for pseudonymizing data. At this stage, we have plenty of questions still. How effective is it for rule-based strings such as book numbers or postcodes? Just how far can we use this algorithm to generate data? How can we, in practical terms, do the work? How can you generate convincing data without seeing the original data? How do you generate continuous as well as discrete distributions?

We will continue!

Register for Simple Talk

Spoofing Data Convincingly for Database Development. An Introduction

Generating fake country names

Mimicking Discrete Distributions

Using Markov Data

Going Set-based

Onwards

Article tags

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

Generating fake country names

Mimicking Discrete Distributions

Using Markov Data

Going Set-based

Onwards

Article tags

Recommended

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases