{"id":79738,"date":"2018-07-12T16:39:43","date_gmt":"2018-07-12T16:39:43","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=79738"},"modified":"2024-12-12T12:18:42","modified_gmt":"2024-12-12T12:18:42","slug":"spoofing-data-convincingly-for-database-development-an-introduction","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-for-database-development-an-introduction\/","title":{"rendered":"Spoofing Data Convincingly for Database Development. An Introduction"},"content":{"rendered":"<p>Many times I\u2019ve been told, by developers who are using live data to develop a database, that it is impossible to anonymise or pseudonymize their data to comply with privacy legislation. One recurring explanation is that one can\u2019t duplicate the distribution of data by faking it. In a sense, this is true, because if you change data you always upset some part of it, such as the number of people called \u2018Nigel\u2019, living in Montana. However, this is likely to be a minor inconvenience compared with the risks of exposing personal, financial or business data. Other than this, it is relatively easy to have as much data as you want that looks convincing and conforms to the distribution of the real data. Not only that, but to do it, you need never have access to the real data. All you need to know is the distribution of data.<\/p>\n<p>This article sets the scene for a whole &#8216;<em>spoofing data&#8230;<\/em>&#8216; series:<\/p>\n<ul>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-altering-table-data\/\">Spoofing Data Convincingly: Altering Table Data<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-masking-continuous-variables\/\">Spoofing Data Convincingly: Masking Continuous Variables<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-credit-cards\/\">Spoofing Data Convincingly: Credit Cards<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-doing-the-shuffle-and-zip\/\">Spoofing Data Convincingly: Doing the Shuffle and Zip<\/a><\/li>\n<\/ul>\n<h2>Generating fake country names<\/h2>\n<p>Let\u2019s take a simple example. Imagine that you need to have a number of fictitious countries for your database. You feed in the actual countries and you get out fictitious, randomized, countries. Actually, you get fifteen percent of real countries and the rest are fictitious unless you filter out the real ones as we will be doing. This is a tweakable algorithm.<\/p>\n<p>Let\u2019s do it in a simple batch and see what happens. I\u2019ll explain later.<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true\">\/* we start off with a list of strings. In this case, it is the names of countries. \nthis can come from a file or from a table column. It could come from anywhere *\/\nDECLARE @RawValues TABLE (theString nVARCHAR(30))\nINSERT INTO @RawValues(theString)\nSELECT country FROM (values('Afghanistan'), ('Albania'), ('Algeria'), ('American Samoa'),\n('Andorra'), ('Angola'), ('Anguilla'), ('Antarctica'), ('Argentina'), ('Armenia'), ('Aruba'),\n('Australia'), ('Austria'), ('Azerbaijan'), ('Bahamas'), ('Bahrain'), ('Bangladesh'), ('Barbados'), \n('Belarus'), ('Belgium'), ('Belize'), ('Benin'), ('Bermuda'), ('Bhutan'), ('Bolivia'), \n('Botswana'), ('Brazil'), ('Bulgaria'), ('Burkina Faso'), ('Burundi'), ('Cambodia'), ('Cameroon'),\n('Canada'), ('Cape Verde'), ('Chile'), ('China'), ('Colombia'), ('Comoros'), ('Congo'),\n('Cook Islands'),('Costa Rica'), ('C\u00f4te d''Ivoire'), ('Croatia'), ('Cuba'), ('Cyprus'),\n('Czech Republic'), ('Denmark'), ('Djibouti'), ('Dominica'), ('Ecuador'), ('Egypt'), ('Eire'),\n('El Salvador'), ('Eritrea'), ('Estonia'), ('Ethiopia'), ('Falklands'), ('Fiji'), ('Finland'),\n('France'), ('French Guiana'), ('Gabon'), ('Gambia'), ('Georgia'), ('Germany'), ('Ghana'), \n('Gibraltar'), ('Greece'), ('Greenland'), ('Grenada'), ('Guadeloupe'), ('Guam'), ('Guatemala'),\n('Guernsey'), ('Guinea'), ('Guinea-Bissau'), ('Guyana'), ('Haiti'), ('Honduras'), ('Hong Kong'),\n('Hungary'), ('Iceland'), ('India'), ('Indonesia'), ('Iran'), ('Iraq'), ('Isle of Man'), \n('Israel'), ('Italy'), ('Jamaica'), ('Japan'), ('Jersey'), ('Jordan'), ('Kazakhstan'), ('Kenya'),\n('Kiribati'), ('Kuwait'), ('Kyrgyzstan'), ('Latvia'), ('Lebanon'), ('Lesotho'), ('Liberia'),\n('Libya'), ('Liechtenstein'), ('Lithuania'), ('Luxembourg'), ('Macao'), ('Macedonia'),\n('Madagascar'), ('Malawi'), ('Malaysia'), ('Mali'), ('Malta'), ('Martinique'),\n('Mauritania'), ('Mauritius'), ('Mayotte'), ('Mexico'), ('Micronesia'), ('Moldova'), ('Monaco'),\n('Mongolia'), ('Montenegro'), ('Montserrat'), ('Morocco'), ('Mozambique'), ('Myanmar'),\n('Namibia'), ('Nauru'), ('Nepal'), ('Netherlands'), ('New Caledonia'), ('New Zealand'),\n('Nicaragua'), ('Niger'), ('Nigeria'), ('Niue'), ('North Korea'), ('Norway'), ('Oman'),\n('Pakistan'), ('Palau'), ('Panama'), ('Paraguay'), ('Peru'), ('Philippines'), ('Pitcairn'),\n('Poland'), ('Portugal'), ('Puerto Rico'), ('Qatar'), ('R\u00e9union'), ('Romania'), ('Russia'),\n('Rwanda'), ('Saint Helena'), ('Saint Lucia'), ('Samoa'), ('San Marino'), ('Saudi Arabia'),\n('Senegal'), ('Serbia'), ('Seychelles'), ('Sierra Leone'), ('Singapore'), ('Slovakia'),\n('Slovenia'), ('Somalia'), ('South Africa'), ('South Georgia'), ('South Korea'), ('Spain'), \n('Sri Lanka'), ('Sudan'), ('Suriname'), ('Swaziland'), ('Sweden'), ('Switzerland'), ('Syria'), \n('Taiwan'), ('Tajikistan'), ('Tanzania'), ('Thailand'), ('Timor-Leste'), ('Togo'), ('Tonga'), \n('Trinidad'), ('Tunisia'), ('Turkey'), ('Tuvalu'), ('Uganda'), ('Ukraine'), ('United Kingdom'), \n('United States'), ('Uruguay'), ('Uzbekistan'), ('Vanuatu'), ('Vatican City'), ('Venezuela'), \n('Vietnam'), ('Virgin Islands'), ('Western Sahara'), ('Yemen'), ('Zambia'), ('Zimbabwe')\n)f(country)\n \n \n--Drop our Markov table if is already exists\nIF OBJECT_ID('tempdb..#markov') IS NOT NULL DROP TABLE #markov\n \nCREATE TABLE #Markov\n  (\n  trigraph NCHAR(3) NOT NULL,\n  iteration TINYINT NOT NULL,\n  frequency INT NOT NULL,\n  runningTotal INT NOT NULL\n  );\n\/* in our table, we will take every three-character trigraph in the word sequence and calculate\nthe frequency with which it appears in the collection passed to the routine. This is \nsimple to do with a GROUP BY clause. \nThe quickest way I've found of doing this is by a cross-join with a number \ntable (I've used a VALUE table here to reduce dependencies). The result is then\nfed to a window expression that calculates the running total of the\nfrequencies. You'll need this to generate the right distribution of \nthree-character triads within the list of sample words\n*\/\n \nINSERT INTO #Markov(trigraph,iteration , frequency , runningTotal )\n SELECT Frequencies.trigraph, Frequencies.Iteration, Frequencies.frequency,\n  Sum(Frequencies.frequency) OVER \n    (PARTITION BY Frequencies.Iteration ORDER BY Frequencies.frequency,trigraph)\n    AS RunningTotal\n  FROM\n    (\n    SELECT Substring('  ' + Coalesce(theString, '') + '|', f.iteration, 3) AS trigraph,\n      f.iteration, Count(*) AS frequency\n      FROM @RawValues AS p\n        CROSS JOIN\n\t\t(VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),\n\t\t(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),\n\t\t(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),\n\t\t(31),(32),(33),(34),(35),(36),(37),(38),(39),(40)\n\t\t) f(iteration)\n      WHERE Substring('  ' + Coalesce(theString, '') + '|', f.iteration, 3) &lt;&gt; ''\n      GROUP BY Substring('  ' + Coalesce(theString, '') + '|', f.iteration, 3),\n      f.iteration\n    ) AS Frequencies(trigraph, Iteration, frequency)\n  ORDER BY Frequencies.Iteration;\n\/* we create a primary key on iteration and trigraph.\n*\/\nALTER TABLE #Markov\nADD CONSTRAINT PK_Markov PRIMARY KEY NONCLUSTERED (iteration,trigraph);\nDECLARE @GeneratedWord TABLE (name VARCHAR(30))\nDECLARE @RowsWanted INT = (SELECT Count(*) FROM @RawValues)\nDECLARE @jj INT = 1,@RowCount INT, @ii INT, @NewWord VARCHAR(50) = '  ';\nDECLARE @EntireRange INT --for matching the sample distribution \n   = (SELECT Sum(frequency) FROM #Markov WHERE iteration=3)\nWHILE (@jj &lt;= @RowsWanted) --for as many data items needed\n  BEGIN\n\t--insert the first trigraph randomly, randomly choosing tied ranks \n   DECLARE  @random float=Rand();\n   SELECT  @NewWord = trigraph \n\tFROM #Markov\n\t  WHERE iteration=3 \n\t  AND  (@random * @EntireRange) BETWEEN (runningTotal-frequency) AND runningtotal\n    DECLARE @MaxLength INT = 50;\n\t--now we can use markov chains to assemble the word\n    SELECT @ii = 4, @RowCount = 1;\n    WHILE ((@ii &lt; @MaxLength) AND (@RowCount &gt; 0))\n      BEGIN--we can make our choice based on its distribution but \n\t  --it tends to reproduce more of the real data and less made up\n        SELECT TOP 1 @NewWord = @NewWord + Right(M.trigraph, 1)\n          FROM #Markov AS M\n          WHERE M.iteration = @ii\n            AND M.trigraph LIKE Right(@NewWord, 2) + '_'\n          ORDER BY NewId();\n        SELECT @RowCount = @@RowCount;\n        SELECT @ii = @ii + 1;\n      END;\n    INSERT INTO @GeneratedWord SELECT Replace(LTrim(@NewWord), '|  ', '');\n    SELECT @jj = @jj + 1;\n  END;\n--we will just select the made-up names. Otherwise just select all \nSELECT [@GeneratedWord].name \nFROM @GeneratedWord LEFT OUTER join @rawvalues\nON name= TheString\nWHERE TheString IS NULL\n<\/pre>\n<p>The results are like this but different every time, and all in a different order.<\/p>\n<p><sub>Palaysia, Jape of Man, Sermark, Monduras, Ukragaledone, Tanuatu, Sudandi, Bulginique, Belgary, Ghainia, Viribati, Soman, Czerbati, Huninica, Myan, Vataragua, Zamor-Ledor, Palta Rino, Freen, Huninica, Zamal, Tuvalia, Mauru, Amerra Leone, Madagal, Namboutica, Montserlands, South Arabia, Montserrabia, Zamanica, Costragua, Bulgary, Irazilla, Grence, Switria, Souta Republic, Kuwaica, Andia, Zamibatica, Barbabwe, Ukrael, Camanon, Andia, Sinlan, Swanada, Tontemala, Antarus, Puerra Leonia, Nepanon, Alba, Netherla, Icelanistates, Costerlands, Aussia, Aussia, Phinida, El Saltar, Virgyzstan, Siech Republic, Bolarus, Lestralia, New Cal, Andor-Ledone, Pordan, Marbia, Mayotho, Mexicana, Guinadad, Vieritania, Tuvalia, Myanda, Mayottenste, Honted Stan, Sriti, Eston, Indon, Thaila, Perbia, Armer, Camor-Lestein, Moli, Namen, Afghand, Branma, Virgyzstan, Berbia, Gerbabwe, Persey, Alba, Naurit, Fraq, Cubanica, Germuda, Souta, Braq, Arubabwe, Polawi, Nicronegro, Bolomboupe, Camorra, C\u00f4te of Man, Puerral, Guint Luciands, Spaila, Alba, Kuwaicania, Azech Korea, Siech Republic, Gibya, Kuwaitan Samoa, Unitzerrat, Vire, Domin, Saitenesia, Palivia, Eston, Romaican, Soutan, Mozakhstan, Indon, Green, Swanadad, Cambabwe, Latvinistan, Parba, Naudi Arat, Malvia, Saitrea, Armer, R\u00e9unidad, Kyrgium, Haitern City, Maudi Arabia, Montenistates, Colawi, Switina Faso, Rustanique, Philippino, Hainea, Domorras, Rominame, Phina, Nethuand, Andonesia, Rwazilan, Mauru, Leba, Serba, Sinlan, Giberia, Nortugala, Azech Korea, Timbodia, Alger, Senegalesh, Costreal, Yemeria, Hongapore, Amerto Rico, Denisia, South Arabia, Guina, Bermany, Sinlanistates, Camain Islands, Kire, Haiwanique, Azech Arabia and Mauru<\/sub><\/p>\n<p>How did we do this? We started with a table that took each word, added two spaces at the beginning and a |, followed by two subsequent spaces, at the end. This allowed us to map the frequency of each three-letter combination in a collection of words. Any language is made up of common combinations of characters with a few wild exceptions. For words to look right, they must follow this distribution. This distribution will change in various parts of a word, so you need all this information.<\/p>\n<p>So what would happen if, instead of feeding the name of countries into the batch, we do the names of people?<\/p>\n<p>We can pull them from a file, one name per line, like this (changing the path to the file, of course)<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true\">DECLARE @RawValues TABLE (theString nVARCHAR(30))\nINSERT INTO @RawValues(theString)\n   SELECT Value\n  FROM OpenJson( \n  (SELECT '[\"'+Replace(String_Escape\n\t\t\t\t(Replace(Convert(VARCHAR(MAX),BulkColumn),Char(13)+Char(10),'|')\n\t\t\t   ,'json')\n\t\t\t ,'|','\",\"')\n\t\t   +'\"]' FROM   \n   OPENROWSET(Bulk 'C:\\data\\RawData\\NamesFirstMale.txt', SINGLE_BLOB) [rowsetresults]))<\/pre>\n<p>And we get \u2026<\/p>\n<p><sub>Clydent, Abeldon, Joelby, Coycent, Dary, Nath, Clex, Lorey, Alaig, Ellan, Coristine, Demuel, Jossel, Jimonso, Dono, Tronzo, Bricence, Wile, Bredd, Jefferly, Sebarow, Brahallus, Donal, Camun, Jaynelin, Dantiso, Ezrared, Jodrifer, Willy, Anto, Amosevel, Jarwin, Reynallus, Vantiago, Taug, Ivoryl, Linstanie, Bra, Tormelan, Sim, Robingo, Ahman, Blanett, Seattin, Pablory, Rahsance, Jametrius, Sterrey, Arien, Alfone, Noraine, Loycent, Ran, Mackiell, Brycey, Bobene, Stant, Samun, Richolph, Tonelo, Aubricey, Ezrac, Cyrust, Tomad, Judsonso, Dunist, Criente, Gernonder, Eltoin, Fede, Cortiney, Jaymund, Kevil, Marcence, Thadfore, Rictos, Noey, Abraelas, Demastian, Darik, Sans, Dewelio, Sheod, Kelsel, Bobb, Albenzo, Emark, Allipe, Mituro, Ceddy, Kenionso, Cormain, Carredo, Jam, Waysses, Stevin, Bary, Emmelo, Adandore, Briastian, Kenver, Steryl, Colbenzo, Eduari, Erne, Edgare, Chasevel, Barcolm, Daval, Gonovani, Wyatty, Simonso, Efraelas, Lucarton, Traviell, Traiah, Lenjam, Huguston, Leidro, Chelby, Brew, Emmald, Garyl, Angenzo, Jeanles, Jac, Abraine, Chuce, Zacques, Cesuelo, Nicke, Benell, Masence, Jeral, Eddiert, Gracy, Edgart, Jendony, Lormaing, Micenzo, Vituron, Roeltobal, Migen, Clayne, Tobiandro, Miturow, Edmothan, Adandre, Derlon, Kylero, Stanco, Carrayne, Collacey, Stevine, Veric, Nelis, Judsony, Delvis, Lanaron, Cruck, Guishall, Angerick, Ellerow, Zack, Clishan, Chanuel, Neaunce, Colert, Ronneline and Donelio<\/sub><\/p>\n<p>\u2026and so on. The more data we use for our Markov table, the more convincing the generated strings. The drawback to this sort of data generation comes for the person who has a very uncommon name with unusual combinations of characters.\u00a0 It might be possible to detect from the generated names that this person was on the original list. This is something we have to deal with later on by detecting and eliminating outliers , but I&#8217;ll explain that later in this series.<\/p>\n<h2>Mimicking Discrete Distributions<\/h2>\n<p><strong>Rand(),<\/strong> like most random number generators, will give you random numbers between 0 and 1. It is a flat continuous distribution. If you\u2019ve read my blogs, you\u2019ll know how to get <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/getting-normally-distributed-random-numbers-in-tsql\/\">continuous normally distributed random numbers<\/a> from this. You need far more flexibility than even this. You need to mimic the distribution of your data.<\/p>\n<p>We will do this with discrete rather than continuous data. Imagine you have a column in a questionnaire. You allow \u201cYes\u201d, \u201cNo\u201d, \u201cOccasionally\u201d, \u201cMaybe\u201d and \u201cDon\u2019t know\u201d. 85 people said \u2018yes\u2019, 54 said \u2018no\u2019, 21 said \u2018occasionally\u2019, 14 said \u2018Maybe\u2019 and 37 said \u201cdon\u2019t know\u2019<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"378\" height=\"158\" class=\"wp-image-79739\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-67.png\" \/><\/p>\n<p>So how do you choose randomly in a way that matches this distribution?<\/p>\n<p>You multiply RAND() by 211 in order to give you a number between 0 and 211 You then pick the answer whose running total is greater but nearest to your number<\/p>\n<p>Here is the algorithm sketched out in SQL<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true\">DECLARE @QuestionnaireData TABLE (Question VARCHAR(15), Responses INT, RunningTotal INT);\nINSERT INTO @QuestionnaireData (Question, Responses, RunningTotal)\n  SELECT results.Question, results.Responses, results.RunningTotal\n    FROM\n      (\n      VALUES ('Yes', 85, 85), ('No', 54, 139), ('Occasionally', 21, 160),\n        ('Maybe', 14, 174), ('Don\u2019t know', 37, 211)\n      ) AS results (Question, Responses, RunningTotal);\nDECLARE @result TABLE (answer VARCHAR(15) NOT null);\nDECLARE @ii INT = 1, @iiMax INT = 211;\nWHILE (@ii &lt;= @iiMax)\n  BEGIN\n    INSERT INTO @result (answer)\n      SELECT TOP 1 [@QuestionnaireData].Question\n        FROM @QuestionnaireData\n        WHERE Rand() * 211 &lt; [@QuestionnaireData].RunningTotal\n        ORDER BY [@QuestionnaireData].RunningTotal;\n    SELECT @ii = @ii + 1;\n  END;\nSELECT answer, Count(*) AS responses FROM @result GROUP BY answer;\n<\/pre>\n<p>This will give you a different result every time you try it, but you\u2019ll see that the results stay close to the original distribution of answers. This is just a sketch of the algortithm, and you&#8217;ll notice that there is an easier and quicker way of doing the join, but we&#8217;ll demonstrate that later on in the article, once we go set-based.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"191\" height=\"155\" class=\"wp-image-79740\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-68.png\" \/><\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"192\" height=\"156\" class=\"wp-image-79741\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-69.png\" \/><\/p>\n<h2>Using Markov Data<\/h2>\n<p>We will use this principle to choose the first three characters for our words, which for empty strings will be spaces and an end-stop, which are subsequently removed. For subsequent characters we can use the Markov data. What characters will be likely to follow the preceding two at a particular position in a word? We can even adjust the choice according to the probability, though this is likely to generate more of the original strings.<\/p>\n<p>Here, as an example is the distribution data for the fourteenth position in the string.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"218\" height=\"346\" class=\"wp-image-79742\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-70.png\" \/><\/p>\n<p>The second column gives the character position in the string. The first column gives the three-character trigraph. At this stage, whatever the last two characters in your string may be, you have no choice: it is a simple lookup. If they are <strong><em>ar<\/em><\/strong>, then the next character must be \u2018<strong><em>a<\/em><\/strong>\u2019, if <strong><em>au<\/em><\/strong>, then you can only have<strong><em> l<\/em><\/strong>, and so on. If there had been two or more trigraphs starting <strong><em>ar<\/em><\/strong> or <strong><em>au<\/em><\/strong> then you would be able to roll the dice to decide. The column on the right after the frequency of the trigraph is the running total. We need this to select the appropriate character. The design of the Markov table I use has more information in it than is strictly necessary for the algorithm I&#8217;m using, but it allows you to try out different techniques according to the requirements you have. I select the third position in the string to start and select the entire trigraph randomly based on its frequency, because that will give a good approximation to the distribution of the original data. Then I select each character in turn entirely randomly so as to get a livelier divergence.<\/p>\n<h2>Going Set-based<\/h2>\n<p>We don\u2019t much like iteration in SQL Server, in fact we don\u2019t like it at all. However, any process that relies on random number presents difficulties, and we need to resort to modest trickery. We use a view that is as precious to keen data-spoofers like myself as a number table or calendar is to other SQL Server developers. It is a simple thing.<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true\">CREATE VIEW dbo.random\nAS\nSELECT number from (\n      VALUES \n\t    (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), \n\t    (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), \n\t    (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND()), (RAND())\n      ) AS random (number)\n<\/pre>\n<p>\u2026except I have much larger tables than this. With a 4000 row random table (SQL Server 2017), I do this\u2026.<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true\">SELECT answer, (count(*)\/4000.00)*211.00 FROM \n      (\n      VALUES ('Yes', 85, 85,1), ('No', 54, 139,2), ('Occasionally', 21, 160,3),\n        ('Maybe', 14, 174,4), ('Don\u2019t know', 37, 211,5)\n      ) AS results (answer, Responses, RunningTotal,theOrder)\n\t  INNER JOIN random ON (number*211) BETWEEN (runningTotal-responses) AND runningTotal\nGROUP BY answer\n<\/pre>\n<p>This gives the result \u2026<\/p>\n<pre class=\"theme:none lang:tsql decode:true\">answer proportions\n------------ ---------------------------------------\nYes          85.402250000\nNo           53.224750000\nOccasionally 19.886750000\nMaybe        14.506250000\nDon\u2019t know   37.980000000<\/pre>\n<p>\u2026which tells us that our distribution almost exactly matches the one we originally specified. We can now fake a questionnaire very closely!<\/p>\n<p>One huge advantage we now have is that we are allowed to put this in a table-valued function. We can\u2019t do this with anything that has the <strong>rand()<\/strong> function because there is a risk to having functions that don\u2019t return the same outputs for the same input. They restrict optimisation. By cheating we run a risk because we now have a function that has to be executed afresh every time. This, however, is what we want.<\/p>\n<p>So with, health warnings, here is the table-valued function. Here we are working to percentages rather than actual figures for the distribution.<\/p>\n<pre class=\"theme:ssms2012-simple-talk font:consolas font-size:13 line-height:16 wrap:true lang:tsql decode:true \">IF Object_Id('RandomAnswers') IS NOT NULL\n   DROP function RandomAnswers\nGO\nCREATE FUNCTION RandomAnswers\n\/**\nSummary: &gt; \ngives responses randomly, given a distribution of \nYes',  51%\n'No', 22%\n'Occasionally', 16%\n'Maybe', 7%\n'Don\u2019t know', 4%\n Author: Phil Factor\nDate: 12\/07\/2018\nExamples:\n   - Select * from RandomAnswers(50)\nReturns: &gt;\ntable with a simple list of random answers \n        **\/\n  (\n  @RowsToReturn int\n  )\nRETURNS TABLE\nAS\nRETURN\n  (\n  SELECT TOP (@RowsToReturn) results.answer\n    FROM\n      (\n      VALUES ('Yes',  51, 51), ('No', 22, 73),\n        ('Occasionally', 16, 89), ('Maybe', 7, 96),\n        ('Don\u2019t know', 4, 100)\n      ) AS results (answer, Responses, RunningTotal)\n      INNER JOIN\n        (\n        SELECT random.number, Row_Number() OVER (ORDER BY (SELECT 1)) AS theorder\n          FROM dbo.random\n        ) AS f(number, theOrder)\n        ON (f.number * 211) \n          BETWEEN (results.RunningTotal - results.Responses) AND results.RunningTotal\n    ORDER BY f.theOrder\n  )\nGO\n<\/pre>\n<h2>Onwards<\/h2>\n<p>So, we now seem to have an effective way of generating all sorts of data, and, in particular, the names of things for pseudonymizing data. At this stage, we have plenty of questions still. How effective is it for rule-based strings such as book numbers or postcodes? Just how far can we use this algorithm to generate data? How can we, in practical terms, do the work? How can you generate convincing data without seeing the original data? How do you generate continuous as well as discrete distributions?<\/p>\n<p>We will continue!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many times I\u2019ve been told, by developers who are using live data to develop a database, that it is impossible to anonymise or pseudonymize their data to comply with privacy legislation. One recurring explanation is that one can\u2019t duplicate the distribution of data by faking it. In a sense, this is true, because if you&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[68855],"coauthors":[6813],"class_list":["post-79738","post","type-post","status-publish","format-standard","hentry","category-blogs","tag-sql-provision"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79738","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=79738"}],"version-history":[{"count":12,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79738\/revisions"}],"predecessor-version":[{"id":104856,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79738\/revisions\/104856"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=79738"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=79738"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=79738"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=79738"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}