{"id":80366,"date":"2018-08-13T18:08:48","date_gmt":"2018-08-13T18:08:48","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=80366"},"modified":"2018-08-13T18:08:48","modified_gmt":"2018-08-13T18:08:48","slug":"spoofing-data-convincingly-doing-the-shuffle-and-zip","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-doing-the-shuffle-and-zip\/","title":{"rendered":"Spoofing Data Convincingly: Doing the Shuffle and Zip"},"content":{"rendered":"<p>Having spent a lot of my working life trying to preserve the integrity of data, there was a certain intriguing novelty in the idea of pseudonymizing data. One of the standard techniques of pseudonymization is that of shuffling data columns as though you are shuffling cards. The original values are kept but placed in the wrong rows. The problems come with uncommon values. If you are pseudonymizing a medical database that is required for research purposes on people with potentially embarrassing diseases, and it appears on the dark web, anyone with a rare or unusual surname or first-name comes up on the list, so the shuffle doesn\u2019t help the privacy of Fortescue Ceresole, or whatever his name may be.<\/p>\n<p>If you are spoofing data entirely, you don\u2019t necessarily have this problem because your constructed value will have no relationship to the original value. If it comes from a list of common names or if you randomly create a name \u2018Thomas\u2019, it will have no relationship to the original names in the database as long as you did things correctly and shuffle the list. Although a Markov string can produce an identical name that is uncommon, it can be eliminated from the list by an outer join with the original data.<\/p>\n<p>After you shuffle data, you \u2018zip\u2019 it. Zipping lists is something you come across in procedural programming, and <a href=\"https:\/\/msdn.microsoft.com\/en-us\/library\/dd267698%28v=vs.100%29.aspx?f=255&amp;MSPPError=-2147217396\">Linq has a good example<\/a>. A .net array has an order, and all you are doing is to join by the order of the element in the list. If you randomize that order, you get a shuffle.<\/p>\n<p>There are several ways to randomize the order of a list in SQL Server. Probably the easiest is to use the ORDER BY NewID() trick. This produces a different order every time. Here is a worked example using the superior window function.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SELECT TheFirstName.FirstName + ' ' + TheLastName.lastName AS TheName\r\n  FROM\r\n    (\r\n    SELECT Person.FirstName, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n      FROM AdventureWorks2016.Person.Person\r\n    ) AS TheFirstName(FirstName, TheKEY)\r\n    INNER JOIN\r\n      (\r\n      SELECT Person.LastName, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n        FROM AdventureWorks2016.Person.Person\r\n      ) AS TheLastName(lastName, TheKEY)\r\n      ON TheFirstName.TheKEY = TheLastName.TheKEY;<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"153\" height=\"403\" class=\"wp-image-80367\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/08\/word-image-46.png\" \/><\/p>\n<p>You can, of course, use static lists as tables that you keep for the purpose. (Yes, sir: I\u2019m an avid collector). We\u2019ll add an address.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SELECT TheFirstName.FirstName + ' ' + TheLastName.lastName + '; '\r\n       + TheTown.Town + ', ' + TheCity.city + ', ' + Thecounty.county\r\n       + ', UK' COLLATE DATABASE_DEFAULT AS TheNameAndAddress\r\n  FROM\r\n    (\r\n    SELECT Person.FirstName, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n      FROM AdventureWorks2016.Person.Person\r\n    ) AS TheFirstName(FirstName, TheKEY)\r\n    INNER JOIN\r\n      (\r\n      SELECT Person.LastName, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n        FROM AdventureWorks2016.Person.Person\r\n      ) AS TheLastName(lastName, TheKEY)\r\n      ON TheFirstName.TheKEY = TheLastName.TheKEY\r\n    INNER JOIN\r\n      (\r\n      SELECT town.town, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n        FROM dbo.town\r\n      ) AS TheTown(Town, TheKEY)\r\n      ON TheFirstName.TheKEY = TheTown.TheKEY\r\n    INNER JOIN\r\n      (\r\n      SELECT city.city, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n        FROM dbo.city\r\n      ) AS TheCity(city, TheKEY)\r\n      ON TheFirstName.TheKEY = TheCity.TheKEY\r\n    INNER JOIN\r\n      (\r\n      SELECT county.county, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n        FROM dbo.county\r\n      ) AS Thecounty(county, TheKEY)\r\n      ON TheFirstName.TheKEY = Thecounty.TheKEY;\r\n<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"384\" height=\"397\" class=\"wp-image-80368\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/08\/word-image-47.png\" \/><\/p>\n<p>It is actually much easier to do this type of process in SQL Data Generator from text lists if you are creating entirely new data for tests. The complication of doing it in code\u00a0 is that you need more of every list than the rows you have in the dataset. In this case we only produced only ninety-four items because there are only that many counties. You could of course produce a table of 94000 counties easily enough, but it is a bit clunky.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SELECT county.county, Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n  FROM dbo.county \r\n    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))f(number) \r\n    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))g(number) \r\n    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))h(number) \r\n    CROSS JOIN (VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10))i(number)\r\n<\/pre>\n<p>We will end by simply pseudonymizing a table. I know that the Adventureworks <strong>person.person<\/strong> table isn\u2019t real data but it will do as a demonstration, and it allows you to play along. Here we simply shuffle all the elements in the name and leave everything else so that keys are intact. Because we are just shuffling data, we get the correct number of rows.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true  \">DECLARE @Title TABLE (title NVARCHAR(80) NULL, TheKey INT NOT NULL PRIMARY KEY);\r\nDECLARE @FirstName TABLE (FirstName NVARCHAR(50) NOT NULL, TheKey INT NOT NULL PRIMARY KEY);\r\nDECLARE @MiddleName TABLE (MiddleName NVARCHAR(50) NULL, TheKey INT NOT NULL PRIMARY KEY);\r\nDECLARE @LastName TABLE (LastName NVARCHAR(50) NOT NULL, TheKey INT NOT NULL PRIMARY KEY);\r\nDECLARE @Suffix TABLE (Suffix NVARCHAR(10) NULL, TheKey INT NOT NULL PRIMARY KEY);\r\nINSERT INTO @Title (title, TheKey)\r\n  SELECT Person.Title, Row_Number() OVER (ORDER BY NewId()) FROM Person.Person;\r\nINSERT INTO @FirstName (FirstName, TheKey)\r\n  SELECT Person.FirstName, Row_Number() OVER (ORDER BY NewId()) FROM Person.Person;\r\nINSERT INTO @MiddleName (MiddleName, TheKey)\r\n  SELECT Person.MiddleName, Row_Number() OVER (ORDER BY NewId()) FROM Person.Person;\r\nINSERT INTO @LastName (LastName, TheKey)\r\n  SELECT Person.LastName, Row_Number() OVER (ORDER BY NewId()) FROM Person.Person;\r\nINSERT INTO @Suffix (Suffix, TheKey)\r\n  SELECT Person.Suffix, Row_Number() OVER (ORDER BY NewId()) FROM Person.Person;\r\n--\r\nSELECT BusinessEntityID, PersonType,\r\n  RowsWithoutName.NameStyle, TheTitle.title, TheFirstName.FirstName,\r\n  TheMiddlename.MiddleName, TheLastName.LastName, TheSuffix.Suffix,\r\n  RowsWithoutName.EmailPromotion, RowsWithoutName.AdditionalContactInfo,\r\n  RowsWithoutName.Demographics, RowsWithoutName.rowguid,\r\n  RowsWithoutName.ModifiedDate\r\n  FROM\r\n    (\r\n    SELECT Person.BusinessEntityID, Person.PersonType, Person.NameStyle,\r\n      Person.EmailPromotion, Person.AdditionalContactInfo, Person.Demographics,\r\n      Person.rowguid, Person.ModifiedDate,\r\n      Row_Number() OVER (ORDER BY NewId()) AS TheKey\r\n      FROM Person.Person\r\n    ) AS RowsWithoutName\r\n    INNER JOIN @Title AS TheTitle\r\n      ON RowsWithoutName.TheKey = TheTitle.TheKey\r\n    INNER JOIN\r\n  --add the randomised firstname field\r\n  @FirstName AS TheFirstName\r\n      ON RowsWithoutName.TheKey = TheFirstName.TheKey\r\n    INNER JOIN\r\n  --add the randomised MiddleName field as well\r\n  @MiddleName AS TheMiddlename\r\n      ON RowsWithoutName.TheKey = TheMiddlename.TheKey\r\n    INNER JOIN\r\n  --and the randomised lastname field\r\n  @LastName AS TheLastName\r\n      ON RowsWithoutName.TheKey = TheLastName.TheKey\r\n    INNER JOIN\r\n  --and lastly the randomised suffix field as well\r\n  @Suffix AS TheSuffix\r\n      ON RowsWithoutName.TheKey = TheSuffix.TheKey\r\n  ORDER BY RowsWithoutName.BusinessEntityID;\r\n<\/pre>\n<p>Which gives a different result, but of the right number of rows, every time!<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"732\" height=\"399\" class=\"wp-image-80369\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/08\/word-image-48.png\" \/><\/p>\n<h1>Summary<\/h1>\n<p>Shuffling and zipping are techniques that have a great value in pseudonymization but they aren\u2019t sufficient to produce spoof data. Shuffling is fine until you come across uncommon values which could give a clue to someone attempting to reverse the process. Remember that someone only needs to unmask just a few names to hit the headlines, especially if <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/cloud-data\/pseudonymization-inference-attack\/\">they include the medical records of the governor of Massachusetts<\/a>. However, zipping is a great way of inserting completely spoofed data into a table.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Having spent a lot of my working life trying to preserve the integrity of data, there was a certain intriguing novelty in the idea of pseudonymizing data. One of the standard techniques of pseudonymization is that of shuffling data columns as though you are shuffling cards. The original values are kept but placed in the&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[4183,4252],"coauthors":[6813],"class_list":["post-80366","post","type-post","status-publish","format-standard","hentry","category-blogs","tag-t-sql","tag-t-sql-programming"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80366","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=80366"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80366\/revisions"}],"predecessor-version":[{"id":80371,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80366\/revisions\/80371"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=80366"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=80366"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=80366"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=80366"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}