{"id":79898,"date":"2018-07-18T13:11:13","date_gmt":"2018-07-18T13:11:13","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=79898"},"modified":"2019-01-22T17:07:51","modified_gmt":"2019-01-22T17:07:51","slug":"spoofing-data-convincingly-masking-continuous-variables","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-masking-continuous-variables\/","title":{"rendered":"Spoofing Data Convincingly: Masking Continuous Variables."},"content":{"rendered":"<p>In any real numeric data from a database , you are only rarely going to see any sort of normal distribution of the values. Sales data will rise and fall according to the time of year and the economic cycle. The date of input of a record will vary with the workload. If you plot any real data, you are unlikely to see that bell-curve or straight line. It will look more like the welsh hills. (or, technically \u2018multimodal\u2019). Truly independent data will tend to conform with the bell-curve whereas dependent data tends to be multimodal.<\/p>\n<p>Imagine that you have a table giving invoice values. You will want your spoof data to conform with the same ups and downs of the real data over time. You may be able to get the overall distribution the same as the real data, but the resulting data would be useless for seeing the effect of last years sales promotion. The invoice values will depend on your sales promotions if your marketing people have done their job properly.<\/p>\n<p>By making your data the same distribution as your production data, you don\u2019t necessarily get the same strategy chosen by the query analyser, but you dramatically increase the chances of getting it. SQL Server uses a complex paradigm to select amongst its alternative plans for a query. It maintains distribution statistics for every column and index that is used for selecting rows. These aren\u2019t actually histograms in the classic sense, but they perform a similar function and are used by the SQL Server engine to predict the number of rows that will be returned.<\/p>\n<p>The easiest numeric data to spoof is an independent variable, so we can start with that.<\/p>\n<p>We will collect the current values and use them to create a<a href=\"https:\/\/en.wikipedia.org\/wiki\/Histogram\"> histogram<\/a>. We then use this to select the bin (I prefer the alternative term \u2018bucket\u2019 as the word \u2018bin\u2019 is too overloaded!) from which we then generate the random number. This will result in a distribution that is a bit more jagged than it should be, but any sort of interpolation is a bit of a distraction at this stage. It is good enough, because pseudonymization can lack finesse, yet be fit for the intended purpose.<\/p>\n<p>We aim to produce a simple function to produce a random float that can be used to stock the <strong>ModifiedDate<\/strong> column of <strong>AdventureWorks2016.Person.Person<\/strong> (which we have, of course, copied for the purpose.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">UPDATE AdventureWorksCopy.Person.Person \r\n\tSET modifiedDate=Convert(DATETIME,dbo.RandomFloat(@dh))\r\n<\/pre>\n<p><strong>@dh<\/strong> is our distribution Histogram. The rest is mere detail!<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">IF Object_Id('dbo.RandomFloatTable') IS NOT NULL\r\n   DROP function dbo.RandomFloatTable\r\nIF Object_Id('dbo.RandomFloat') IS NOT NULL\r\n   DROP function dbo.RandomFloat   \r\nIF Object_Id('dbo.CreateDistributionHistogram') IS NOT NULL \r\n   DROP function dbo.CreateDistributionHistogram\r\nGO\r\n\r\nIF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'DistributionHistogram')\r\nDROP TYPE DistributionHistogram\r\nIF EXISTS(SELECT * FROM sys.types WHERE name LIKE 'ContinuousVariable')\r\nDROP TYPE ContinuousVariable\r\n\r\nCREATE TYPE ContinuousVariable AS TABLE   \r\n(number FLOAT NOT null) \r\nGO  \r\n\r\nCREATE TYPE [dbo].[DistributionHistogram] AS TABLE(\r\n  Bucket INT NOT NULL, --the number in the sequence 1..40\r\n  StartValue FLOAT NOT NULL,--the starting value for membership of the bucket\r\n  EndValue FLOAT NOT NULL,--the end value for membership of the bucket\r\n  \"distribution\" INT NOT NULL, --the number of values in this bucket\r\n  RunningTotal INT NOT NULL, --the number of values in or lower than this point\r\n\tPRIMARY KEY CLUSTERED \r\n(\r\n\t[runningtotal] ASC,\r\n\t[distribution] ASC\r\n)WITH (IGNORE_DUP_KEY = OFF)\r\n)\r\nGO\r\n\r\nIF Object_Id('dbo.RandomFloatTable') IS NOT NULL\r\n   DROP function dbo.RandomFloatTable\r\nGO\r\nCREATE FUNCTION dbo.RandomFloatTable\r\n\/**\r\nSummary: &gt;\r\n  This returns a random float from\r\n  a population described by the distibution histogram\r\nAuthor: sa\r\nDate: 18\/07\/2018\r\nDatabase: PhilFactor\r\nExamples:\r\n   - Select * from dbo.RandomFloat(@dh)\r\n   - Select * from MyTable cross apply dbo.RandomFloat(MyColumn)\r\nReturns: &gt;\r\n  A table with a single result\r\n        **\/\r\n  (\r\n  @dh DistributionHistogram readonly\r\n  )\r\nRETURNS TABLE\r\n --WITH ENCRYPTION|SCHEMABINDING, ..\r\nAS\r\nRETURN\r\n  (\r\n  SELECT [@dh].StartValue\r\n         + ((SELECT randomnumber FROM SingleRandomNumber)\r\n            * ([@dh].EndValue - [@dh].StartValue)) AS RandomNumber\r\n    FROM @dh\r\n      CROSS JOIN\r\n        (\r\n        SELECT randomnumber * (SELECT Max(RunningTotal) FROM @dh)\r\n          FROM SingleRandomNumber\r\n        ) AS f(choice)\r\n    WHERE f.choice \r\n      BETWEEN [@dh].RunningTotal - [@dh].distribution AND [@dh].RunningTotal \r\n  );\r\n\r\ngo\r\nIF Object_Id('dbo.RandomFloat') IS NOT NULL\r\n   DROP function dbo.RandomFloat\r\nGO\r\nCREATE FUNCTION dbo.RandomFloat (@dh DistributionHistogram READONLY --our distribution histogram\r\n)\r\nRETURNS FLOAT\r\nAS\r\n  BEGIN\r\n    DECLARE @where INT, @randomnumber FLOAT; \/*a bucket on our distribution histogram\r\n\t where we create our number *\/\r\n    SELECT @where = randomnumber * (SELECT Max([@dh].RunningTotal) FROM @dh)\r\n      FROM SingleRandomNumber;\r\n    SELECT @randomnumber = --we create a random number within the range of the bucket\r\n      StartValue\r\n      + ((SELECT randomnumber FROM SingleRandomNumber)\r\n         * (EndValue - StartValue)\r\n        )\r\n      FROM @dh --the distribution histogram\r\n      WHERE @where BETWEEN RunningTotal - distribution AND RunningTotal;\r\n    RETURN @randomnumber;\r\n  END;\r\nGO\r\n\r\nIF Object_Id('dbo.CreateDistributionHistogram') IS NOT NULL \r\n   DROP function dbo.CreateDistributionHistogram\r\nGO\r\nCREATE FUNCTION dbo.CreateDistributionHistogram\r\n\/**\r\nSummary: &gt;\r\n  This creates a Distribution histogram from\r\n  a table that is just a list of FLOATs\r\nAuthor: Phil Factor\r\nDate: 18\/07\/2018\r\nDatabase: PhilFactor\r\nExamples:\r\n   - Select * from dbo.CreateDistributionHistogram(@InVar)\r\nReturns: &gt;\r\n  A table \r\n  Bucket, StartValue, EndValue,  \"distribution\",  RunningTotal\r\n        **\/\r\n (\r\n \t@InVar ContinuousVariable readonly\r\n )       \r\nRETURNS @histogram  TABLE  -- create our distribution map\r\n  (--Bucket, StartValue, EndValue,  \"distribution\",  RunningTotal\r\n  Bucket INT NOT NULL,\r\n  StartValue FLOAT NOT NULL,\r\n  EndValue FLOAT NOT NULL,\r\n  \"distribution\" INT NOT NULL,\r\n  RunningTotal INT NOT NULL\r\n  )\r\n\r\nAS\r\n-- body of the function\r\nBEGIN\r\n\t\tDECLARE @Maximum float,--the maximum value in the set\r\n\t        @Minimum float ,--the minimum value in the set\r\n\t        @increment FLOAT; --the span of values in each grouping\r\n\t--find the maximum and min imum values in the set\r\n\tSELECT @Maximum = MAX(number)+StDev(number), \r\n\t       @minimum = MIN(number)-StDev(number) \r\n\t  FROM @InVar;\r\n\t--find the width of each grouping, allowing for random outliers\r\n\tSELECT @increment= (@Maximum-@Minimum)\/40.00;\r\n\t  \r\n\t  WITH buckets\r\n\t  AS (SELECT OneToForty.value AS Bucket,\r\n\t        @Minimum + (@increment * OneToForty.value) AS StartValue,\r\n\t        @Minimum + (@increment * (OneToForty.value + 1)) AS EndValue\r\n\t        FROM\r\n\t          (\r\n\t          VALUES --the range of each 'bucket'\r\n\t            (1), (2), (3), (4), (5), (6), (7), (8), (9), (10), (11), (12), (13),\r\n\t            (14), (15), (16), (17), (18), (19), (20), (21), (22), (23), (24),\r\n\t            (25), (26), (27), (28), (29), (30), (31), (32), (33), (34), (35),\r\n\t            (36), (37), (38), (39), (40)\r\n\t          ) AS OneToForty (value) )\r\n\t    INSERT INTO @histogram (Bucket, StartValue, EndValue,  \"distribution\",  RunningTotal)\r\n\t\t SELECT allbuckets.Bucket, allbuckets.StartValue, allbuckets.EndValue,\r\n\t      Coalesce(Goodbuckets.coun, 1) AS \"distribution\",\r\n\t      Sum(Coalesce(Goodbuckets.coun, 1)) OVER (ORDER BY allbuckets.Bucket) AS RunningTotal\r\n\t      FROM buckets AS allbuckets\r\n\t        LEFT OUTER JOIN\r\n\t          (\r\n\t          SELECT buckets.Bucket, Count(*) AS coun\r\n\t            FROM @InVar iv\r\n\t              INNER JOIN buckets\r\n\t                ON iv.number BETWEEN buckets.StartValue AND buckets.EndValue\r\n\t            GROUP BY buckets.Bucket\r\n\t          ) AS Goodbuckets\r\n\t          ON Goodbuckets.Bucket = allbuckets.Bucket\r\n\t      ORDER BY allbuckets.Bucket \r\n   RETURN\r\nEND\r\nGO\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>Here are the first few rows with selected columns from the original table<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"499\" height=\"403\" class=\"wp-image-79899\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-149.png\" \/><\/p>\n<p>\u2026and the same rows, now pseudonymized<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"552\" height=\"401\" class=\"wp-image-79900\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-150.png\" \/><\/p>\n<p>We can now compare the distribution of the old and new data, using the routine I published here in \u2018<a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/t-sql-programming\/visual-checks-data-distributed-sql-server\/\">Visual Checks on How Data is Distributed in SQL Server\u2019<\/a><\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">DECLARE @FirstVariable IndependentVariable; --a simple table of numbers\r\nINSERT INTO @FirstVariable\r\n  (number)\r\n  SELECT Convert(FLOAT,(Person.ModifiedDate))\r\n  FROM Adventureworks2016.Person.Person;\r\nSELECT ColumnHistogram.line AS 'Original Modified date'\r\n  FROM dbo.ColumnHistogram(@FirstVariable)\r\n  ORDER BY ColumnHistogram.y DESC;\r\nGO\r\nDECLARE @SecondVariable IndependentVariable; --a simple table of numbers\r\nINSERT INTO @SecondVariable\r\n  (number)\r\n  SELECT Convert(FLOAT,(Person.ModifiedDate))\r\n  FROM AdventureworksCopy.Person.Person;\r\nSELECT ColumnHistogram.line AS 'Spoofed Modified date'\r\n  FROM dbo.ColumnHistogram(@SecondVariable)\r\n  ORDER BY ColumnHistogram.y DESC;\r\nGO\r\n\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"328\" height=\"362\" class=\"wp-image-79901\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-151.png\" \/><\/p>\n<p>The difference is due to the changed size and ranges of the buckets. It is good enough<\/p>\n<p>We now have a way of randomising independent data, but what if we need to retain the relationship between, for example, date of invoice and invoice amount, We\u2019d need this if we were needing to use data for practice and training in BI, for example. You\u2019ll see that there was a relationship between the modified date and the BusinessEntityID. The easiest way of doing this is to use the original value as a starting point.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">IF Object_Id('dbo.BlurredFloat') IS NOT NULL\r\n   DROP function dbo.BlurredFloat\r\nGO\r\nCREATE FUNCTION dbo.BlurredFloat\r\n    (@dh DistributionHistogram READONLY,@OriginalNumber float) --our distribution histogram\r\n\r\nRETURNS FLOAT\r\nAS\r\n  BEGIN\r\n    DECLARE @where INT, @randomnumber FLOAT; \/*a bucket on our distribution histogram *\/\r\n   SELECT @randomnumber = --we create a random number within the range of the bucket\r\n      StartValue\r\n      + ((SELECT randomnumber FROM SingleRandomNumber)\r\n         * (EndValue - StartValue)\r\n        )\r\n      FROM @dh --the distribution histogram\r\n      WHERE @OriginalNumber BETWEEN StartValue AND EndValue;\r\n    RETURN @randomnumber;\r\n  END;\r\nGO\r\n<\/pre>\n<p>&nbsp;<\/p>\n<p>Which gives a Modification date closer to the original, possibly at the cost of some loss in the degree of obfuscation.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"529\" height=\"403\" class=\"wp-image-79902\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-152.png\" \/><\/p>\n<p>So we are now making progress. A way of generating continuous variables as floats either entirely randomly or just blurred is going to be very useful because we can coerce a lot of datatypes into floats and back again. What about strings made up of several words? Perhaps it is time to show a way of doing this for a database.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>In any real numeric data from a database , you are only rarely going to see any sort of normal distribution of the values. Sales data will rise and fall according to the time of year and the economic cycle. The date of input of a record will vary with the workload. If you plot&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[68855],"coauthors":[6813],"class_list":["post-79898","post","type-post","status-publish","format-standard","hentry","category-blogs","tag-sql-provision"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79898","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=79898"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79898\/revisions"}],"predecessor-version":[{"id":79906,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79898\/revisions\/79906"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=79898"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=79898"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=79898"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=79898"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}