{"id":79810,"date":"2018-07-17T09:33:47","date_gmt":"2018-07-17T09:33:47","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=79810"},"modified":"2026-05-08T10:25:25","modified_gmt":"2026-05-08T10:25:25","slug":"spoofing-data-convincingly-altering-table-data","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/other\/spoofing-data-convincingly-altering-table-data\/","title":{"rendered":"SQL Server Data Masking Techniques: Spoofing Existing Table Data for Test and Development"},"content":{"rendered":"<p><b>Test and development environments often need data that looks like production &#8211; realistic distributions, edge cases, representative volumes &#8211; but without the PII and compliance exposure of actual production data. Data masking (sometimes called anonymisation, obfuscation, or spoofing) is the practice of transforming production data in place or in a copy so that the structure, distributions, and edge cases remain realistic while the specific values no longer identify real people or transactions. <\/b><\/p>\n<p><b>This article covers three core masking techniques in SQL Server: name substitution from lookup tables (replace each real name with a random name from a reference dataset, preserving gender and cultural distribution), locality-preserving address transformations (keep city, state, and country while randomising street-level detail), and plausible-distribution-preserving obfuscation of dates and numerics (add random offsets within believable ranges rather than replacing with entirely new values). <\/b><\/p>\n<p><b>The goal: test queries and applications with data that exercises the same patterns as production, without exposing anyone&#8217;s real information.<\/b><\/p>\n<p>When you are developing an existing database, or demonstrating it, you nowadays need <a href=\"https:\/\/www.red-gate.com\/simple-talk\/cloud\/cloud-data\/pseudonymization-inference-attack\/\">pseudonymised<\/a> data, or even better, completely anonymized data. This data has to look right at first glance, and it needs to have the same distribution as the real data.<\/p>\n<p>Although we are yet to tackle continuous variables with complicated distributions such as sales figures or dates, we can now generate words. This article is about using a finite-state Markov chain with stationary transition possibilities. You can use the technique to generate words, sentences, <a href=\"https:\/\/www.red-gate.com\/simple-talk\/opinion\/opinion-pieces\/relational-algebra-and-its-implications-for-nosql-databases\/\">articles<\/a> <a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/t-sql-programming\/the-parodist-a-sql-server-application\/\">or entire books<\/a>.<\/p>\n<p><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/spoofing-data-convincingly-for-database-development-an-introduction\/\">In our first attempt to do a markov-style obfuscation of words<\/a>, we used an algorithm that was very iterative and was more useful for generating file-based lists of words than for SQL based pseudonymization. File-based lists are excellent for use with data masking, or data-generation, tools. However, if we are using SQL we have the additional resource of the markov table that we can use whenever we want. The advantage of this is that we can change the algorithm to apply to the markov table in order to generate different degrees of data obfuscation.<\/p>\n<p>Before we go on to obfuscate other datatypes than words, we\u2019ll take a bit of time to show how one might use Markov tables in order to pseudonymise tables in-place.<\/p>\n<p>Let\u2019s take, for example, the AdventureWorks person.person table: We want to be able to obfuscate it easily by means of :<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true  \">UPDATE AdventureWorks2016.Person.Person \n\tSET title=dbo.FakedStringFrom(@TitleMarkov),\n\tfirstName=dbo.FakedStringFrom(@FirstnameMarkov),\n\tMiddleName=dbo.FakedStringFrom(@MiddlenameMarkov),\n\tLastName=dbo.FakedStringFrom(@LastnameMarkov)\n<\/pre>\n<p>(No, I copied the database first!)<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"547\" height=\"415\" class=\"wp-image-79811\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-92.png\" \/><\/p>\n<p>Before you get over-excited, it took twenty-eight seconds to do the 20,000 rows of the task. You\u2019d do much faster with prepared tables of obfuscated words, or word collections in files used by a third-party tool, but it is a start.<\/p>\n<p>Here was the same data before we pseudonymized it.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-79817\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/before.png\" alt=\"\" width=\"580\" height=\"432\" \/><\/p>\n<p>We check the distributions of the data before and after we make the changes, to make sure we haven\u2019t altered it. Let\u2019s first check on the distributions of customers\u2019 <strong>lastname<\/strong>s before and after running our process.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"404\" height=\"433\" class=\"wp-image-79812\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-93.png\" \/><\/p>\n<p>And we can do the same with <strong>firstname<\/strong>s just to make sure we are matching the existing distributions<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"422\" height=\"435\" class=\"wp-image-79813\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/word-image-94.png\" \/><\/p>\n<p>We are taking a slow-but-steady approach. We rewrite our code from the previous blog post that assembles the string; it now uses a view to get its random numbers, and we\u2019ll speed it up slightly by putting a bit more intelligence into the markov table. We then put it in a slow <a href=\"https:\/\/www.red-gate.com\/simple-talk\/sql\/t-sql-programming\/sql-server-user-defined-functions\/\">User-defined Scalar function<\/a>. We \u00a0want a scalar function that isn\u2019t schema verified and is not considered to be deterministic. The reason for this is that it has to be executed every row despite having the same parameter.<\/p>\n<p>There are many ways to store the information permanently in a Markov table but we\u2019ll be using Table-valued parameters for our function. I\u2019ll show how they are generated from the original information in AdventureWorks, but they could be so easily fetched from a table of markov entries with each markov set identified by a name. This could be delivered to you by the production DBA so that you wouldn\u2019t need any access to the production server.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">\/* drop anything that uses the two types we need to create *\/\nIF Object_Id('dbo.MarkovTableFrom', 'IF') IS NOT NULL DROP FUNCTION dbo.MarkovTableFrom;\nGO\nIF EXISTS(SELECT Object_Id('dbo.FakedStringFrom')) DROP FUNCTION dbo.FakedStringFrom\nGO\n\/* create the two types we will use as table-valued parameters *\/\n \nIF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'StringValue')\n       DROP type StringValue \n\/* Create a table type to store strings in. *\/\n  CREATE TYPE StringValue AS TABLE (String NVARCHAR(50) NOT null);\nGO\nIF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'Markov')\n       DROP type dbo.Markov\nGO\n\/* Create a table type to store Markov data in. *\/\n\/* Object:  UserDefinedTableType dbo.Markov *\/ \nCREATE TYPE dbo.Markov AS TABLE\n  (\n  trigraph NCHAR(3) NOT NULL,\n  iteration TINYINT NOT NULL,\n  frequency INT NOT NULL,\n  totalInSet INT NOT NULL,\n  runningTotal INT NOT NULL,\n  TheOrder INT NOT NULL,\n  Choices INT NOT NULL,\n  TheEND BIT NOT NULL,\n  PRIMARY KEY CLUSTERED (iteration ASC, trigraph ASC) WITH (IGNORE_DUP_KEY = OFF)\n  );\nGO\n\n\/* Create a view to sneak into a function. Random\nnumbers are detected and they are not allowed.\n*\/\n \nDROP VIEW [dbo].[SingleRandomNumber]\nGO\nCREATE VIEW [dbo].[SingleRandomNumber]\nAS\nSELECT  RAND() as RandomNumber\nGO\n \nIF Object_Id('MarkovTableFrom') IS NOT NULL\n   DROP function MarkovTableFrom\nGO\nCREATE FUNCTION MarkovTableFrom\n\/**\n  Summary: &gt;\n   in our table, we will take every three-character trigraph in the word sequence \n   and calculate the frequency with which it appears in the collection passed to\n   the routine. \n   This is simple to do with a GROUP BY clause. \n   The quickest way I've found of doing this is by a cross-join with a number \n   table (I've used a VALUE table here to reduce dependencies). The result is then\n   fed to a window expression that calculates the running total of the\n   frequencies. You'll need this to generate the right distribution of \n   three-character triads within the list of sample words\n   This version of the Markov table has extra fields, more than is strictly\n   necessary, in order to make it easier to do set-based operations using it. \n \n  Author: Phil Factor\n  Date: 10\/07\/2018\n  Database: PhilFactor\n  Examples:\n     - Select trigraph, iteration, frequency, runningTotal, \n\t       theOrder, choices, TheEnd \n\t\t from dbo.MarkovTableFrom(@MyStringValues)\n  Returns: &gt;\n   A Markov Table\n          **\/\n  (@SampleStrings StringValue READONLY)\n RETURNS TABLE\n --WITH ENCRYPTION|SCHEMABINDING, ..\nAS\nRETURN -- (trigraph, iteration, frequency ,runningTotal, theOrder, choices, TheEnd)\n  (\n  With MarkovBasics AS (\n   SELECT Frequencies.trigraph, Frequencies.Iteration, Frequencies.frequency,\n    Sum(Frequencies.frequency) OVER \n      (PARTITION BY Frequencies.Iteration,Left(trigraph,2) \n         ORDER BY Frequencies.frequency,trigraph)\n       AS RunningTotal,\n  \tROW_NUMBER() OVER(PARTITION BY Substring(trigraph,1,2),iteration \n\t  ORDER BY (SELECT 1)) AS TheOrder,\t0 AS choices,\n  \tCASE WHEN Left(trigraph,2) LIKE '%|%' THEN 1 ELSE 0 END AS TheEnd,\n\tSum(Frequencies.frequency) OVER \n      (PARTITION BY Frequencies.Iteration,Left(trigraph,2) ORDER BY (SELECT 1))\n       AS totalInSet\n    FROM\n      (\n      SELECT \n          Substring('  ' + Coalesce(String, '') + '|', f.iteration, 3) AS trigraph,\n          f.iteration, Count(*) AS frequency\n        FROM @SampleStrings AS p\n          CROSS JOIN\n  \t\t(VALUES(1),(2),(3),(4),(5),(6),(7),(8),(9),(10),\n  \t\t(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),\n  \t\t(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),\n  \t\t(31),(32),(33),(34),(35),(36),(37),(38),(39),(40)\n  \t\t) f(iteration)\n        WHERE Substring('  ' + Coalesce(String, '') + '|', f.iteration, 3) &lt;&gt; ''\n        GROUP BY Substring('  ' + Coalesce(String, '') + '|', f.iteration, 3),\n        f.iteration\n      ) AS Frequencies(trigraph, Iteration, frequency)\n    )\n  SELECT \n    trigraph, f.iteration , frequency , runningTotal, \n    totalInSet, theOrder,f.choices,TheEnd\n  FROM MarkovBasics\n  INNER JOIN \n     ( --we add in the number of choices at any point and give each choice an id 1..n\n     SELECT Count(*) AS choices, Substring(trigraph,1,2) AS leading ,iteration\n       FROM MarkovBasics \n       GROUP BY Substring(trigraph,1,2),iteration\n     )f(choices, leading,iteration)\n  on f.iteration=MarkovBasics.iteration\n  AND f.leading=Substring(MarkovBasics.trigraph,1,2) \n  )\ngo\n \nIF EXISTS(SELECT Object_Id('dbo.FakedStringFrom')) \n    DROP FUNCTION dbo.FakedStringFrom\nGO\nCREATE FUNCTION dbo.FakedStringFrom\n\/**\n  Summary: &gt;\n  This function builds a string by random numbers\n  from a Markov table passed to it via a TVP. Each\n  time it is called, it gives a different string, and\n  in this algorithm, the first three characters of each\n  string are provided will fit the distribution of the \n  words used to create the markov table. The following\n  characters are chosen with equal probability just to \n  increase the 'noise' of the word (you can tweak this) \n \n  Author: Phil Factor\n  Date: 10\/07\/2018\n  Database: PhilFactor\n  Examples:\n     -   SELECT dbo.FakedStringFrom(@CountryMarkov)\n\t     FROM AdventureWorks2016.Person.Person AS P;\n  Returns: &gt;\n   A string\n          **\/\n    (@MarkovTable AS Markov readonly)\nRETURNS nvarchar(50)\n--WITH ENCRYPTION|SCHEMABINDING, ...\nAS\nBEGIN\n    DECLARE @RowCount INT, @ii INT, @NewWord VARCHAR(50) = '  ', @done INT=1\n\tDECLARE @EntireRange INT --for matching the sample distribution \n\t\t= (SELECT Sum(frequency) FROM @MarkovTable WHERE iteration = 3),\n\t\t@Random FLOAT = (SELECT TOP 1 randomnumber FROM SingleRandomNumber);\n\tDECLARE @MaxLength INT = 50;\n\t--now we can use markov chains to assemble the word\n\tSELECT @ii = 1, @RowCount = 1,@done=0;\n\tWHILE ((@ii &lt; @MaxLength) AND (@RowCount &gt; 0))\n\t\tBEGIN --we can make our choice based on its distribution but \n\t\t--it tends to reproduce more of the real data and less made up\n\t\tSELECT @NewWord = @NewWord + Right(M.trigraph, 1)\n\t\tFROM @MarkovTable M\n\t\tWHERE iteration = @ii\n\t\tAND M.trigraph LIKE Right(@NewWord, 2) + '_'\n\t\tAND (@Random * M.totalInSet) \n\t\tBETWEEN (runningTotal- frequency)\n\t\tAND runningTotal;\n\t\tSELECT @random= randomnumber FROM SingleRandomNumber;\n\t\tSELECT @RowCount = @@RowCount, @ii = @ii + 1;\n\t\tEND;\n\treturn Replace(LTrim(@NewWord), '|', '');\n END\nGO<\/pre>\n<p>Now we can try it out, using AdventureWorks.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">\/* Now, to test things out lets get some markov tables as TVPs.\nFirstly, lets get the firstnames from Adventureworks Person.Person.FirstName *\/\nDECLARE @firstnameStrings StringValue,@FirstnameMarkov Markov;\nINSERT INTO @firstnameStrings (String)\n  SELECT P.FirstName FROM AdventureWorks2016.Person.Person AS P;\nINSERT INTO @FirstnameMarkov\n    (trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,\n    TheEND)\n  SELECT trigraph, iteration, frequency, totalInSet, runningTotal,\n       theOrder, choices,TheEnd\n    FROM dbo.MarkovTableFrom(@firstnameStrings);\n\/* Now, lets get the Last Names from Adventureworks Person.Person.LastName *\/\nDECLARE @LastNameStrings StringValue,@LastNameMarkov Markov;\nINSERT INTO @LastNameStrings(String) \n  SELECT P.LastName\n    FROM  AdventureWorks2016.Person.Person AS P\nINSERT INTO @LastNameMarkov     \n    (trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,\n    TheEND)\n  SELECT trigraph, iteration, frequency, totalInSet, runningTotal,\n       theOrder, choices,TheEnd\n    FROM MarkovTableFrom(@LastNameStrings);\n\/* OK, lets get the Middle Names from Adventureworks Person.Person.MiddleName *\/\nDECLARE @MiddleNameStrings StringValue,@MiddleNameMarkov Markov;\nINSERT INTO @MiddleNameStrings(String) SELECT Coalesce(P.MiddleName,'')\n  FROM  AdventureWorks2016.Person.Person AS P\nINSERT INTO @MiddleNameMarkov \n    (trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,\n    TheEND)\n  SELECT trigraph, iteration, frequency, totalInSet, runningTotal,\n       theOrder, choices,TheEnd\n    FROM MarkovTableFrom(@MiddleNameStrings);\n\/* Lastly, we'll get the Titles from Adventureworks Person.Person.Title *\/\nDECLARE @TitleStrings StringValue,@TitleMarkov Markov;\nINSERT INTO @TitleStrings(String) SELECT Coalesce(P.Title,'')\n   FROM  AdventureWorks2016.Person.Person AS P\nINSERT INTO @TitleMarkov \n    (trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,\n    TheEND)\n  SELECT trigraph, iteration, frequency, totalInSet, runningTotal,\n       theOrder, choices,TheEnd\n    FROM MarkovTableFrom(@TitleStrings);\nSELECT \n    dbo.FakedStringFrom(@FirstnameMarkov) AS Firstname, \n    dbo.FakedStringFrom(@MiddlenameMarkov) AS Middlename,\n\tdbo.FakedStringFrom(@LastnameMarkov) AS lastname\n  FROM  AdventureWorks2016.Person.Person AS P<\/pre>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"alignnone size-full wp-image-80432\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2018\/07\/markov.png\" alt=\"\" width=\"222\" height=\"671\" \/><\/p>\n<p>Although this is fun, we need to move on to do more. You will want to try out what we\u2019ve done on that modification date in AdventureWorks. If you do, you will see that it will produce dates that are legal, but our algorithm won\u2019t produce the correct distribution. Even if we change the <strong>dbo.FakedStringFrom <\/strong>function to choose all the characters according to their statistical distribution it will fail to work convincingly with small samples. It needs a rather different technique. We\u2019ll talk about that in the next blog in this series.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Mask SQL Server production data for test and development environments &#8211; name substitution from lookup tables, locality-preserving address transformations, and plausible-distribution-preserving obfuscation of dates, numbers, and identifiers. Keeps data realistic enough for meaningful testing while removing PII. Complements Redgate Data Masker and similar tooling.&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[68855],"coauthors":[6813],"class_list":["post-79810","post","type-post","status-publish","format-standard","hentry","category-other","tag-sql-provision"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79810","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=79810"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79810\/revisions"}],"predecessor-version":[{"id":110413,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/79810\/revisions\/110413"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=79810"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=79810"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=79810"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=79810"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}