31 August 2018

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Phil Factor

31 August 2018

Pseudonymizing a Database with Realistic Data for Development Work

How to use SQL Data Generator, and PowerShell to obfuscate personal data (names), while retaining the same distribution of data, so that the test database behaves like the original.

Guest post

For this demonstration, we will take AdventureWorks and produce a pseudonymized copy for development work. The aim is to surgically alter just the data that can identify individuals but leave everything else intact. I’ve shown how to do something similar before, for a subset of the data, in Pseudonymizing your data with SQL Data Generator. However, this time we want to work with the whole database, and we also need to ensure that the new data has the same overall data distribution as the old data. We want the data to look uncannily like the real thing.

For the demonstration, we will restrict ourselves to removing just the names in a single table, and replacing them with randomised names, generated only from the characteristics of the real names, to get the same distribution as the original. Hopefully, the result is suitable for development work as well as testing.

Although we will use SQL Data Generator to do this, the principle also works with Data Masker for SQL Server, and you will want to use that tool if dealing with any relationships. The spoof data is generated according to a Markov distribution, in order to give it extra verisimilitude, and my Spoofing Data… series on Simple-Talk provides further details on using Markov chains in T-SQL, so I won’t repeat much of that detail here.

The Process

This process, once set up, is infinitely repeatable via PowerShell automation so that you can easily make it part of your daily build. It is designed to get over the problem that entirely generated data doesn’t ever conform to the same distribution of data as the original, or even to look much like the original data. Remember that were this real data, you would still need to keep the resulting data under the same security regime as the original, because it is pseudonymized rather than completely anonymized. Most of the data is left intact, so a determined hacker could identify a few individuals in pseudonymized data

Preparing a copy of the database with its data

In this demonstration, I’m using a database called Phasael, which is just copy of AdventureWorks2016, including the data. You can restore from a backup, attach a copy of the mdf file, or use the technique I show here, Deploying Multiple Databases from Source Control using SQL Change Automation, using SCA and BCP.

For the sake of this demonstration, we are only going to alter just one table, Person.Person. In reality, we’d need to find, and replace, all the personally identifiable information, such as the address information, and so on. We aren’t going to mess with any relationships, as defined by the FOREIGN KEY constraints. For that, you’d need Data Masker for SQL Server.

In SSMS, run Listing 1 on Phasael, to disable constraints. Don’t worry, this is a temporary thing you do when changing data and it is reversible, as I’ll eventually show.

1	EXEC sp_msforeachtable "ALTER TABLE ? NOCHECK CONSTRAINT all"

Listing 1

Preparing to make lists of data

Now we create the new data in SQL, using the real data as inspiration. This is done only once when starting this process, and thereafter only occasionally, if the characteristics of the original data change significantly.

We create a new utility database or use an existing one on the same server as the database we are obfuscating. To start, we create some utility procedures that we’ll use to generate lists of column values that contain spoof values that conform as closely as possible to real values.

Surnames, for example, might look like this: Serez, Wara, Garson, Subran, Ree, Perris, Gomirez, Zhans, Lopez, Riversen, Murkera, Herry, Munoz, Natterrez, Shan, Raje, Zhe, Kumarro, Simmon, Johns, Lopez, Liu, Kin or Gomez.

A Markov table merely records the frequency of every set of adjacent three-letter combinations (trigraphs). Factor, for example will be composed of ‘Fac’ (position 1), ‘act’ (position 2), ‘cto’ (position 3), and ‘tor’ (position 4). Add a few thousand last names into the mix, and there will be several trigraphs that can follow ‘Fac’. If we then select ‘Fac’ by random numbers, according to its frequency in the distribution we would choose one of them to get the next letter. Obviously, they would all start with ‘ac’. You can do it randomly, or by frequency in the population. We would choose the latter, though the former provides funnier results, and more akin to ‘smudging’. You then just continue finding each trigraph, each of which contributes a letter to the final string. When you reach the point where a trigraph is a string-terminator, then you’ve got your value.

100

101

102

103

104

105

106

107

108

109

110

111

112

113

114

115

116

117

118

119

120

121

122

123

124

125

126

127

128

129

130

131

132

133

134

135

136

137

138

139

140

141

142

143

144

145

146

147

148

149

150

151

152

153

154

155

156

157

158

159

160

161

162

163

164

165

166

167

168

169

170

171

172

173

174

175

176

177

178

179

180

181

182

/* drop anything that uses the two types we need to create */

IF Object_Id('dbo.MarkovTableFrom', 'IF') IS NOT NULL DROP FUNCTION dbo.MarkovTableFrom;

IF Object_Id('dbo.FakedStringFrom') IS NOT NULL DROP FUNCTION dbo.FakedStringFrom;

/* create the two types we will use as table-valued parameters */

IF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'StringValue') DROP TYPE StringValue;

/* Create a table type to store strings in. */

CREATE TYPE StringValue AS TABLE

(String NVARCHAR(50) NOT NULL,

Category INT NOT NULL DEFAULT 1);

IF EXISTS (SELECT * FROM sys.types WHERE types.name LIKE 'Markov') DROP TYPE dbo.Markov;

/* Create a table type to store Markov data in. */

/* Object: UserDefinedTableType dbo.Markov */

CREATE TYPE dbo.Markov AS TABLE

(trigraph NCHAR(3) NOT NULL,

iteration TINYINT NOT NULL,

frequency INT NOT NULL,

totalInSet INT NOT NULL,

runningTotal INT NOT NULL,

TheOrder INT NOT NULL,

Choices INT NOT NULL,

TheEND BIT NOT NULL,

Category INT NOT NULL PRIMARY KEY CLUSTERED

(iteration ASC, trigraph ASC, category) WITH(IGNORE_DUP_KEY = OFF));

/* Create a view to sneak into a function. Random

numbers are detected and they are not allowed.

IF Object_Id('dbo.SingleRandomNumber') IS NOT NULL

DROP VIEW dbo.SingleRandomNumber

CREATE VIEW [dbo].[SingleRandomNumber]

SELECT Rand() AS RandomNumber;

IF Object_Id('MarkovTableFrom') IS NOT NULL DROP FUNCTION MarkovTableFrom;

CREATE FUNCTION MarkovTableFrom

/**

Summary: >

in our table, we will take every three-character trigraph in the word sequence

and calculate the frequency with which it appears in the collection passed to

the routine.

This is simple to do with a GROUP BY clause.

The quickest way I've found of doing this is by a cross-join with a number

table (I've used a VALUE table here to reduce dependencies). The result is then

fed to a window expression that calculates the running total of the

frequencies. You'll need this to generate the right distribution of

three-character triads within the list of sample words

This version of the Markov table has extra fields, more than is strictly

necessary, in order to make it easier to do set-based operations using it.

Author: Phil Factor

Date: 10/07/2018

Database: Utility

Examples:

- Select trigraph, iteration, frequency, runningTotal,

theOrder, choices, TheEnd

from dbo.MarkovTableFrom(@MyStringValues)

Returns: >

A Markov Table

**/

(@SampleStrings StringValue READONLY)

RETURNS TABLE

--WITH ENCRYPTION|SCHEMABINDING, ..

RETURN -- (trigraph, iteration, frequency ,runningTotal, theOrder, choices, TheEnd)

(WITH MarkovBasics

AS (SELECT Frequencies.trigraph, Frequencies.Iteration,

Frequencies.frequency,

Sum(Frequencies.frequency) OVER (PARTITION BY Frequencies.Iteration,

Left(Frequencies.trigraph, 2)

ORDER BY Frequencies.frequency,

Frequencies.trigraph) AS RunningTotal,

Row_Number() OVER (PARTITION BY Substring(

Frequencies.trigraph, 1, 2),

Frequencies.Iteration

ORDER BY

(SELECT 1)) AS TheOrder, 0 AS choices,

CASE WHEN Left(Frequencies.trigraph, 2) LIKE '%|%' THEN 1 ELSE

0 END AS TheEnd,

Sum(Frequencies.frequency) OVER (PARTITION BY Frequencies.Iteration,

Left(Frequencies.trigraph, 2)

ORDER BY

(SELECT 1)) AS totalInSet,

Frequencies.category

FROM

(SELECT Substring(

' ' + Coalesce(p.String, '') + '|',

f.iteration,

3) AS trigraph, f.iteration,

Count(*) AS frequency, p.Category

FROM @SampleStrings AS p

CROSS JOIN

(VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),

(11),(12),(13),(14),(15),(16),(17),(18),(19),(20),

(21),(22),(23),(24),(25),(26),(27),(28),(29),(30),

(31),(32),(33),(34),(35),(36),(37),(38),(39),(40)

) f(iteration)

WHERE Substring(

' ' + Coalesce(p.String, '') + '|',

f.iteration,

3) <> ''

GROUP BY Substring(

' ' + Coalesce(p.String, '') + '|',

f.iteration,

3), f.iteration, p.Category) AS Frequencies(trigraph, Iteration, frequency, category) )

SELECT MarkovBasics.trigraph, f.iteration, MarkovBasics.frequency,

MarkovBasics.RunningTotal, MarkovBasics.totalInSet,

MarkovBasics.TheOrder, f.choices, MarkovBasics.TheEnd,

MarkovBasics.category

FROM MarkovBasics

INNER JOIN

( --we add in the number of choices at any point and give each choice an id 1..n

SELECT Count(*) AS choices,

Substring(MarkovBasics.trigraph, 1, 2) AS leading,

MarkovBasics.Iteration

FROM MarkovBasics

GROUP BY Substring(MarkovBasics.trigraph, 1, 2),

MarkovBasics.Iteration) f(choices, leading, iteration)

ON f.iteration = MarkovBasics.Iteration

AND f.leading = Substring(MarkovBasics.trigraph, 1, 2));

IF Object_Id('dbo.FakedStringFrom') IS NOT NULL DROP FUNCTION dbo.FakedStringFrom;

CREATE FUNCTION dbo.FakedStringFrom

/**

Summary: >

This function builds a string by random numbers

from a Markov table passed to it via a TVP. Each

time it is called, it gives a different string, and

in this algorithm, the first three characters of each

string are provided will fit the distribution of the

words used to create the markov table. The following

characters are chosen with equal probability just to

increase the 'noise' of the word (you can tweak this)

Author: Phil Factor

Date: 10/07/2018

Database: Utility

Examples:

- SELECT dbo.FakedStringFrom(@CountryMarkov)

FROM AdventureWorks2016.Person.Person AS P;

Returns: >

A string

**/

(@MarkovTable AS Markov READONLY)

RETURNS NVARCHAR(50)

--WITH ENCRYPTION|SCHEMABINDING, ...

BEGIN

DECLARE @RowCount INT, @ii INT, @NewWord VARCHAR(50) =

' ', @done INT = 1;

DECLARE @EntireRange INT --for matching the sample distribution

(SELECT Sum([@MarkovTable].frequency) FROM @MarkovTable WHERE [@MarkovTable].iteration = 3),

@Random FLOAT =

(SELECT TOP 1 SingleRandomNumber.RandomNumber FROM SingleRandomNumber);

DECLARE @MaxLength INT = 50;

--now we can use markov chains to assemble the word

SELECT @ii = 1, @RowCount = 1, @done = 0;

WHILE((@ii < @MaxLength) AND (@RowCount > 0))

BEGIN --we can make our choice based on its distribution but

--it tends to reproduce more of the real data and less made up

SELECT @NewWord = @NewWord + Right(M.trigraph, 1)

FROM @MarkovTable M

WHERE M.iteration = @ii

AND M.trigraph LIKE Right(@NewWord, 2) + '_'

AND

--(choices=1

--OR

(@Random * M.totalInSet) BETWEEN (M.runningTotal

- M.frequency) AND M.runningTotal;

SELECT @Random = SingleRandomNumber.RandomNumber FROM SingleRandomNumber;

SELECT @RowCount = @@RowCount, @ii = @ii + 1;

END;

RETURN Replace(LTrim(@NewWord), '|', '');

END;

Listing 2

Making the lists of data as tables

Now we can generate lists of the randomised spoof data, from markov chains. There will be a table for each list, just to keep things simple. We will generate lists for the FirstName, MiddleName, LastName, Title and Suffix columns.

This will take some time but remember that you only need to do it once, until the nature of the original data changes significantly.

I’ve generated this script by cut ‘n paste from a single batch that did just one column, so this is less arduous than it looks. You can, of course, do this with a stored procedure, but even that isn’t entirely as elegant as you’d imagine.

Firstly, let's get the firstnames from Adventureworks Person.Person.FirstName */

DECLARE @firstnameStrings StringValue, @FirstnameMarkov Markov;

INSERT INTO @firstnameStrings (String)

SELECT P.FirstName FROM AdventureWorks2016.Person.Person AS P;

INSERT INTO @FirstnameMarkov

(trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,

TheEND, category)

SELECT trigraph, iteration,

frequency, totalInSet,

runningTotal, theOrder,

choices, TheEnd, category

FROM dbo.MarkovTableFrom(@firstnameStrings);

SELECT dbo.FakedStringFrom(@FirstnameMarkov) AS Firstname

INTO FirstName

FROM AdventureWorks2016.Person.Person AS P;

/* Now get the middlenames from Adventureworks Person.Person.middleName */

DECLARE @middlenameStrings StringValue, @middlenameMarkov Markov;

INSERT INTO @middlenameStrings (String)

SELECT Coalesce(P.MiddleName, '') FROM AdventureWorks2016.Person.Person AS P;

INSERT INTO @middlenameMarkov

(trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,

TheEND, category)

SELECT trigraph, iteration,

frequency, totalInSet,

runningTotal, theOrder,

choices, TheEnd, category

FROM dbo.MarkovTableFrom(@MiddleNameStrings);

SELECT dbo.FakedStringFrom(@middlenameMarkov) AS middlename

INTO middleName

FROM AdventureWorks2016.Person.Person AS P;

-- and the last names

DECLARE @LastnameStrings StringValue, @LastnameMarkov Markov;

INSERT INTO @LastnameStrings (String)

SELECT P.LastName FROM AdventureWorks2016.Person.Person AS P;

INSERT INTO @LastnameMarkov

(trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,

TheEND, category)

SELECT trigraph, iteration,

frequency, totalInSet,

runningTotal, theOrder,

choices, TheEnd, category

FROM dbo.MarkovTableFrom(@LastNameStrings);

SELECT dbo.FakedStringFrom(@LastnameMarkov) AS Lastname

INTO LastName

FROM AdventureWorks2016.Person.Person AS P;

---and the suffix

DECLARE @SuffixStrings StringValue, @SuffixMarkov Markov;

INSERT INTO @SuffixStrings (String)

SELECT Coalesce(P.Suffix, '') FROM AdventureWorks2016.Person.Person AS P;

INSERT INTO @SuffixMarkov

(trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,

TheEND, category)

SELECT trigraph, iteration,

frequency, totalInSet,

runningTotal, theOrder,

choices, TheEnd, category

FROM dbo.MarkovTableFrom(@SuffixStrings);

SELECT dbo.FakedStringFrom(@SuffixMarkov) AS Suffix

INTO Suffix

FROM AdventureWorks2016.Person.Person AS P;

-- finally the titles

DECLARE @TitleStrings StringValue, @TitleMarkov Markov;

INSERT INTO @TitleStrings (String)

SELECT Coalesce(P.Title, '') FROM AdventureWorks2016.Person.Person AS P;

INSERT INTO @TitleMarkov

(trigraph, iteration, frequency, totalInSet, runningTotal, TheOrder, Choices,

TheEND, category)

SELECT trigraph, iteration,

frequency, totalInSet,

runningTotal, theOrder,

choices, TheEnd, category

FROM dbo.MarkovTableFrom(@TitleStrings);

SELECT dbo.FakedStringFrom(@TitleMarkov) AS Title

INTO Title

FROM AdventureWorks2016.Person.Person AS P;

Listing 3

Let’s just test out what we’ve done. Whereas the original had 56 people whose first name started with ‘Aar‘, all of which were Aaron, the Markov chains have introduced a rich variety of 49 names beginning with ‘Aar‘, because the string ar_ (where ‘_‘ means any character!) occurs with many different third characters in the data at the second position.

1 2	SELECT * FROM dbo.FirstName WHERE FirstName.Firstname LIKE 'Aar%' SELECT FirstName FROM Adventureworks2016.person.person WHERE Firstname LIKE 'Aar%'

Listing 4

It yields the results in Figure 1.

Figure 1

We can see this variety by using this query:

SELECT Count(*) AS Frequency, Substring(Person.FirstName, 2, 3) AS Trigraph

FROM AdventureWorks2016.Person.Person

WHERE Person.FirstName LIKE '_ar%'

GROUP BY Substring(Person.FirstName, 2, 3)

ORDER BY Count(*) DESC;

Listing 5

Which gives:

Figure 2

Note that some data can’t be generated via Markov chains. Credit cards, for example, require a different technique, see Spoofing Data Convincingly: Credit Cards. Also, I show how to do dates in Spoofing Data Convincingly: Masking Continuous Variables.

The whole point is to make the spoofed data convincing enough to fool the casual glance, and pass obvious CHECK constraints, whilst keeping the distribution of the original.

Copying the tables to files

Now we have these lists as tables, it is the work of a moment to turn them into ASCII lists in your filesystem. We’ll use PowerShell for this. This script will do any lists of a column in a table in a database, as long as it is on the source server, defined by $SourceServerName.

$SourceServerName = 'MyServer' # the server which contains the tables we take the lists from

$SourceLogin = 'Mylogin' #Leave blank for Windows authentication

$SourceOfLists = @( #specify the column and the qulalified table name,

@{ ColumnName = 'Firstname'; TableName = 'MyDatabase.dbo.Firstname'; Filename = 'Firstname' },

@{ ColumnName = 'MiddleName'; TableName = 'MyDatabase.dbo.MiddleName'; Filename = 'MiddleName' },

@{ ColumnName = 'Lastname'; TableName = 'MyDatabase.dbo.Lastname'; Filename = 'Lastname' },

@{ ColumnName = 'Title'; TableName = 'MyDatabase.dbo.Title'; Filename = 'Title' },

@{ ColumnName = 'Suffix'; TableName = 'MyDatabase.dbo.Suffix'; Filename = 'Suffix' }

)

$DestionationListDirectory = 'S:\work\SQL\Lists\'#or wherever

#none of our errors are recoverable

$ErrorActionPreference = "Stop"

$MS = 'Microsoft.SQLServer'

if (-not (Test-Path -PathType Container $DestionationListDirectory))

{

# we create the script directory (normally you get static data from source control

$null = New-Item -ItemType Directory -Force -Path "$DestionationListDirectory";

} #now we can save each list

<a id="post-465983-_Hlk522725608"></a>#create a connection object

# is this Windows Authentication or UserID/Password credentials?

$connection = New-Object -TypeName System.Data.SqlClient.SqlConnection

$connectionString = "Server=$SourceServerName;Integrated Security=True;"

if ($SourceLogin -ne '') #if no login specified, then it was a windows login

{

$connectionString = "Server=$SourceServerName;Integrated Security=False;"

if (Test-Path -path "$env:USERPROFILE\$SourceLogin-$SourceServerName.txt" -PathType leaf)

{

#has already got this set for this login so fetch it

$encrypted = Get-Content "$env:USERPROFILE\$SourceLogin-$SourceServerName.txt" | ConvertTo-SecureString

$encrypted.MakeReadOnly()

$Credentials = New-Object System.Data.SqlClient.SqlCredential($SourceLogin, $encrypted)

}

else

{

#hasn't got this set for this login

$AutomationCredentials = get-credential -Credential $SourceLogin

$AutomationCredentials.Password | ConvertFrom-SecureString |

Set-Content "$env:USERPROFILE\$SourceLogin-$SourceServerName.txt"

($password = $AutomationCredentials.Password).MakeReadOnly()

$Credentials = New-Object System.Data.SqlClient.SqlCredential(

($AutomationCredentials.UserName -replace '^.*\\|@.*$'),

$password

)

}

$connection.Credential = $Credentials

$connection.ConnectionString = $connectionString

$connection.Open() | out-null

$handler = [System.Data.SqlClient.SqlInfoMessageEventHandler] {

param ($sender,

$event) $global:message = "$($message)`n $($event.Message)"

};

$connection.add_InfoMessage($handler);

$connection.FireInfoMessageEventOnUserErrors = $true

$SourceOfLists | foreach{

$list = $_;

$cmd = new-Object System.Data.SqlClient.SqlCommand(@"

Select $($list.ColumnName) from $($list.TableName)

"@, $connection)

$cmd.CommandTimeout = 10

$rdr = $cmd.ExecuteReader() #because we are getting a result back

$columnWeWant = $list.ColumnName

{

$datatable = new-object System.Data.DataTable

$datatable.Load($rdr)

$TheList=$datatable | foreach{ $_."$columnWeWant" }

[io.file]::WriteAlllines("$($DestionationListDirectory)$($list.Filename).txt",$theList)

}

while ($rdr.IsClosed -eq $false)

$rdr.Close()

}

Listing 6

If you run this, then you will find your lists written out as text files in your file system.

Figure 3

If you do a lot of database testing, you are likely to end up with many of these lists, and they are very handy for the purpose. It is possible to stitch these into tables using a technique I demonstrate in the blog ‘Spoofing Data Convincingly: Doing the Shuffle and Zip’.

Creating the SQL Data Generator project file

Now we are ready to create the SQL Data Generator project file (.sqlgen)that we can then use to automate this process.

Open SQL Data Generator (SDG), start a new project, and set it to the database we need to pseudonymize, Phasael. Then, exclude all the tables, as in the illustration, by clicking on Actions | Exclude All.

Figure 4

SDG now has nothing to do, so our next job is to select the Person.Person table:

Figure 5

You will see a warning to the effect the Person.Person is referenced by a FOREIGN KEY constraint, so SDG can’t delete the data in the table before filling it. However, it doesn’t know that we already disabled the CHECK constraints, so we can ignore these warnings.

Now we specify a generator for the entire table, which will be used for all the columns as appropriate. At the moment, SDG has only one table-level generator, but it is extremely powerful, because individual columns can be subsequently over-ridden.

Figure 6

Hit Browse… to select an existing data source and specify the server and database to use as the source of the table. By default, SDG will look for the source table in the Phasael database, but of course the source mustn’t be the same database and table as the one we are stocking with data, so specify the original copy of the database, AdventureWorks2016, instead.

Select the Person.Person table and hit Finish. We have now populated our Person.Person table in Phasael with the original data. SQL Data Generator is nervous at this point, because it hasn’t detected that we’ve disabled constraints. Fortunately, SQL Data Generator will always attempt an operation, despite its warnings and misgivings.

Figure 7

If we select any column in the preview pane, above, we can over-ride its generator. We want to therefore over-ride the generator for our individual name-related columns, while keeping the rest of the data intact.

Figure 8

We want the File List generator, so we click it and SDG will choose one of the pre-supplied file lists from its Config folder. Hit Browse, and point it instead to our FirstName.txt file, and lo…the spoof data from our SQL and PowerShell machinations is now in the preview pane.

Figure 9

Now we just repeat this process for the four other lists, and we get this:

Figure 10

So, now, the names bear no resemblance to the originals. The final step is to see if our generator will work despite SQL Data Generator’s nervous warnings, so hit Actions | Generate Data and SDG will present you with this action plan:

The data generation actions are listed in execution order.
Delete

Removing all the rows from table [Person],[Person] using 'DELETE FROM [Person], [Person]"

Generate Data

Generating data for table [Person],[Person] with data from [MyServer].[AdventureWorks2016].[Person].[Person],

The number of rows inserted will be the same as the number of rows in the source data

And when you hit the Generate Data button, it does it. Once the data generation process completes, you now have the Phasael database, with obfuscated name data.

Make sure that you save the .sqlgen project file, which is an XML file with all your settings. With this, you can repeat the process as many times you need. If you plan to run this pseudonymization process often, it is worth nailing down all your settings, such as the location and name of lists, and the servers you use for the process. These SQLGEN projects are easily modified, but they are laborious.

The last thing you need to do is pop back over to SSMS and re-enable constraints on the our partially-obfuscated Phasael database.

1	EXEC sp_msforeachtable "ALTER TABLE ? WITH CHECK CHECK CONSTRAINT all"

Listing 7

Tidying up

This is, of course, only a demonstration, but if you are doing this for real, you need to be careful not to leave artefacts. We deleted the old rows from the database, but you have to be sure that you haven’t left the old data in the transaction log, or the data pages, when doing this for real with sensitive data. The best way is to BCP all the data across, from the database to a freshly-built copy, but this is only practical with small databases.

Automating the process

I show how to automate a SQL Data generation in my article, Automatically filling your SQL Server test databases with data. To see how to do this for a whole range of development machines, see Automatically build-and-fill multiple development databases using PowerShell and SQL Data Generator.

Conclusions

The worst part of preparing development databases with data that has the characteristics of the real data, but without any real personal data, is the time it takes to do it. Plenty of solutions work with small amounts of data. With this solution, once the SQLGEN file is prepared, the only slow parts are the preparation of the spoof data and the actual deletion of the existing data.

The lists of spoofed data don’t have to be of the entire set of data, since SQL Data Generator will cheerfully repeat its contents randomly for as long as you wish, and you only create the lists once. The deletion process won’t worry you much if it is automated and can run overnight.

I’d like to emphasize that this process can be done without SQL Data Generator, but only with a lot of hard work, testing and maintenance. There is no excuse nowadays for developing databases with production data that contains sensitive or personal data.

Product articles SQL Data Generator Generating Realistic Test Data
Pseudonymizing a Database with…

Guest post

Pseudonymizing a Database with Realistic Data for Development Work

Guest post

The Process

Preparing a copy of the database with its data

Preparing to make lists of data

Making the lists of data as tables

Copying the tables to files

Creating the SQL Data Generator project file

Tidying up

Automating the process

Conclusions

Tools in this post

SQL Data Generator

SQL Toolbelt

SQL Data Generator

Product articles SQL Data Generator Generating Realistic Test Data Pseudonymizing a Database with…

Guest post

Pseudonymizing a Database with Realistic Data for Development Work

Guest post

The Process

Preparing a copy of the database with its data

Preparing to make lists of data

Making the lists of data as tables

Copying the tables to files

Creating the SQL Data Generator project file

Tidying up

Automating the process

Conclusions

Tools in this post

SQL Data Generator

SQL Toolbelt

SQL Data Generator

You may also like

Comparing SQL Server Databases using DacPacs and SQL Compare

How to reformat a database in one operation

Generating realistic dates using SQL Data Generator and Python

Exploring New SQL Prompt Features in SQL Toolbelt and SQL Toolbelt Essentials

Checking on Failed Server Logins, Server Errors and Warnings using SQL Monitor

Product articles SQL Data Generator Generating Realistic Test Data
Pseudonymizing a Database with…