Imagine that you are the CIO of AdventureWorks. Out of a blue sky comes an order from Taxman that you supply details of all your sales, along with the tax charged to your customers. They want to be able to check that the total tax you pass on to them is correct, and is the same as the tax charged to the customers. They need the details of the purchases, without needing the details of the customers.
There was a time when you might have shrugged and just sent them a copy of the relevant data straight from your database, or even sent them the entire database. Nowadays you need to be more careful. After all, imagine the lurid stories if AdventureWorks became one of those databases that was available ‘everywhere’, after being copied onto the Dark Web? No, before distributing the data, you must obfuscate all the identifiable details of your customers, whilst still giving the required information to people who have a right to inspect your books.
After a difficult conversation with the team leader of your ops team, you decide that because of the complications of identifying and masking out all the sensitive data, it is impractical to deliver a ‘pseudonymized’ copy of the relevant tables from the database. Instead, you want to generate a data-masked version of your standard report, which the Tax Men can pore over.
How? Your developers say they could modify the existing reports slightly but obfuscating various fields, using SQL, would take a long time. Dust out that copy of SQL Data Generator (SDG) and upgrade it, if it tells you there are later versions, because with a little creativity it can do all the hard work for us.
Create the Reporting view
We’ll create a view containing all the data we need for the report to the tax man. Listing 1 just creates the single view we need, but you can create as many of these datasets as you want or need, within a single SDG project.
CREATE VIEW ReportForTaxMan
SELECT Person.Title, Person.FirstName, Person.MiddleName, Person.LastName,
PhoneNumberType.Name AS Type_of_phone, AddressType.Name AS TypeOfAddress,
Address.AddressLine1, Address.AddressLine2, Address.City, Address.PostalCode,
StateProvince.Name AS State_province, CountryRegion.Name AS Country,
SalesOrderHeader.ShipDate, Purchaser.Title AS Purchaser_Title,
Purchaser.FirstName AS Purchaser_Firstname,
Purchaser.MiddleName AS Purchaser_Middlename,
Purchaser.LastName AS Purchaser_Lastname,
Purchaser.Suffix AS Purchaser_Suffix
INNER JOIN AdventureWorks2016.Person.Address
ON BusinessEntityAddress.AddressID = Address.AddressID
INNER JOIN AdventureWorks2016.Person.Person
ON BusinessEntityAddress.BusinessEntityID = Person.BusinessEntityID
INNER JOIN AdventureWorks2016.Person.AddressType
ON BusinessEntityAddress.AddressTypeID = AddressType.AddressTypeID
INNER JOIN AdventureWorks2016.Sales.SalesOrderHeader
ON Address.AddressID = SalesOrderHeader.ShipToAddressID
INNER JOIN AdventureWorks2016.Person.PersonPhone
ON Person.BusinessEntityID = PersonPhone.BusinessEntityID
INNER JOIN AdventureWorks2016.Person.StateProvince
ON Address.StateProvinceID = StateProvince.StateProvinceID
INNER JOIN AdventureWorks2016.Person.CountryRegion
ON StateProvince.CountryRegionCode = CountryRegion.CountryRegionCode
INNER JOIN AdventureWorks2016.Person.PhoneNumberType
ON PersonPhone.PhoneNumberTypeID = PhoneNumberType.PhoneNumberTypeID
INNER JOIN AdventureWorks2016.Sales.Customer
ON SalesOrderHeader.CustomerID = Customer.CustomerID
INNER JOIN AdventureWorks2016.Person.Person AS Purchaser
ON Customer.PersonID = Purchaser.BusinessEntityID;
This view will be our data source, in the SDG project, but as also need to create an empty ‘target’ table, which SDG can use to create the masked report data. Listing 2 will build the empty
--we are creating a table do ensure it isn't there already
IF Object_Id('dbo.MaskedReport', 'U') IS NOT NULL DROP TABLE dbo.MaskedReport;
--this will be an empty table if we put in an impossible condition
SELECT * INTO MaskedReport FROM dbo.ReportForTaxMan WHERE 1=2
--so at this point we have a view with the real data and an empty table
--that will hold the masked data. We will then export this masked data
--SQL Data generator will fill this MaskedReport table
Creative use of the Existing Data Source in SQL Data Generator
Now we start SDG, create a new project, and specify AdventureWorks as our database.
You’ll notice our empty
MaskedReport table is listed. Click on it and you’ll see, in the lower pane, that SDG is straining at the leash, and has figured out strategies for filling in this table with fake data. This might be OK for user training on the AdventureWorks applications, but it will irritate the people in the Tax office more than a little. No, we must first introduce the real data, and then fill the corresponding columns before we then mask the columns that reveal personal data.
Therefore, we need to deselect every table and view (use the button at the bottom), except our masked table.
Instead of generating the data, we want SDG to take the data from our
ReportForTaxMan view, so set the table generation settings to use an existing data source.
Hit the Browse button, specify the connection to the same Adventureworks database and select the view.
Hit the Finish button, and SDG will populate the preview pane, for
MaskedReport table, with the original data from the view.
Masking the personal or sensitive data
Now we can overwrite the real information with the fake data, where necessary, column by column. I’ll just show how to mask one column,
PhoneNumber, to make the point, since the tax people don’t require these customer details.
Click on the column and the Column generation settings pane appears, above the data preview pane. Click on the dropdown list that specifies the current generator (which you’ll see is our view), expand the Personal category of generators and choose the Phone Number generator.
We can use the Phone Number generator without modification, but in my previous article I’ve offered plenty of examples of how to customize SDG’s built-in generators, as required.
We can then go on to do as much data masking as we need. Having obfuscated all personal or sensitive data, save the project and then hit the Generate Data button to run the generator, and populate the
MaskedReport table. Any existing data in the table will be cleared out, and replaced with a mixture of masked and real data.
Finally, inspect the contents of the
MaskedReport table in SSMS to make sure it all worked correctly.
This dataset is now ready for export, via BCP or by doing a JSON export! If you have several views, each view will insert into separate equivalent tables, of course. I’ve just shown the simplest example of how we can use SDG not only to generate fake data, but to fill tables with a mix of real and generated data.
Build an obfuscated version of the entire database
We can extend this technique to obfuscate the personal or sensitive data in an entire database, although this technique doesn’t scale well to large databases, and you’d be better off with a tool such as Data Masker for SQL Server, which also offers a range of techniques to scramble, shuffle and mask existing data, rather than just replace it with fake data.
However, in SDG, to do this we create a new version of the database, and start up SDG as if we were filling it with faked data. Select a table, and specify that SDG should populate it by mapping to another SQL table or view, as demonstrated in Figure 2 to 4 above.
For each subsequent table, it defaults to that same connection and pre-selects the table with the same name, which is a lot quicker. When you click the Finish button on the wizard, SQL Data Generator matches columns in the two tables based on data type and column name. You can then dal with any columns that need masking.
This works fine for small databases but doesn’t scale well because setting up the project initially requires every table to be specified individually, as using an existing data source, with a default connection. There is no button that allows you to match, at a stroke, all the tables of the empty target database with their counterparts in the source database.
Building just the part of the database that you want
Instead of selecting all the tables, you select just the related tables that provide the data that has been requested. You will need all the tables that are referenced by your
KEY constraints: they will need theirs too, and so on. In our case, we can stop with
Person.Person. This enables us to fill just a small subset of related tables. We then specify that we populate an entire table, or several columns in a table, by mapping to another SQL table or view, and then over-ride individual columns that you wish to mask,
Sometimes it fails because you’ve forgotten a dependent table. If so, concentrate on fixing the first error because these will cascade. If SDG fails with a table it presses on regardless, without the table, and you get errors compounded.
SQL Data Generator was, as its name suggests, intended to generate the data for databases, as required for testing, development work and training. However, it is perfectly at home with using the original data as the default source of the new data, and then substituting in fake data for any column that you want to mask. This is fine for a wide range of masking purposes, though it is slow and burdensome if you need to obfuscate an entire database. For this you’d need a tool such as Data Masker.
However, many requests for data extracts involve only a handful of tables, such as a customer list, and this is very easy to do in SQL Data Generator.
This is a guest post from Phil Factor.
Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.
Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.
He is a regular contributor to Simple Talk and SQLServerCentral.
Also in Hub
Using TOP in a SELECT statement without a subsequent ORDER BY clause is legal in SQL Server, but meaningless because asking for the TOP 10 rows implies that the data is guaranteed to be in a certain o...
Also in Product learning
People can, and do, argue a great deal about the relative merits of table variables and temporary tables. Sometimes, as when writing functions, you have no choice; but when you do you’ll find that b...
Also in SQL Data Generator
Generating realistic test data, which reflects accurately the nature and distribution of the data it is emulating, is a challenging task. The task is made more complex if you need to generate that dat...