Unmasking SQL Server Dynamic Data Masking, Part 4, Unmasking Formatted and Unformatted Text

This is part of a series on Dynamic Data Masking by Ben Johnston. For the rest of the series, click here

Continuing from the previous entry in this series on security concerns with using Dynamic Data Masking, in this blog I want to close out that discussion showing how you can unmask different sort of data that has been dynamically masked, even if you don’t have access to unmasked data.

Clearly this is NOT a suggestion for how you might break the text but is far more of an exercise to show you how a bad actor may attempt to look at your data in ways that would generally not cause red flags.

It is especially important to reinforce the sentiment that Dynamic Data Masking less of a security tool to prevent attacks, but more to hide data from general viewing, and as a tool for building applications where the data still is accessible in some scenarios and not others.

In this blog I will focus first on known formatted values, then will show a technique for brute force unmasking data. The big thing to understand about dynamic masking is that unmasking by a bad actor is only necessary if they are accessing the data as a normal user. If they have acquired your database, they will have unfettered access to the data, unlike if you use any kind of static masking. Static masking is used to hide data values permanently, as the data is overwritten. That makes it great for development data, but not when the masked data has a meaning to other users.

Formatted Text

The process for unmasking IP addresses, social security numbers (SSN), or any specialized format, is to understand the valid values for the format, break the column down into manageable sections, join to a table containing unmasked versions of the entire target set, and show the results using a combination of the masked table and the unmasked table. When joining to the unmasked table, the join will typically happen multiple times, one join for each masked section. The multiple joins to these small sections make it very fast but they also complicate the query a little. Considering that unmasking data isn’t normal functionality, the complexity is to be expected, and is like any other join.

IP Addresses

For the IP address unmasking a simple numbers table is used. Each octet is separated using a substring and joined to the numbers table. This is where the data exploration becomes critical. IP addresses are represented using a standard format, but that doesn’t mean the database developers storing the IP address followed those standards. If someone was trying to save space or just used a different format for unknown reasons, that format will need to be determined.

Changing the delimiter is not a substantive change to the unmasking process since the delimiter is skipped in this sample. In this first script will set up objects to contain some fake IP addresses to unmask. (Note, setup is the same as used in part 3)

Next, I will use the following script to unmask the IP address values we just created and captured.

Output from the above script is shown below. The IP octets are broken apart and attacked individually, making the processing much faster.

A picture containing text, font, screenshot, number

Description automatically generated

Social Security Numbers

Social security numbers are typically stored as a string and will either contain dashes with a format of 000-00-0000, or without the dashes as a string of 9 numeric characters. This can be broken down and managed in sections, like the IP address example, or it could be done as a single numeric string.

The numeric string can be done via a very large rainbow table. A very rough estimate of the size of the table 999999999 (rows) * 9 bytes. Using my handy SQL Server calculator, that is over 9.3 GB. It will get noticed by administrators or developers watching the database. It will also take some time to create. This is a good example of the issue with rainbow tables. They also get more complicated as you add characters. The example above uses 0-9 only. To add basic Latin characters, you go from 10 to the power of 9 possible combinations to 32 to the power of 9. Clearly using brute force via a rainbow table is not a good option for anything larger than a varchar(5) or maybe varchar(6), depending on your space.

 As you can see below, it quickly gets out of hand.

Character Set

Maximum Column Size

MB

GB

0-9 (10)

9

 

9.3

0-9, a-z (36)

10

 

33.5

0-9, a-z (36)

10

 

335.3

 

Unmasking with a single table and join is simpler, but the setup takes longer and requires much more space. This falls in the category of suspicious activity and may get DBAs looking at your queries much faster. This is the kind of activity we will discuss in countermeasures. If you are using this to unmask data where you have legitimate access, it is much easier to just run the query as a different user. If you find a table like this on your database, especially a temp table, you should do some investigations and wonder about the purpose of this table.

The example below shows unmasking via a single join against a very large table and breaking it down into sections. The version using sections is very specialized but executes faster and doesn’t require a huge support table.

First, I will create some sample data to work with:

Using the table we just created, will try to brute force match the data: (Note, read on before executing this script as it will take a very long time and use a lot of resources.)

As noted, the script is very resource intensive. The basic tier in Azure will run out of room in the transaction log after about 3 hours of attempting the script.

Msg 9002, Level 17, State 4, Line 19

The transaction log for database 'tempdb' is full due to 'ACTIVE_TRANSACTION' and the holdup lsn is (250:976:12).

It could be modified to use a cursor or a standard table and work in Azure and not fail for resource utilization and be quite efficient.

Unmasking the social security number by attacking the individual pieces happens very quickly, about 5 seconds on the lowest Azure tier (Basic, 5 DTU) for 300 rows. Note that the unmasked data results of this query will vary as the sample data is randomly generated, but the same 7 rows will be leading the result set.

A screenshot of a computer

Description automatically generated with low confidence

Next I will clean up the table I created (note that if you have cancelled one of the previous queries, you may not have executed the right number of REVERT statements and could get an error that the object does not exist (or you do not have proper rights to drop the object. Execute REVERT again a time or two until SELECT SUSER_SNAME(); returns your advanced security principal you are operating as.:

Dates

In this section I will show a few methods of showing the data in a masked date column.

Dates are effortless to unmask, especially if a date dimension is available. A date dimension is a table with one row per date, usually for every row that could possibly show up in a reporting query. It allows you to include the day of the week, month, etc, that you could fetch from system objects, but also custom things like company holidays.

If a date dimension doesn’t already exist in the target database, it can be created with a recursive CTE. Select the primary key column, the date column from the dimension table, then join the target table to the date dimension. The date is unmasked at this point. A similar process can be used for time data.

You wouldn’t want to unmask the entire column as a datetime, but they could be tackled in the same query by unmasking them separately. As in previous examples, small sections are much more efficient. More than 70 thousand rows can be unmasked in a second and with a very small data footprint by targeting the date and time sections separately.

Dates are unmasked very quickly by comparing against a date dimension.by simply joining to the date column (casting any date values with a date part to a date datatype).

A picture containing text, font, screenshot, number

Description automatically generated

The same could be done with the time portion if needed by removing the date part of the column and creating a table with the grain of the data you are interested in by creating a table with a row per second (there are just 86400, and if you needed milliseconds, it could be done with just 8.64 million rows). Finding the seconds since midnight for a date and time value is done using:

Brute Force for Unknown Data Format

The method to unmask entire number columns quickly and efficiently was answered by exploring methods to unmask character columns. The examples above were easily expanded to separate each byte in a string and perform the comparison. To do this with a number, you simply convert the number to a string first. Doing this with numbers is even easier than strings because of a greatly reduced comparison set to be traversed (0-9 and maybe A-Z.).

The next unmasking challenge is to make this a generic process. Masked column information is available in the standard system views as shown above, in the Exploratory Scripts section. If each column is converted and treated as a string, they can be brute forced using a single pattern. This includes a lot of code that you wouldn’t run in a production script but for a script of this nature it’s tolerable. This wouldn’t be deployed as part of a project but used as a demonstration of vulnerabilities or to tune and test audit scripts.

This solution builds on the previous specialized solutions. If you’ve been following along closely, you probably saw this solution sooner than I found it. The trick to making the solution dynamic is splitting each column into a series of characters and attacking each character individually. This can be tested with a single column of known maximum length, preferably a short column. After proving that it can be done, the rest is just simple SQL coding.

To perform a proof-of-concept, a varchar(3) column is used as the target column. A temp table is filled with ASCII values from 1-255 (the rainbow table), each of the 3 characters in the column are separated into individual columns using substring, each substring of the column is joined to the newly created rainbow table, and results are displayed using CONCAT. For a proof of concept, the unmasked values could also just be shown individually, but the CONCAT makes reading easier.

There are a few things to note in this POC. The collation is specified as Latin, Case Sensitive. This eliminates the duplicates from the upper-and lower-case char values that would be generated using the full ASCII set via a Cartesian join. It also preserves the case of the masked data. To simplify the code, the CHARACTERS_CTE could be joined rather than using the CHAR function for each join. The code executes in less than 1 second, so optimizations aren’t needed. But for larger tables and wider columns, every optimization helps.

The above script is a steppingstone to complete, automated unmasking. It is manually configured, including each character in the column.

  A screenshot of a computer

Description automatically generated with low confidence

The above example shows how to unmask almost any column, but it is a little cumbersome due to the manual nature of the script. The table and column were both selected by hand and the query was crafted specifically for that column. The next step to automation is unmasking a column without specifying the length of the column beforehand.

The more automated version took 1 second to run. That is a very small difference but on large tables it would clearly get larger. The automation and extra functions come at a cost. Much of the base code is the same, but the size of the column is determined dynamically, and the columns are joined automatically. The STUFF function with FOR XML will be familiar to any SQL developers looking to automatically concatenate columns. It greatly simplifies the code needed to unmask columns. The primary key for the table, CountryID, is used in the ORDER BY, as is the Number column. The Number column is simply the order for each character. The Number column is also used to generate the CHAR values used for each comparison.

The output for this script is the same as the previous script, but it is fully automated for this table. The table and key columns are specified manually, but the matching happens based on column position. 

A screenshot of a computer

Description automatically generated with low confidence

Automated Brute Force Attack

Creatin a more fully automated dynamic brute force attack builds on standard dynamic SQL and the techniques shown above. This script is more complicated than the previous scripts, so each section is broken down. Variables are set at the top of the script for each run. This includes the table schema, table name, if the script should run or just print to screen, if NOLOCK should be used, if only the masked column names should be shown, and any columns that should be excluded from the process. If the table that is specified does not exist, is in a different database, or doesn’t contain masked columns, the script will exit. Internal variables are also created in this section.

Masked columns are identified next, including names and sizes. Text columns are differentiated from numeric columns to potentially speed the process and the columns specified for exclusion are left out of the result set. If any of the columns contain text characters, the full ASCII character set is used. Otherwise only 0-9, period and comma are included.

Primary keys are identified in the next section. This is needed to join the unmasked columns back to the target table. It is also needed to order the unmasked characters in each column. The primary keys are then used to create the join statements and order statements. These statements are used later in the dynamic stuff commands.

The next section is where everything is organized, and the SQL statement is created. Numbers and character CTEs are defined first. They are used multiple times in the script, for both the column CTEs and the stuff commands.

A separate CTE is created for each masked column, named ColumnName_CTE. This would be a problem if the data team allowed crazy characters such as spaces in the column name. If that’s the case, a clean string function could be added or the replace function could be used to target specific issues. The numbers CTE is first used to parse each character in the column, then the characters CTE is used to match each masked character to the unmasked character list.

At this point, the data is unmasked. The remainder of the script is focused on presentation. The server, database, schema, and table are listed. The base information is followed by the default columns, including the masked data. Finally, each masked column CTE is used in a stuff statement to present the unmasked column. The stuff section is identical to the proof of concept, it’s just dynamically repeated for each column.

The results of this query is un maskeddata.

 

The script demonstrates the ease with which data can be unmasked. Unmasking data requires direct access in addition to technical knowledge, but it can be unmasked. The generic approach to unmasking isn’t as efficient as specialized strategies but it requires virtually no exploratory work and is fast enough for a one-time procedure. If a bad actor is going to maliciously unmask a table it likely isn’t something they would perform multiple times. The data would be unmasked and saved off to a file or table. This makes efficiency less important and it also makes it more difficult to detect an attack.

Summary

SQL Server Dynamic Data Masking is potentially susceptible to side channel attacks. This will likely limit the scenarios where you will want to allow direct access to data. Additional implications of this and mitigations are explored in the final section of this series.