Unmasking SQL Server Dynamic Data Masking, Part 3, Security Concerns

This is part of a series on Dynamic Data Masking by Ben Johnston. For the rest of the series, click here

This is the third part of a series on SQL Server Dynamic Data Masking. The first part in the series was a brief introduction to dynamic data masking, completing solutions, and use cases. The second part covered setting up masking and some examples. This part starts exploring side channel attacks against dynamic data masking.

This article will show that are plenty of security concerns with using Dynamic Data Masking as a general-purpose security tool, but don’t let it completely keep you from using it. The point of the article is to show you the weakness of the tool, particularly for ad-hoc uses with users who might be keen to snoop around the data in ways you might not expect.

There are a lot of ways that users can decode data that is only dynamically masked, and in the next two entries in this series I want to show some of those methods, not to show you HOW to decode masked data, but for two reasons.

  1. To show you why dynamic data masking should not be used to mask very personal data that might be stolen, especially when you do not have tight control over the users.
  2. To help you notice if your users might be doing some of these things.

Side Channel Attacks

When SQL Server 2016 was announced, dynamic data masking was an exciting part of the offering from a security and business intelligence perspective. Allowing end users to access data while masking critical data is a boon. It eliminates the need for custom solutions and makes it possible to expose useful data in a safe manner. The documentation for this feature had a few lines, warnings really, about the possibility of side channel attacks. I had heard of side channel attacks in regard to the operating system and other products but didn’t exactly understand how it applied to masking. The basic examples made sense, but I needed to understand the limits of this to evaluate it properly and see how it might fit into existing architectures.

The real question was, how do these side channel attacks work and what is the scope of their vulnerability? Examples of these attacks are a little sparse. I’m sure some of the paucity of information is explained by security by obfuscation. The best way to find out the vulnerabilities was to carefully test them. The basic test bed is as follows.

  • An unaltered version of WideWorldImporters
  • An account with dbo access
  • A test account with SELECT only access
  • Individual columns were masked

With the test bed created the ability to derive clear text data from the masked data was analyzed.

Exploratory Scripts

Unmasking data starts with finding masked columns. This is a simple flag in the standard system views. The Microsoft documentation also shows a masked column system view (sys.masked_columns), but you have fewer filtering options and it can be helpful to view both masked and unmasked columns together. A likely starting point on a production system is finding a column that is masked during normal data exploration and operations. The standard version of the sample database used, WideWorldImporters, has 5 columns masked in the Purchasing.Suppliers table and the same columns in the associated temporal table masked. If your version of the database starts without these columns masked, you have a different version of the database. After running the setup script to mask columns (which you can find at the end of Part 1 here), there are a total of 38 columns masked. After you restore WorldWideImporters, run both the security and alter scripts included.

The exploratory script shows the schema, table and column that is masked. It also shows details about the column, including data type, and the function used to mask the data. This is very useful information for administrators, developers, support, and also bad actors. 

Partial Unmasking

Unmasking a portion of the data in a column, partial unmasking, can be done easily. This is a function of dynamic data masking, expected functionality, so no surprise here. Without the ability to partially unmask data (categorize data), masking becomes much less useful in a production environment. The only trick to unmasking this way is organizing it in a fashion that makes it more useful or interesting. Creating and assigning artificial categories is the most obvious way to look at data in this way. Broad categories, such as a min / max with an interval of 1000, can be very useful for quick analysis and still expose data. These categories can effectively negate the entire purpose of masking. It would be no help for an ID column, but it would be very damaging for things like salary or bank account balances. This is a good reminder to analyze your business need carefully before exposing data, even with masking.

The output shows the primary key, foreign key columns, and the two masked columns, in addition to their values as ranges. It is safe to assume that the data in the bottom range, 0-1000 is likely 0. This could be validated by using a WHERE clause and only returning rows WHERE OutstandingBalance = 0.

 

This works well with numeric data. The next test is character data. Since functions like substrings and wildcard searches are operational with masked data, grouping data is straight forward.

 

Output shows the key columns, the masked columns, and the partially unmasked columns. Remember that if specific values are wanted, they can be targeted by using the initial partial unmasking to narrow down choices.

 

Using this tactic would allow a bad actor to start chipping away at specific values. If you were interested in a particular client name or individual, this technique allows masked values to be targeted rather quickly. If a particular client name was known, it could be specified in a script to find the row even faster. If the data is completely unknown when you begin your process, it is an effective, but more prolonged, technique.

The targeted record will be returned, even if the WHERE criteria is for the masked column. This is intentional to the implementation of dynamic data masking, and required for it to work correctly, even if it seems like a potential security gap. 

Automated Partial Unmasking

Partial unmasking categories are relatively easy to create. The part that requires thought is deciding where to create the intervals. If you understand the data it is easy. If the data is unknown, it requires some initial exploration. This is also a good demonstration of layered security. This table comes with row level security (RLS) enabled in the sample database. RLS adds another layer of security to configured tables, restricting access to specified rows. There are several ways RLS can be implemented, and in this case the database uses the SalesTerritory table for a city to determine access levels. If the user is in the associated database user role, they are able to retrieve those rows. The user for this example was given partial access by adding MaskedReader to the [External Sales] role, so only those rows are returned. Functions and comparison operators are used on the masked column. In the user creation script, the following command grants access via RLS. Refer to the Filter Predicate [Application].[DetermineCustomerAccess] in the database for details.

The Azure, basic version, of the database does not include RLS and the following script will fail. The full version of the database includes RLS and the role is available.

The histogram buckets are returned very quickly. The number and range of buckets can be easily modified for the specific data and scenario.

 

Finding number ranges automatically

Finding number ranges for a histogram is as easy as joining to an unmasked table. A standard numbers table works well. Specialized numbers tables can be created to tackle known or unusual columns.

Histogram bucket generation can be automated using a standard or specialized number tables. If you have a reasonable guess for the max or the ranges, the composition of the number table can be tailored to the target table. If the masked column contains latitude and longitudes, a standard number table would work for initial exploration. Multiplying the number by 10000 to get the desired accuracy would still work with a standard number table and would identify objects to within 11 meters (according to the GIS wiki, a decimal accurate to 4 decimal places will achieve this result).

The automatically generated ranges are shown below, in addition to the key column and the table name. The output can be analyzed exactly as in the previous example.

This shows how masked columns are automatically added to the output and put into histogram buckets with a range of 1000.

Finding text ranges automatically

Show number and text script together, using CONVERT on all masked columns

Masked columns are included in the output. All column types are treated as text and have been converted in the SQL, then the start of the column is matched using substring statements.

 

Complete unmasking of larger values

The next challenge is exposing the complete contents of a column. This can be a daunting task that takes quite a long time.

The first, obvious choice might be to brute force a comparison. A rainbow table solution. A rainbow table works by creating a table with all possible values. The masked column is then compared against the fully populated rainbow table using a standard join. This is a parent / child relationship as used everywhere in SQL databases. For small numbers, this is a great strategy due to speed of setup and execution time for the query.

This method has limitations since you need to match the entire number, including the scale if it is a decimal or money column. A standard numbers table works well for matching integers if the target values are within your numbers table. The number column is joined to the masked column in the target table. Complete unmasking in this limited scenario is simple and fast. The previous two examples have versions of numbers tables and a similar technique can be used to uncover small columns if a physical numbers table is not available and you can’t create a new table.

Output showing the query output. Decimals with a scale equal to zero return. The sample script also shows the potential match that would catch many more numbers by converting all TransactionAmount to integers.

 

 

The above solution works for integers and also works on decimal columns where the scale is zero due to the implicit conversion of decimal to integer in the query plan. It could be expanded to include decimals of a specific scale as well. As the solution is expanded and numbers are added it will take longer to match rows, the numbers table will be larger, and your investigations will be more obvious. The query also shows that columns that don’t match, likely due to a scale other than zero or an out-of-range number, don’t join and are returned as NULL.

Output shows unmasked decimals with a known scale.

There are a few clear ways to unmask decimals with a specific scale. The numbers table can be built out with all possible decimal values, the masked column can be converted to an integer first and then compared to the numbers table, or the value from the numbers table can be multiplied by a tenth, hundredth or thousandth to match the scale of the masked column.

Creating a full numbers table with all the possible decimal values can take significantly longer to create, takes more disk or memory space, and will make the overall process longer. The actual time will vary depending on the numbers generated.

Converting the masked column to an integer for the comparison in the join is a non-sargable operation and will result in an index scan at the least and possibly a table scan. Excluding the required scan for the conversion, this is a fast operation and easy to implement.

The method I used for the demonstration is using a standard numbers table and multiplying the results by .01 to get decimals to the hundredths position. It is an effective technique for finding an exact match for decimal columns with a minor performance cost, but the precision is limited based on the size of the integers in your numbers table. For most applications, knowing the least significant digits of a decimal doesn’t add value from the perspective of a bad actor. If someone is trying to unmask the value of a current balance in a bank account, they won’t care if the balance is 1000.01 or 1000.99, 1000 is probably close enough.

 The above queries are pure brute-force methods for unmasking columns. The same thing can be done with strings by adding all possible characters to the comparison set. Numbers are faster to unmask due to the limited character set to be traversed and the smaller size of the rainbow table needed to perform the unmasking. Character unmasking using this technique is limited and has a much higher performance cost. It quickly becomes unwieldy and unusable, which is desirable from a security perspective.

 

Specialized scripts for known data formats

Some specialized formats can be unmasked very quickly. This is similar to old NTLMv2 password attacks that broke the password hash into 2 sections. If the format of the column is fixed length with multiple sections, each section can be attacked individually, greatly decreasing the potential time to unmask.

An IP address is a single 32 bit number, but it is represented as 4 bytes. Each of those bytes are generally represented as 0-255 with some exceptions. It would take longer to code for the exceptions instead of just leaving them in the solution. Any invalid combinations will just naturally be excluded when they aren’t matched during the unmasking process.

In the next blog, I will continue along this path and show how you can look for data in specific formats, along with how to just do a brute force search through unknown data.