Data masking is a technique of hiding sensitive data from certain users who should not be able to view or access the actual values. This post will look at the reasons why you might choose to implement data masking and discuss both dynamic and static data masking techniques, and when it is appropriate to use each of these.
Why use Data Masking?
Data masking is a technique used to prevent the viewing of certain data by unauthorized individuals. A very common scenario for data masking is limiting the data a customer service representative can view. A specific example is for organizations that process credit cards. Often a customer service employee can only see the last four digits of the card, which are used to verify the identity of the customer on the phone.
Data masking is a way of reducing the attack surface area of your organization by limiting the places in which data might be unnecessarily exposed. This reduces the risk of potential problems and possible repercussions from outside organizations. These could be fines from regulatory organizations, loss of certification such as ISO9000, or churn from customers that are concerned over data handling practices.
While usually associated with PII (Personally Identifiable Information), there may be other reasons why an organization chooses to mask other data from employees. Salary data and names might be masked from developers that are working on a finance application to prevent them from knowing how much compensation another individual receives.
This masking can take place in both production and non-production environments, such as development and test database. It could also be used in training or other purposes where the exposure of data is unnecessary.
There are two main techniques used to mask data: obfuscation and redaction.
Obfuscation is the changing of data in a way that alters the meaning of it. The data still looks “real” or valid, but it is no longer the original value. This can take place with the full data value or a partial value.
A partial obfuscation of data would be something like the following change. The original value is on the left and the changed one on the right.
Steve Jones -> Sam Jones
In this case, we have changed the first name of “Steve” to “Sam” and left the last name unchanged. While this is a minor change in a data set, it reduces the chance that Steve will be identified as the individual. This is a technique that might be used to hide data in other ways, such as changing the street name in an address or the month in a date.
We can also perform a full obfuscation of the data, as shown here:
Steve Jones -> Bob Smith
In this case, we have changed both the first and last names to something completely different. This is often the preferred technique when data is needed that looks real, but we want to provide plenty of protection against accidental disclosure of the actual values.
This could be used in numerous ways where partial changes are not possible. An example would be a stock portfolio. While changing part of a stock symbol would result in an invalid value, changing to a completely different, but valid, symbol would allow development groups to use software that performs lookups without knowing the original portfolio structure in production databases.
Redaction is a technique to remove some information from original value, but in a way that the reader knows the values is changed. Here is an example:
Steve Jones -> Steve JXXX
In this case, we have changed the ending of “Jones” to “JXXX”. This shows that information is removed, but we do not know how much. There are 3 Xs, but four letters from the name were removed. With this type of redaction, the name could be “Jones”, “Johnson”, “Jobs”, or any value that starts with a J.
We could also perform a redaction in a way that hides the data but keeps the correct number of characters. The credit card scenario is often a place where we redact most of the information, but not all. For example, many applications will show this to a customer service representative:
4111 1111 1111 1111 -> xxxx xxxx xxxx 1111</code?>
Here we keep the 16-digit shape of a credit card number but hide the first 12 digits. This is often used with US Social Security numbers, phone numbers, and many other cases where the format and number of digits are important for both software checksums and human comprehension of the use of the data.
Choosing a Data Masking Technique
Both obfuscation and redaction have their places in protecting data, but which one should you choose? The answer is that you can use either one, depending on the situation in which you are faced.
For production data, where we may have users authorized to see some data, but not all of it, redaction is a better choice. This ensures that users are aware that some data is masked, and they are not authorized to see it. They also know that they can trust the data that is visible as accurate.
For development or test environments, either technique can work well. Redaction is often useful when you want the developer to be aware that the data is not accurate. For example, changing email addresses to email@example.com is a good way to ensure developers and testers are aware this is not real-life data. The extra Xs help call attention to the fact that this is a fake email address.
Obfuscation is useful in places where people working with data want this to look as close as possible to real data. Often business users that evaluate new features and functions prefer to see data that is familiar. Using obfuscated values is helpful in allowing users to focus on the software and not get distracted by redaction characters.
The Challenges of Dynamic Data Masking
In real time scenarios, when we need to mask data as it is accessed, dynamic data masking is often used. In this situation, the data is changed between the time that it is read from a storage device and when a user can view it. The original data is not changed, and it can still be accessed by users with the privileges to do so.
The data can be changed in several ways. For example, a function or stored procedure can assess a user’s rights and then use a
REPLACE function to change the data read from disk. The database platform may also have features built into it, such as Dynamic Data Masking in Microsoft SQL Server or Oracle Data Masking. There are also third-party products that can insert themselves into the query stream as a proxy or a filter driver.
In these situations, it is up to someone to configure the data masking to be comprehensive and appropriate based on the requirements of the situation. Usually, some combination of redaction formulas and platform security is used to determine how to redact data. This works well in production environments, though administrators must be careful to ensure that all data is masked appropriately.
However, there are a few issues with using dynamic data masking in development and test environments. First, as entities evolve and are added to schemas, new data is added. It can be a challenge to ensure that production data is protected as the development databases are refreshed.
The other issue is that often developers have administrative privileges in development databases. They can use various query tools to access data and evaluate whether their code is working correctly, which also means they can query the original data, which defeats the purpose of masking.
There is also the challenge of maintaining any third-party software with upgrades and patches over time and ensuring this is installed on all client computers or servers.
Static Data Masking
Static data masking differs from dynamic masking in that the original data is changed in storage. Just as a user can update a value in a row, the static data masking solution changes all values to prevent anyone from querying the data for the original value.
Since the original data is changed, this technique is not suitable for production systems, as that data must be preserved. This technique is best suited to development and test environments where less privileged users, and often less security, are the clients of the database system.
Changing data in a way that suits a relational platform, with data repeated in various entities, is complex and difficult. If the end result is to be read by humans, then we would prefer to have tools that change “Steve Jones” to “Bob Smith” rather than “Customer001”. But this is difficult to do with homegrown scripts that are built and maintained by developers and system administrators.
Static data masking has been used by system administrators for years in an ad hoc fashion. Once a copy of a production database is made for development purposes, custom scripts are often run to change data through obfuscation techniques, lowering the risk of using production data in a less secure environment. However, this takes time and allows for human error as it is another task that needs to be executed after a database is refreshed. There is also the challenge of knowledge transfer across staff, as these scripts can be very complex and confusing to maintain.
DevOps with Data Masker for SQL Server and SQL Data Catalog
Modern software development proceeds with a DevOps approach to tasks, where we ensure a process is followed, automation is involved, and there are regular reviews and improvements of the process itself. Static data masking fits easily into this flow when the appropriate tooling is used to perform the masking as a part of the software development process.
Redgate’s Data Masker, along with the SQL Data Catalog, help ease the challenges of masking data and knowing which data to mask. Each of these products helps ensure that a process is followed which reduces the chance of human error that can lead to accidental data disclosure.
Data Masker has a project format that allows knowledge transfer between developers and administrators as each step in the masking plan is documented, described, and clear. As a plan is built, it can be applied automatically as part of a database refresh process and run with multiple threads to reduce the time needed to update development databases.
Data Catalog allows users to classify the columns as they are added in development and deployed to production systems. This information can be used to automatically generate or update a masking plan that Data Masker will use to apply the changes in a static data masking process. You can read more about these two tools in Continuous Data Protection with SQL Data Catalog 2.0 and Data Masking.
Both static and dynamic data masking have their place in the data protection area. Most organizations can make use of these two techniques to ensure they do not accidentally disclose data to unauthorized users.
Dynamic data masking is better suited to production use, though care is needed to ensure this is comprehensive in scope and all PII data is protected. Static data masking works better for development and test environments but can be cumbersome to implement. Some tooling is needed to ensure that this technique meets your needs across time.
The need for better data security and more appropriate data handling grows all the time as certification bodies and regulatory agencies require organization to adhere to new rules. Every organization ought to pay attention to how they appropriately secure their data, and how they protect it in development environments to prevent a data breach from occurring.
You can find out more about both dynamic and static masking, and see me demonstrate the techniques I’ve outlined in this on-demand webinar, Why you need to go beyond dynamic data masking.
Was this article helpful?