Deterministic Data Masking in Redgate Test Data Manager
Redgate Test Data Manager is designed to provide secure, anonymized, and representative copies of production databases. This article focuses on the role of deterministic data masking, transforming PII consistently across tables, even when no logical relationship exists between them, and explains how it works.
Purpose of data masking
The role of data anonymization has become more important for organizations due to the rightful reclamation of individual rights through Data Subject Access Requests and the development of laws and regulations such as the General Data Protection Regulation (GDPR).
Redgate Test Data Manager is Redgate’s answer to this problem for any data that resides in an RDBMS. It allows organizations to make representative copies of their production databases available for development, testing, and other purposes, while safeguarding any PII.
The Anonymize component in Redgate Test Data Manager provides a ‘classify’ command that will help organizations automatically identify and categorize Personally Identifiable Information (PII) and a ‘mask’ command that will then automatically obfuscate the classified information. Users can customize the masking to their requirements.
Masking data while preserving data integrity and consistency
Redgate Test Data Manager will mask all sensitive data that it identifies in each table, while seeking to maintain relational integrity and data consistency in the database. This is relatively straightforward in a relational database that has been normalized to remove data redundancies, and with well-defined Primary and Foreign Key relationships, and other constraints, to enforce referential integrity and data consistency.
However, in many databases, tables with no defined relationship between them will sometimes contain duplicate data (e.g., due to denormalization). In such cases, a standard data masking process will introduce update anomalies, where the same PII data in two places is updated to different masked values. This will result in a masked database where the data is inconsistent and less useful for testing, analysis or reporting. To avoid this, Redgate Test Data Manager (RG TDM) supports deterministic data masking.
The following sections will explain how RG TDM performs both non-deterministic and deterministic masking operations, depending on requirements:
- Standard (non-deterministic) masking – masks all PII but produces a ‘random’ masked value for every input value.
- Deterministic data masking – uses a temporary ‘salt’ (sometimes called a ‘seed’) to consistently mask the same PII value to the same output, within a single masking operation.
- Persistent deterministic masking – uses the same salt across repeated masking operations to ensure the same value is consistently masked every time, even across different databases.
Standard (non-deterministic) masking
In ‘standard’ data masking, the masking command only needs to know the unique identifiers in a table to identify which row to update and will mask each column that contains PII with a replacement value of the correct type. For example, let’s say we have a table named Teachers that looks like this:
teacher_id | name | email |
1 | John | john@redgate.com |
2 | Jane | jane@postgres.com |
3 | Simon | simon@mysql.com |
For the name column, the masking process will generate update statements akin to:
UPDATE teachers SET name = 'generated value' WHERE teacher_id = 1
In this example, the masking process ignores ‘John’ and replaces it with ‘generated value’, which can either be randomly picked from a predefined dataset or generated from a given pattern. For example, let’s say the replacement values for the name column are selected from a predefined dataset of common names:
1 | Adrian |
2 | Beca |
3 | Daniel |
In this very simplified example, a Pseudo Random Number Generator (PRNG) would generate a number between 1 and 3 and the value corresponding to that number would be used to replace ‘John’. A similar process is used to mask the email column, and the resulting masked table might look like this:
teacher_id | name | email |
1 | Daniel | daniel@example.org |
2 | Beca | beca@reinserted.net |
3 | Adrian | adrian@mono.com |
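The standard masking of the name column can be sketched in Python against an in-memory SQLite copy of the Teachers table. This is only an illustration of the idea, not how Redgate Test Data Manager is implemented: the `NAMES` list, the `build_demo_db` and `mask_names` helpers, and the use of SQLite are all assumptions made for the example.

```python
import random
import sqlite3

# Hypothetical replacement dataset, mirroring the three names used in the article.
NAMES = ["Adrian", "Beca", "Daniel"]


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory copy of the article's Teachers table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE teachers (teacher_id INTEGER PRIMARY KEY, name TEXT, email TEXT)"
    )
    conn.executemany(
        "INSERT INTO teachers VALUES (?, ?, ?)",
        [
            (1, "John", "john@redgate.com"),
            (2, "Jane", "jane@postgres.com"),
            (3, "Simon", "simon@mysql.com"),
        ],
    )
    return conn


def mask_names(conn: sqlite3.Connection) -> None:
    """Overwrite every name with a value picked at random from NAMES."""
    ids = [row[0] for row in conn.execute("SELECT teacher_id FROM teachers")]
    for teacher_id in ids:
        # The original name is never read: the key alone identifies the row.
        conn.execute(
            "UPDATE teachers SET name = ? WHERE teacher_id = ?",
            (random.choice(NAMES), teacher_id),
        )


conn = build_demo_db()
mask_names(conn)
masked = conn.execute(
    "SELECT teacher_id, name FROM teachers ORDER BY teacher_id"
).fetchall()
print(masked)  # names vary from run to run
```

Because each replacement is chosen independently of the original value, running the script twice will usually give different names for the same teacher.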
However, what if the email column was duplicated in another table, such as a Lessons table? Whereas john@redgate.com has become daniel@example.org in the Teachers table, that data might be masked to adrian@greenroom.org in the Lessons table, creating a database with data that is inconsistent and unreliable.
For example, any queries needing to correlate data from different tables, based on masked fields where no defined relationship exists, would produce incorrect results. Conversely, reliable and consistent test data would lead to improved reproducibility, for example allowing QAs and Devs to recreate and fix bugs and other issues quickly. Also, by removing unpredictable variation in the output of masking, the process becomes much easier to maintain and automate.
Deterministic data masking
In databases that contain data redundancies, for whatever reason, deterministic masking ensures that the same input always results in the same masked value, maintaining consistency across tables, even when no logical relationship exists between the tables.
This is important where, for example, a column in a Customers table contains PII that must be masked, but that same data is also used to track customer service interactions in a different part of the system. By ensuring the data is masked consistently, the data continues to provide meaningful results even when obfuscated.
How deterministic masking works
To work with determinism, the masking command can no longer simply ‘disregard’ the original values in a PII column and replace them with values generated by a PRNG. Instead, the mask command will generate, or is provided with, a salt value that predefines how the original values are transformed to masked values.
It works rather like salting in cryptography. If given a salt like “salt123” (in practice it is a long, randomized string), the mask command will combine every value to be masked with this salt, so john@red-gate.com becomes salt123john@red-gate.com and alice@microsoft.com becomes salt123alice@microsoft.com, and so on.
The original value combined with the salt is fed into a hash function, producing a hash value that is then fed, as the seed, into a cryptographically secure pseudo-random number generator (PRNG). In a PRNG, the seed determines the sequence of generated pseudo-random numbers. It can be understood as the starting point of the algorithm.
Using the same salt allows the PRNG to reproduce the same sequence of generated numbers, making the process deterministic for a given data value. So, when we run masking, the value john@red-gate.com will always be masked as adrian@example.org.
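The pipeline described above (salt plus value, then a hash, then a seeded PRNG) can be sketched in Python. This is a minimal illustration rather than Redgate's implementation: SHA-256, the `mask_value` helper and the three-entry `EMAILS` dataset are assumptions made for the example, and Python's `random.Random` (which is not cryptographically secure) stands in for the secure PRNG the article describes.

```python
import hashlib
import random
from typing import Sequence

# Hypothetical replacement dataset; a real one would be far larger.
EMAILS = ["adrian@example.org", "beca@example.org", "daniel@example.org"]


def mask_value(value: str, salt: str, dataset: Sequence[str]) -> str:
    """Map value to a dataset entry, deterministically for a given salt."""
    # 1. Combine the salt with the original value, as in "salt123john@...".
    salted = (salt + value).encode("utf-8")
    # 2. Hash the salted value (SHA-256 is an assumption of this sketch).
    digest = hashlib.sha256(salted).digest()
    # 3. Seed a PRNG with the hash, so the same input replays the same sequence.
    rng = random.Random(int.from_bytes(digest, "big"))
    # 4. Use the first number of that sequence to pick the replacement.
    return rng.choice(dataset)


first = mask_value("john@red-gate.com", "salt123", EMAILS)
second = mask_value("john@red-gate.com", "salt123", EMAILS)
print(first == second)  # True: same value and same salt give the same result
```

Unlike standard masking, the original value is now an input to the process, which is what allows two unrelated tables holding the same email to receive the same replacement.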
In this example, we do not explicitly provide the salt, so the mask command uses a PRNG to randomly generate a single-use salt that is discarded after each masking run. So when we run masking again, for example to refresh development instances with the latest production version, a different salt will be used and we’ll get different, but still consistent, masked values.
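The effect of a single-use salt can be sketched as follows, assuming a hypothetical `run_masking` helper that generates a fresh salt with Python's `secrets` module, masks every table with it, and then lets the salt go out of scope. The dataset, the table contents and the SHA-256 based `mask_value` function are illustrative assumptions, not the product's internals.

```python
import hashlib
import random
import secrets
from typing import Dict, List

# Hypothetical replacement dataset of 1,000 synthetic addresses.
DATASET = [f"user{i}@example.org" for i in range(1000)]


def mask_value(value: str, salt: str) -> str:
    """Deterministically map value to a DATASET entry for the given salt."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest, "big")).choice(DATASET)


def run_masking(tables: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Mask every table in one run, using a salt that exists only for this run."""
    salt = secrets.token_hex(32)  # single-use salt, discarded when we return
    return {
        name: [mask_value(value, salt) for value in values]
        for name, values in tables.items()
    }


# The same emails appear in two tables that share no key relationship.
tables = {
    "teachers": ["john@redgate.com", "jane@postgres.com"],
    "lessons": ["jane@postgres.com", "john@redgate.com"],
}

first_run = run_masking(tables)
second_run = run_masking(tables)

# Within a run, John's email is masked identically in both tables.
print(first_run["teachers"][0] == first_run["lessons"][1])  # True
```

Comparing `first_run` with `second_run` will almost always show different replacements, because each run drew its own salt.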
Persistent, deterministic masking
Every time we need to refresh development databases with the latest production data, we will need to re-run the masking process. New data will have been added to production, and some existing data may have been altered, though the majority is likely unchanged. However, if we use a generated salt each time, then the masked data will contain entirely different values from the previous run.
Sometimes, we need the existing data to be masked in the same way every time we sanitize it. In other words, we need john@red-gate.com to be masked as adrian@example.org every time we run the masking process, and perhaps even when running the same masking process on a different database.
Some types of integration test, for example, require a standard input and must check the output against the correct result. When testing a purchase process, for instance, the test must verify that every part of the process works as defined by the business, and that all appropriate tables are updated as expected. If the input data suddenly changes, the process will come up with a different result.
To achieve this form of persistent determinism, we need to securely store the salt and provide it to every masking run, as follows:
anonymize mask --deterministic-salt "my-secret-salt"
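To illustrate why a stored salt gives persistent results, the sketch below masks two successive "refreshes" of the same data with one fixed salt. The `mask_value` helper, the SHA-256 hash, the dataset and the sample rows are assumptions made for this example; only the placeholder salt string comes from the command above.

```python
import hashlib
import random

# Hypothetical replacement dataset of 1,000 synthetic addresses.
DATASET = [f"user{i}@example.org" for i in range(1000)]

# Placeholder taken from the article; a real salt would be a long secret string.
SALT = "my-secret-salt"


def mask_value(value: str, salt: str) -> str:
    """Deterministically map value to a DATASET entry for the given salt."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).digest()
    return random.Random(int.from_bytes(digest, "big")).choice(DATASET)


# First refresh of the development database.
first_refresh = ["john@red-gate.com", "alice@microsoft.com"]
# A later refresh: production has gained one new row in the meantime.
later_refresh = ["john@red-gate.com", "alice@microsoft.com", "new.user@example.com"]

masked_first = [mask_value(v, SALT) for v in first_refresh]
masked_later = [mask_value(v, SALT) for v in later_refresh]

# Rows that existed in both refreshes keep exactly the same masked values.
print(masked_later[:2] == masked_first)  # True
```

The same holds if the second list came from a different database: as long as the salt and the original value match, so does the masked value.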
Making deterministic masking secure
Without the salt, deterministic masking would be insecure because if you could work out the masking algorithm, you’d be able to reverse masked values back to the original values.
The salt ensures that even if an attacker has access to the masked data, they cannot easily uncover the original data without knowing the salt. Any PRNG used to generate these salts must be strong enough to resist attempts to predict or reverse-engineer its sequence of numbers. In Redgate Test Data Manager, these salts are always generated using a cryptographically secure pseudo-random number generator that offers a very high level of unpredictability and provides forward secrecy.
Of course, if you are storing and reusing salts, then it is crucial to treat them as sensitive secrets and store them securely, such as in a key vault. See the Security Considerations section of the documentation for more information.
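As a rough sketch of this practice, the Python below generates a long random salt once and, at masking time, reads it from the environment rather than from source code. The `MASKING_SALT` variable name and both helper functions are hypothetical; in a real pipeline the environment variable would typically be populated from a key vault or secret store by the CI system.

```python
import os
import secrets
from typing import Mapping

# Hypothetical environment variable name, chosen for this example only.
SALT_ENV_VAR = "MASKING_SALT"


def generate_salt(num_bytes: int = 32) -> str:
    """Create a new random salt using the operating system's secure RNG."""
    return secrets.token_urlsafe(num_bytes)


def load_salt(env: Mapping[str, str] = os.environ) -> str:
    """Fetch the stored salt, failing loudly instead of masking without it."""
    salt = env.get(SALT_ENV_VAR)
    if not salt:
        raise RuntimeError(
            f"{SALT_ENV_VAR} is not set; refusing to fall back to a default salt"
        )
    return salt
```

Failing when the salt is missing is a deliberate choice in this sketch: silently substituting a default or freshly generated salt would break the consistency between masking runs that persistent determinism is meant to provide.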
Conclusion
While anonymizing data can protect the privacy of individuals, it is more valuable when the anonymization process reliably and securely produces a predictable set of outcomes, allowing organizations to safely harness the potential of their data. That is where deterministic masking comes into play.
If you’d like to know more about how Redgate can help, please check out our test data management solution, Redgate Test Data Manager, or get in touch!