- What is data masking?
- Why is data masked?
- Assuming I know what data needs disguising, how do I perform data masking?
- What are the risks of masking data?
- Why would I ever need to mask data?
- Is masking an effective way of allowing us to use real data?
- Can’t you just tell me what data is sensitive, so I can mask it out?
- Why didn’t previous generations of application developers have this problem?
1. What is data masking?
Data masking is a term that covers a range of methods of obscuring sensitive parts of the original data that is held in databases. These methods vary according to the type of data, and the purpose for which the masked data is intended.
The brute-force technique for masking overlays the sensitive data with a masking character such as an asterisk (*). A refinement of this is ‘scrambling’. This approach is used to mask a string value sufficiently to make it impossible to recognize the original data. This is fine for some purposes such as pure scalability testing, but won’t help other requirements.
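As a rough illustration of these two approaches, here is a minimal Python sketch. The function names `mask_with_char` and `scramble` are invented for this example, and the sketch assumes that 'scrambling' means replacing each letter or digit with a random character of the same kind; other tools may define it differently.

```python
import random


def mask_with_char(value: str, mask_char: str = "*", keep_last: int = 0) -> str:
    """Overlay a string with a masking character.

    keep_last optionally leaves the final few characters readable
    (an assumption for illustration, not something the text requires).
    """
    if keep_last <= 0:
        return mask_char * len(value)
    hidden = max(len(value) - keep_last, 0)
    return mask_char * hidden + value[hidden:]


def scramble(value: str, rng: random.Random) -> str:
    """Replace every letter or digit with a random one of the same class.

    Length, case pattern, spacing and punctuation are kept, so the result
    has the same shape as the original but none of its content.
    """
    out = []
    for ch in value:
        if ch.isdigit():
            out.append(rng.choice("0123456789"))
        elif ch.isalpha():
            pool = "ABCDEFGHIJKLMNOPQRSTUVWXYZ" if ch.isupper() else "abcdefghijklmnopqrstuvwxyz"
            out.append(rng.choice(pool))
        else:
            out.append(ch)  # keep separators such as '-' or ' '
    return "".join(out)


print(mask_with_char("4111111111111111", keep_last=4))  # ************1111
print(scramble("AB-12 cd", random.Random()))
```

Either result keeps the length of the original value, which is why this kind of data can still be adequate for scalability testing.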
Where the obfuscated data needs to look like real live data, you can replace the values that need to be masked with values drawn from a list or generated by a reverse regex. This is fine for training staff on applications and for pseudonymizing data that is destined for analysis, but it is insufficient for testing or development work because you lose some of the important characteristics of the data, such as its distribution, that determine the best way of indexing it. To get around this problem of distribution, you can shuffle the existing data, such as foreign keys, within the columns, doing so by individual columns or by groups of columns. The latter way is preferred where the distribution of data, or the variation in values, needs to be maintained. However, testing may require more verisimilitude than this because databases have validation processes. These need to be tested with valid data that looks real: postcodes need to reference the same area as the address, credit card numbers need to be valid, and gender may need to reflect correct values. The techniques for doing this can quickly become complicated, particularly when table data has row-internal synchronization rules due to calculated columns, but they can be very effective.
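The shuffling technique can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the rows are held as dictionaries in memory, and `shuffle_columns`, the column names and the sample values are all hypothetical. Shuffling a group of columns together keeps every value, and therefore the distribution of each column, while breaking the link to the person in the row; values inside the group (here city and postcode) stay paired with each other.

```python
import random


def shuffle_columns(rows, columns, rng):
    """Return new rows in which the given group of columns is shuffled across rows.

    The columns in the group move together, so relationships inside the
    group survive while the link to the rest of the row is broken.
    """
    # Take the group of values out of every row as one unit.
    groups = [tuple(row[c] for c in columns) for row in rows]
    rng.shuffle(groups)
    shuffled = []
    for row, group in zip(rows, groups):
        new_row = dict(row)  # copy, so the input rows are left untouched
        new_row.update(zip(columns, group))
        shuffled.append(new_row)
    return shuffled


# Invented sample data, for illustration only.
customers = [
    {"name": "Customer A", "city": "Norwich", "postcode": "NR1"},
    {"name": "Customer B", "city": "Leeds", "postcode": "LS1"},
    {"name": "Customer C", "city": "Bristol", "postcode": "BS1"},
]
for row in shuffle_columns(customers, ["city", "postcode"], random.Random()):
    print(row)
```

Passing a single column name shuffles that column on its own, which preserves its distribution but not its relationship to any other column.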
Numeric data and dates can be subjected to ‘numeric variance’ manipulation, where values are changed in ways that conform to the original distribution of the data. Long text can be manipulated by the ‘parodist’ technique, which is based on Markov chains.
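One simple way to picture numeric variance is to nudge each value by a small random amount. The Python sketch below assumes a uniform shift of up to 10% for numbers and up to 30 days for dates; these limits, and the names `vary_number` and `vary_date`, are assumptions chosen for illustration. A small bounded shift keeps the overall shape of the distribution roughly intact, though it is not a guarantee of statistical equivalence.

```python
import random
from datetime import date, timedelta


def vary_number(value: float, rng: random.Random, max_fraction: float = 0.1) -> float:
    """Shift a number by a random fraction within +/- max_fraction of itself."""
    return value * (1.0 + rng.uniform(-max_fraction, max_fraction))


def vary_date(value: date, rng: random.Random, max_days: int = 30) -> date:
    """Shift a date by a random whole number of days within +/- max_days."""
    return value + timedelta(days=rng.randint(-max_days, max_days))


rng = random.Random()
print(vary_number(50000.0, rng))        # somewhere between 45000 and 55000
print(vary_date(date(1980, 6, 15), rng))  # within 30 days of the original
```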
2. Why is data masked?
Data is masked when it leaves the security of the database to help to prevent it from being misused for crime, to preserve privacy, to protect sensitive commercial data, and, more generally, to protect the rights of the owners of that data. It needs to be masked and encrypted when real data is required for legitimate reasons such as reporting, business analytics, epidemiology, development work and application testing. It is done as part of a range of measures in order to reduce as far as possible the risk of data breaches or leaks, and to minimise the exposure of data. It’s done to ensure that personal or sensitive data is modified in such a way that the data can no longer be attributed to a specific data subject without the use of additional information.
3. Assuming I know what data needs disguising, how do I perform data masking?
There are three main techniques:
Dynamic Data Masking
This is done for the purposes of training, user-acceptance testing, and wherever an application is unable to limit the display of data to just what is appropriate to the logged-in user. Data that appears on the screens of call center operators can have masking dynamically applied according to the role of the user, so that only the data that they actually need is visible. The database applies the masking technique ‘on demand’ just for the data that is requested. It cannot be used in any circumstances where the user is able to make ad-hoc queries against the data because it can be easily defeated using SQL.
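The idea of masking 'on demand' according to role can be sketched as follows. This is a Python illustration of the principle only, not how any particular database implements the feature: `MASKING_RULES`, the role names and the column names are all hypothetical. The stored row is never changed; only the copy handed to the caller is masked.

```python
# Hypothetical rules: column name -> roles allowed to see the clear value.
# Columns that are not listed are visible to everyone.
MASKING_RULES = {
    "card_number": {"billing"},
    "email": {"billing", "support"},
}


def partial_mask(value: str, visible: int = 4) -> str:
    """Hide all but the last `visible` characters; hide short values entirely."""
    if len(value) <= visible:
        return "*" * len(value)
    cut = len(value) - visible
    return "*" * cut + value[cut:]


def read_row(row: dict, role: str) -> dict:
    """Return the row as the given role may see it, leaving stored data unchanged."""
    visible_row = {}
    for column, value in row.items():
        allowed = MASKING_RULES.get(column)
        if allowed is None or role in allowed:
            visible_row[column] = value
        else:
            visible_row[column] = partial_mask(str(value))
    return visible_row


stored = {"name": "A. Customer", "email": "a@example.com", "card_number": "4111111111111111"}
print(read_row(stored, "support"))  # card number masked, email visible
print(read_row(stored, "billing"))  # everything visible
```

Because the real values are still present underneath, a user who can run arbitrary queries can often infer them, for example by filtering on the masked column, which is why this approach suits fixed application screens rather than ad-hoc access.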
On-the-fly Data Masking
This is a type of replication that allows development systems to have masked data from the production database, so that the masking process doesn’t need to be repeated every time a development sprint is initiated. Basically, test and development maintain a masked version of the data that is kept up-to-date by re-synchronizing it with the production data, but in a masked form.
Static Data Masking
This process does the data-masking to a copy of the unmasked data every time it is required. This would, for example, take a restored version of a database, or load the data from file, and make the required changes to it to pseudonymize it.
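A file-based version of static masking might look like the Python sketch below, which reads rows from a CSV source, substitutes one column from a list, and writes the masked copy. It is a minimal sketch under stated assumptions: `mask_csv`, `REPLACEMENT_NAMES` and the column names are hypothetical, and repeating the same replacement for the same original value is a design choice of this example so that repeated values stay consistent.

```python
import csv
import io
import random

# Hypothetical substitution list; a real one would be much longer.
REPLACEMENT_NAMES = ["Alex Example", "Sam Sample", "Jo Placeholder", "Chris Testcase"]


def mask_csv(source, target, column, replacements, rng):
    """Copy CSV rows from source to target, substituting every value in one column."""
    reader = csv.DictReader(source)
    writer = csv.DictWriter(target, fieldnames=reader.fieldnames)
    writer.writeheader()
    mapping = {}  # original value -> replacement, so repeats are masked consistently
    for row in reader:
        original = row[column]
        if original not in mapping:
            mapping[original] = rng.choice(replacements)
        row[column] = mapping[original]
        writer.writerow(row)


# In-memory demonstration with invented data.
unmasked = io.StringIO("id,name\n1,Ann Real\n2,Bob Real\n1,Ann Real\n")
masked = io.StringIO()
mask_csv(unmasked, masked, "name", REPLACEMENT_NAMES, random.Random())
print(masked.getvalue())
```

With real files, open both with `newline=""` as the `csv` module documentation recommends. Note that with a short list two different originals can receive the same replacement, which may or may not matter for the intended use.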
4. What are the risks of masking data?
Artefacts
It isn’t a good idea to perform data masking on a copy of your production database. A relational database isn’t designed to conceal the evidence of deletion or alteration of data: quite the contrary, any change leaves evidence in the log, in the data pages, in audit traces, and in memory. The evidence is either deliberate or incidental. Microsoft still have no way of guaranteeing that a value in a table can be deleted without any trace. Ideally, the data must be imported into the database after masking, but this will take time and can only be effective if it is part of a synchronization process. File-based obfuscation, where the data is masked outside the database in a ‘transport’ form such as CSV, is often practiced as a safer approach, but the data import process can take too long to be practicable.
De-anonymization or Un-masking
Sadly, data can be at least partially reconstructed from a partial masking. Privacy experts have published several techniques of combining data from several ‘anonymized’ sources to identify individuals, and reveal sensitive information about them.
5. Why would I ever need to mask data?
Masked data is sometimes termed ‘obfuscated’ or ‘pseudonymized’. You generally need masked data to test any data-driven applications but also to do staff training, to provide data for scientific research, for doing analysis on company data, for application development, bug fixing, and performance-optimization. Where several different parts of the organization use a similar application to access the information, any information that they don’t require for their role should be masked ‘dynamically’. This is best done within the database so that the administration can be done centrally.
The reason that you need to mask the data is that you are likely to hold commercially-sensitive or private data such as PII, SPI, PID, or SPD (see glossary) in the database and so there will be data that it is either illegal or irresponsible to make use of in ways that aren’t strictly controlled. In most countries whose privacy laws derive from OECD (Organisation for Economic Co-operation and Development) privacy principles, all information from which the person’s identity is ‘reasonably ascertainable’ must be secured.
There are two broad varieties of data that need to be kept secure: personally sensitive information about individuals and commercially-sensitive information about organizations. A list of an organization’s customers may do little to infringe their privacy, but it could hurt the organization if a competitor organization should obtain it.
Some of the uses of masked data, such as research or reporting, are obvious. The information that is being researched must still be there. Given the problems with data masking it would seem easier to generate entirely random data for the purposes of application development and testing, but two problems prevent this being effective: the ‘Mr Null’ or ‘Mr O’Brien’ problem (perfectly reasonable real data that can cause a bug to show itself), and the problem of the distribution of data matching as closely as possible the live data.
6. Is masking an effective way of allowing us to use real data?
Not really; not by itself. You can still get data leakage. Several re-identification algorithms have been demonstrated that allow insufficiently masked data to be used to identify individuals and reveal sensitive information about them. Masked data may not identify individuals directly by itself, but can do so when used in combination with other data. Some attributes, such as social security numbers, may be uniquely identifying on their own. However, other attributes, called quasi-identifiers or pseudo-identifiers, can, by themselves or in combination, be used to join with other data. To mitigate this risk, upcoming legislation states that masked data should be encrypted and subject to access controls.
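To make the role of quasi-identifiers concrete, the following Python sketch joins a 'masked' data set to a second, public one on a few shared attributes. Everything here is invented for illustration: the function name `link_records`, the column names, and the sample people. It is a toy version of the general idea, not any specific published algorithm.

```python
def link_records(masked_rows, public_rows, quasi_identifiers):
    """Match masked rows to public rows that share the same quasi-identifier values.

    A masked row is treated as re-identified only when exactly one person
    in the public data has the same combination of values.
    """
    index = {}
    for person in public_rows:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person)

    matches = []
    for row in masked_rows:
        key = tuple(row[q] for q in quasi_identifiers)
        candidates = index.get(key, [])
        if len(candidates) == 1:  # a unique match reveals who the row is about
            matches.append((candidates[0]["name"], row))
    return matches


# Invented data: names are masked, but postcode, birth year and gender remain.
masked = [
    {"name": "****", "postcode": "AB1 2CD", "birth_year": 1970, "gender": "F", "diagnosis": "asthma"},
    {"name": "****", "postcode": "ZZ9 9ZZ", "birth_year": 1985, "gender": "M", "diagnosis": "diabetes"},
]
public = [
    {"name": "Jane Example", "postcode": "AB1 2CD", "birth_year": 1970, "gender": "F"},
    {"name": "John Sample", "postcode": "ZZ9 9ZZ", "birth_year": 1985, "gender": "M"},
    {"name": "Jim Sample", "postcode": "ZZ9 9ZZ", "birth_year": 1985, "gender": "M"},
]
for name, row in link_records(masked, public, ["postcode", "birth_year", "gender"]):
    print(name, "->", row["diagnosis"])
```

In this toy data the first masked row matches only one person and so is re-identified, while the second matches two people and stays ambiguous. The more quasi-identifiers that survive masking, the more rows become unique.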
7. Can’t you just tell me what data is sensitive, so I can mask it out?
There are many obvious categories of personal data that need to be masked. These categories will probably be itemized by the legislative framework that your organization has adopted and are likely to include not only name, social security number, date and place of birth, mother’s maiden name, employment record, education, medical information, biometric records or credit card information, but also data revealing health, racial or ethnic origin, political opinions, religious or philosophical beliefs, trade union membership, etc. It is wrong to think that the only problem is that the data itself is sensitive; it is rather the uses that the data is put to. The same data may be sensitive in one context but not another. If, for example, I can discern through statistical analysis of freely-available data private matters about individuals, such as sexual orientation, propensity for disease, or impulsiveness through their shopping habits, then privacy has still been compromised.
8. Why didn’t previous generations of application developers have this problem?
For development work, they had a similar problem but solved it in a different way. All that has changed recently in the nature of the problem is that it is now easier to transport and misuse vast amounts of data.
Personnel who worked on applications and needed access to the information contained in the production data had to have security clearance and work in the production environment. In large corporations, banks, and governments, testing of database applications, and some of the development work for bug-fixing, were done by a small group of experts who were security-checked. I have twice experienced this myself in my youth. In one case, the organization employed a retired police detective to check on the background of staff who needed access, and several people who had known me for years told me they had been contacted to check on me. In short, all possible steps were taken by the organization to be sure of the integrity of the experts who were required to test on live data. We conducted our work in a data center, under data center rules, supervised by security people who were allowed to search us. The risk was minimized, but the deployment of new releases was subject to inevitable delay due to the handover process.
In short, the development problem was solved by restricting access to the live data to only the bare essential required for testing and bug-fixing.
Glossary of Sensitive and Personal Data Terms
PII | Personally identifiable information (United States)
SPI | Sensitive personal information
PID | Personally identifiable data (Europe)
PSD | Personally sensitive data
SPD |
CSD | Commercially sensitive data
CSI | Commercially sensitive information
References
- Data Master Row-Internal Synchronization Rules
- The Parodist: A SQL Server Application
- Unmasking the Dynamic Data Masking
- De-anonymization
- How Real a Threat Is “De-Anonymization”?
- Hello, I’m Mr. Null. My Name Makes Me Invisible to Computers
- Computers Have Loads O’ Trouble with Apostrophes
- Why pseudonymization is not the silver bullet for GDPR
- Data Loss vs. Data Leakage Prevention: What’s the Difference?
- Pseudonymization and the Inference Attack
- Only You, Your Doctor, and Many Others May Know
- De-anonymizing South Korean Resident Registration Numbers Shared in Prescription Data
- The regime for processing special categories of data has not changed, or has it?