Data Masking in Principle
This is the first of two articles describing the principles and practicalities of masking data in databases. It explains why an organization sometimes needs masked data, the various forms of masked data we can use, the sorts of data that need to be masked, and the potential pitfalls.
What is Data Masking?
Data masking is, put simply, the process of deliberately making the data ‘incorrect’. This seems as strange as cooking with a sauce that renders the food inedible, but there are always times when organisations need masked data. More accurately, data masking, sometimes called data sanitization or data protection, refers to the technology and processes used to disguise the truth of the data. We disguise or mask data in order to protect sensitive information, while providing a functional alternative for occasions when the real data is not needed, or where it must be impossible to relate the data to real people.
This article describes the principles of data masking and when and why we need it. Data Masking in Practice describes the practicalities of implementing data masking in complex relational databases.
Why mask data?
Where organisations hold data, they take on the responsibility of becoming its custodians, which means they are obliged to protect it. The custodian of data faces a difficult balancing act: on one hand, conforming with privacy laws, discretion, and the obligations of confidentiality; on the other, the need to make certain parts of that data available for development work, training, testing, medical research, open government, security, and the law. The data professional is faced with the practical task of providing data, legally, in a way that cannot possibly reveal, whether directly or by inference attack, either private information about individuals or confidential business information.
Sometimes there is only a limited opportunity for disguising data, especially if you are required to provide a data extract for medical research or legal evidence. For medical, and particularly epidemiological, research it is often vital that the medical information and locations are correct. For example, if the long-term effects of a nuclear reactor accident are being investigated, the locations cannot be masked. We cannot tamper at all with information that is required for court proceedings. Conversely, data for staff training need only look realistic to casual inspection. In general, data must be masked in a way that fits the purpose for which it is required.
Why is masked data required, within an organization?
Databases generally do more than just serve up data to applications in response to queries. Were life this simple, one could deal with the responsibilities of curating personal or sensitive data simply by implementing access controls in such a way that users could access only the part of the data they need for their role within the organisation. Each user gets to know only what they need to know.
The trouble is that most corporate and organisational databases provide data for downstream systems. By ‘downstream systems’ I mean the organization’s reporting, supervision, and management systems that rely on updating their data from this database. There are likely to be people analyzing the organization’s data for trends and relationships, to help with planning, procurement and so on. To do their work, they will require extracts and reports, and every copy of the data must be kept under the same strict security and control regime as the database.
Data is requested for a variety of reasons, and no single masking solution will cover such diverse uses as user training, sales demos, legal reports, audit, research, and application software testing. User training or sales demos need no more than verisimilitude, whereas performance testing will require a volume and distribution of data similar to the real data. Data for research must make it impossible to identify any of the individual people to whom it relates. These objectives are so diverse that they require different strategies and techniques for disguising the data.
Minimizing data extracts
The best way to protect sensitive data is simply not to include it in the data set that you wish to distribute. Although many parts of the organisation will insist that they need all the data, I’ve never encountered a case where this has turned out, in retrospect, to be true. In fact, they usually need just a subset of columns, they often want only aggregated or summarized data, and quite often it turns out that the data they really need requires no protection at all.
Many requirements for masked data, such as for distributing reports publicly, can be accommodated far better by creating a view, or set of views, that gives the required data without any of the unnecessary sensitive data fields. Why extract a credit card number, for example, if it must be obfuscated anyway? Why disguise people’s names when they need not be in the data extract?
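As a minimal sketch of this approach (all schema, table and column names here are hypothetical), a view can expose only the columns that a report actually needs, and users can be granted access to the view rather than the base tables:

```sql
-- Expose order data for reporting without the customer's name,
-- email address or card details ever leaving the base tables.
CREATE VIEW Reporting.OrderSummary
AS
SELECT O.OrderID,
       O.OrderDate,
       O.TotalValue,
       C.City,          -- coarse location only; no address line
       C.CustomerType
FROM   Sales.Orders    AS O
JOIN   Sales.Customers AS C
       ON C.CustomerID = O.CustomerID;
GO
-- Report users see the view, never the underlying tables.
GRANT SELECT ON Reporting.OrderSummary TO ReportingRole;
```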
Aggregation is another obvious way of minimising the risk of data ‘leakage’, without the need for data masking. The more that data is aggregated, the less likely it is that the personal data of individuals can be extracted. If, for example, the purchasing department needs to know about trends in current sales, to predict how best to maintain stock levels, it is better to do as much of the trend analysis as possible in the database, because the data there will be under carefully planned access controls. It helps to keep the surface area of data protection manageable. Aggregated data, where daily sales are grouped by age range, gender, location or whatever else, is much safer, and provides just as much information.
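A minimal sketch of such an aggregation, against a hypothetical sales table, might look like this; the HAVING clause anticipates the point about small groups made below:

```sql
-- Daily sales grouped by coarse demographic attributes. The HAVING
-- clause suppresses any aggregation point small enough to risk
-- identifying an individual.
SELECT CAST(O.OrderDate AS date) AS SaleDate,
       O.AgeRange,
       O.Gender,
       O.City,
       COUNT(*)          AS Sales,
       SUM(O.TotalValue) AS Revenue
FROM   Sales.OrdersWithDemographics AS O
GROUP BY CAST(O.OrderDate AS date), O.AgeRange, O.Gender, O.City
HAVING COUNT(*) >= 5;  -- drop groups smaller than a 'handful'
```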
The use of aggregation allows you to use the real data, because it becomes very unlikely that personal or sensitive data can be exposed, as long as the count for each aggregation point is more than a handful. However, aggregation will never eliminate the need for pseudonymized data (see the later section, The basic types of masked data), because pseudonymized data allows researchers to do more exploratory analysis. If, for example, you don’t know the predictive factors for trends or, if you are lucky, cause and effect, you’d want to have all the variables and do multivariate analysis to work out the relationships. In medical research, you might need to track down the reasons for progressive lung diseases without a preconceived idea as to the cause, and unless you had the whole dataset, including addresses, you might miss a strong correlation with proximity to a particular industrial process or road system.
Where researchers already know the variables they wish to study, you may not need to provide any sort of masked data: the researchers can safely be provided with correlation matrices from which multivariate factor analysis can be done.
Data masking in DevOps processes
In any database Continuous Integration (CI) process, there comes a point where you need data that is as close in nature to the production data as possible. However cleverly you create entirely fictional development data, there comes a point during the release process where you need to check that there are no unexpected hitches, such as an indexing problem. You can never expect to create spoof data with a distribution identical to that of the actual data, and if the distribution of the data is different, so too will be the query plans that the database generates. Standard practice is to do this check with real data under full security, either in Staging or during a specialized performance test cycle.
The most usual precaution is to use operational staff with sufficient security clearance to use production data, but this is sometimes insufficient. For a DevOps-based CI process, it is quicker to fix a problem during the release process with a patch using development skills and knowledge of the system. To do this, you must have a mechanism in place that cleans the data sufficiently before it is required for the CI process.
There are a few different ways of doing this, but they most often involve a separate process that masks the production data and then backs up the masked database for use within the CI process. This process will need to be repeated regularly, most easily via automation. The data masking process will modify or delete data as needed to meet any regulatory requirements. You need to remove any risk that your CI databases could still hold sensitive information and could expose it to unauthorized people as part of a subsequent deployment process. All the data cleansing mechanisms and processes should be developed by individuals within the organization who have the appropriate permission to view production data.
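As an illustration only, a minimal sketch of such a process, using hypothetical database, table and file names, might restore the latest production backup onto a secured server, overwrite the identifying columns, and then take a backup of the masked copy for the CI process to restore from:

```sql
-- Run on a secured server, by staff cleared to see production data.
RESTORE DATABASE CustomersMasked
FROM DISK = N'D:\Backups\Customers_Production.bak'
WITH MOVE N'Customers'     TO N'D:\Data\CustomersMasked.mdf',
     MOVE N'Customers_log' TO N'D:\Data\CustomersMasked.ldf',
     REPLACE;
GO
USE CustomersMasked;
GO
-- Overwrite the directly identifying columns with generated values.
UPDATE dbo.Customers
SET    FullName = CONCAT('Customer-', CustomerID),
       Email    = CONCAT('customer', CustomerID, '@example.com'),
       Phone    = NULL;
GO
-- The CI process restores only this masked backup.
BACKUP DATABASE CustomersMasked
TO DISK = N'D:\Backups\Customers_Masked.bak'
WITH INIT;
```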
The basic types of masked data
There are two different types of masking referred to in the legislative frameworks: pseudonymization and anonymization.
Pseudonymization attempts to retain most of the data, while masking the directly identifying data elements in a consistent way, to allow the data to be used for meaningful reporting or analysis without revealing individual data. For example, pseudonymisation of a database will alter the names and other identifiers of individuals but will leave the rest of the data, such as shopping history or medical interventions, intact.
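A minimal sketch of this idea, assuming a hypothetical Customers table, is to generate a one-to-one mapping of real identifiers to pseudonyms and then apply it, so that each individual always receives the same pseudonym while the rest of their data stays intact:

```sql
-- Build a one-to-one, randomly ordered mapping of customers to pseudonyms.
SELECT CustomerID,
       CONCAT('Customer-',
              ROW_NUMBER() OVER (ORDER BY NEWID())) AS Pseudonym
INTO   dbo.CustomerPseudonyms
FROM   dbo.Customers;

-- Apply the mapping consistently; shopping history is left untouched.
UPDATE C
SET    C.FullName = P.Pseudonym
FROM   dbo.Customers          AS C
JOIN   dbo.CustomerPseudonyms AS P
       ON P.CustomerID = C.CustomerID;
```

Note that the mapping table itself is as sensitive as the original data, and must be stored securely or destroyed.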
Anonymized data is simply data from which the individuals, the ‘data subjects’, can no longer be identified. An anonymization process will make it impossible, or at least extremely impractical, to identify the data subject, while also attempting to maintain the overall verisimilitude of the data, so that it looks real. This means that it isn’t enough just to mask the directly identifying data elements such as a person’s name or ID. It requires additional measures to prevent identification, which will vary depending on the data and why you need to anonymize it, but which will generally involve shuffling, scrambling or otherwise changing the relationships between the people and the rest of the data. For example, the shopping habits will be changed as well as the names, or the medical interventions will be assigned to other individuals.
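One of many possible shuffling techniques, sketched here against hypothetical Orders and Customers tables, is to pair a random ordering of the orders with a random ordering of the customers, so that each shopping history ends up attached to someone else:

```sql
-- Randomly re-pair orders with customers, breaking the real link
-- between a person and their shopping history.
WITH O AS
(
    SELECT OrderID,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM   dbo.Orders
),
C AS
(
    SELECT CustomerID,
           ROW_NUMBER() OVER (ORDER BY NEWID()) AS rn
    FROM   dbo.Customers
)
UPDATE Ord
SET    Ord.CustomerID = C.CustomerID
FROM   dbo.Orders AS Ord
JOIN   O ON O.OrderID = Ord.OrderID
JOIN   C ON C.rn = ((O.rn - 1) % (SELECT COUNT(*) FROM dbo.Customers)) + 1;
```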
The advantage of pseudonymization is that all the relational links remain intact, and the distribution of data is guaranteed to be like the real data. The disadvantage is that it comes with obvious security implications that will limit its use outside of environments with security protocols matching those in place for the live data. With anonymization, the resulting data is safe for use, but if you need to start fiddling with the FOREIGN KEY references to prevent inference attacks, it becomes harder to maintain the correct distribution.
With pseudonymised data, you attempt to remove all data that could identify an individual. You might think that masking out the name would suffice, but a skilful analyst can use an ‘inference attack’, or ‘triangulation’, to identify individuals. This is why the GDPR still insists on pseudonymized data having the same precautions, such as encryption, as the unmasked data, and why it makes sense for developers to use anonymized data for creating, maintaining and testing applications that hold personal or sensitive data.
What sort of data needs to be masked?
Although personally identifiable data is currently the focus of legislation, there are plenty of other types of data that need to be masked before they can be made public, such as business and financial information, or information that could aid the carrying out of a criminal offense. The organization that processes or stores the data must determine which data elements need to be masked, and how, based on the legislative framework within which it operates.
Personally identifiable information (PII) is common to most data masking requirements. PII is any data that can be used to identify a living person, and includes such elements as name, date of birth, National Identification Number, address details, phone numbers or email addresses, disabilities, gender identity or sexual orientation, court orders, electronic wage slips, union affiliations, biometric and ‘distinguishing feature’ information, references to the serial number of devices such as laptops that are associated with, or assigned to, a person.
However, though a single data set can be successfully anonymized, the more data sets that are combined, the easier it is to identify individuals. An example of this is data that identifies where an individual was at a given time. This can be a hotel or travel booking, internet access record, meeting appointment, or GPS data. By combining two or more of these data sets, we can often identify a record as belonging to an individual.
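To make the idea concrete, here is a sketch of such a ‘linkage’ attack, with entirely hypothetical table and column names: joining a supposedly anonymized set of hotel stays to a public list of event attendees, on place and date alone, can pin a pseudonym to a name:

```sql
-- If a pseudonym coincides with the same named person at more than
-- one place and time, the identification is close to certain.
SELECT HS.Pseudonym,
       CA.FullName,
       COUNT(*) AS Coincidences
FROM   dbo.HotelStays_Anonymized AS HS
JOIN   dbo.ConferenceAttendees   AS CA
       ON  CA.City = HS.City
       AND CA.EventDate BETWEEN HS.CheckIn AND HS.CheckOut
GROUP BY HS.Pseudonym, CA.FullName
HAVING COUNT(*) > 1;
```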
To protect data, we need to know in which columns, of which tables, and in which databases, it is held. Data masking is relatively easy when each item of data is held in just one column, but real life isn’t like that. If the data has been denormalized to speed up reporting, then we must find every table to which the data we want to mask has been added.
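A reasonable starting point, in SQL Server, is to search the standard INFORMATION_SCHEMA views for column names that suggest personal data; the name patterns below are only illustrative, and a clean result is no guarantee that nothing has been missed:

```sql
-- List every column whose name hints that it may hold personal data.
SELECT TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME, DATA_TYPE
FROM   INFORMATION_SCHEMA.COLUMNS
WHERE  COLUMN_NAME LIKE '%name%'
   OR  COLUMN_NAME LIKE '%email%'
   OR  COLUMN_NAME LIKE '%phone%'
   OR  COLUMN_NAME LIKE '%birth%'
   OR  COLUMN_NAME LIKE '%address%'
ORDER BY TABLE_SCHEMA, TABLE_NAME, COLUMN_NAME;
```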
What can go wrong?
It is a frighteningly complex task to anonymize or pseudonymize the data of an entire database for development or testing. While it is relatively simple to pseudonymize the contents of a single table, or even a handful of tables, so that the real data cannot be gleaned by casual inspection, it is more difficult to achieve a robust pseudonymization that is proof against decoding by a skilled person, and there have been many cases where personal information about individuals has been extracted from public, ‘pseudonymized’ data. You also need to mask data that, by itself, seems innocuous but, when combined with other publicly available data, can identify individuals.
Any sort of compromise that leaves some of the real data in place can cause a breach, such as via an inference attack, and so will lead to restrictions on its use. According to the GDPR guidelines, pseudonymized data is best thought of as weakly-encrypted data that can only be used in the same context as the real data:
Recital 26 GDPR
‘Personal data which have undergone pseudonymization, which could be attributed to a natural person by the use of additional information, should be considered to be information on an identifiable natural person’
This is why pseudonymized data must be kept separately, under secure conditions, and be subject to technical and organizational measures that ensure legal compliance.
The alternative is to anonymize the data. However, even with a dataset that has been through an anonymization process, there is always a risk of being able to combine the unchanged data with other datasets to make possible or probable identifications of individuals or small groups. If it is possible to do such an ‘inference attack’ to identify even one individual, it is, by definition, not anonymized. Anonymized data can be used for development work, because no personal or sensitive data can be gleaned from it. However, it doesn’t meet many development requirements unless the masked data also retains approximately the same distribution and characteristics as the original data.
Databases aren’t designed to erase data without trace. Quite the contrary: databases carefully log every change, and they are happy to leave old data on the data pages until the space is required by fresh data. If you are intent on producing an anonymized version of a database, the safest approach is to re-create the database entirely, by build and bulk insert, or to use Common Criteria Evaluation Assurance Level 4+ (EAL4+) with SQL Server Enterprise Edition.
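A minimal sketch of the build-and-bulk-insert approach, with hypothetical names and file paths, is to create a brand-new database from the source-controlled schema scripts and then load only the already-masked data, so that no trace of the original rows ever touches its data pages or transaction log:

```sql
CREATE DATABASE CustomersAnonymized;
GO
USE CustomersAnonymized;
GO
-- Run the schema build scripts here (tables, keys, indexes), then
-- load the masked data that was prepared and exported elsewhere.
BULK INSERT dbo.Customers
FROM 'D:\MaskedData\Customers.csv'
WITH (FORMAT = 'CSV', FIRSTROW = 2);  -- FORMAT = 'CSV' needs SQL Server 2017+
```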
Summary
The safest data protection policy for personal or sensitive data is simply to exclude it from the sample, or to use aggregation to minimize the risk that the personal data of individuals can be extracted. However, there are times, such as during Database DevOps processes like CI, when we need to use data masking techniques to ‘pseudonymize’ the data. This allows us to minimize the risk of data leakage, while retaining the characteristics and distribution of the real data.
However, even small extracts of data need to be created with caution, if they are for public consumption, whether for reporting, to create sample data for DevOps processes, or for training and demonstration purposes. A lot can go wrong, and sensitive data can ‘hide’ in unexpected places.
Having covered the principles of data masking in this article, the next article, Data Masking in Practice, tackles the practicalities. It examines masking techniques and the practical difficulties of masking a whole database safely and effectively, while retaining its referential integrity and distribution characteristics.
Read the next article: Data Masking in Practice.