Since 2004 I’ve helped more than 800 companies incorporate data masking into their database deployment strategy for non-production databases, mainly using the Data Masker tool, which is part of SQL Provision. The goals are always the same. First, understand what sensitive data, or Personally Identifying Information (PII), exists in the source database. This might not be as obvious as you think!
Second, devise a repeatable, automated process that will ensure that the data is always protected, and is masked and obfuscated such that it can’t be used to identify the data subject, when it is used outside the secure environment of Production.
Third, achieve all this while still ensuring that the data remains fit for its intended purpose. Cleansed and anonymized data is of limited use to the Testing or BI functions in an organization, for example, if the masking process fails to retain any of the basic characteristics and distribution of the original data.
This article explores what’s required for the effective masking of dates and times in your data. Subsequent articles will cover masking strategies for a whole range of PII data.
Why start with dates?
Certain data is easy to recognize as “PII”, such as date of birth, National Identification Number, home address and contact number. However, there is a lot more to consider than that. Even an innocuous looking date, related to a person’s activities, could be used to identify that person if it “puts them at a particular place at a point in time”. For instance, what if the obvious PII data elements are masked, but the following are still available:
- The day and time of an appointment
- The address of the appointment location
- The reason for the appointment
- Who it was with
- …and any follow-on booking
Such data can conspire to identify a person through triangulation and you’ll find many discussions about this with a simple Google search. Here’s a high-profile example which I can share with confidence. Back in 2015, a well-known British comedian was photographed by a London Metropolitan Police helicopter as he stepped outside his hotel with his PA.
I remember this example well, mainly because of my reaction to the subsequent statements issued that denied that this was Personally Identifiable Information (PII), and so therefore not under the auspices of the UK Data Protection Act.
I confess, I shouted out loud at the news feed!
“That’s M*****l M*******e! I recognize his hair! You’ve just published a picture of him with time and location data. Of course, it’s PII!!!”
In short, all PII must be masked and obfuscated to the point that it is no longer possible to directly identify a data subject from the data. This includes a strategy for masking dates and not just obvious PII such as dates of birth but any date that, along with other information, could be used to place a person at a specific place, at a point in time.
Masking date columns in databases
Moving on with a calmer head, let’s consider a different example; a relational table that contains medical records, and includes several date columns.
Let’s assume for a second that we’ve already masked the “obvious” PII data; those aren’t the data subjects’ real medical numbers or names. The question, again, is: can this data now be safely distributed to non-production environments?
I’d argue it can’t; it depends what else I know about the data. If I know these are medical records for a certain state or area, and I guess the event dates are hospital visits, then I could potentially still identify data subjects by correlating the time with a location. A Simple-Talk article reports how a hacker was able to identify the medical records of the Governor of Massachusetts, by correlating leaked, partially masked, medical data with newspaper articles that reported details of hospital visits (due to accidents, muggings and so on).
What does this mean in practice regarding date masking? As noted earlier, if we simply substitute all date column values with random dates, then we will fail to retain any semblance of a realistic date distribution and we destroy the time interval between those dates, which may be critical for testing response-time data.
What’s the legislative guidance?
Most people will, by now, be aware of the GDPR legislation which, in EU states and the United Kingdom, comes into full force on May 25th, 2018. Enforcing bodies can impose substantial fines if your organization fails to take all appropriate measures to protect any personal and sensitive data that the organization stores or processes. In addition, there are many sets of legislation that offer specific guidelines for what is required to safeguard specific types of databases, such as medical or financial data.
However, how can we be sure that out processes for protecting data, outside production, will comply with such legislation? For medical data, the United States provides a set of legislation called HIPAA (Health Information Portability and Accountability Act), which gives guidelines on the key data content to consider when masking and obfuscating data, to achieve “Safe Harbor” practices.
From a data masking perspective, this means that you build the rules to achieve the Safe Harbor objectives, save them for re-use as part of the non-production database deployment workflow but then be sure to instrument a practice that identifies new or deprecated items in the database and considers them as part of the Change Control process.
Fortunately, HIPAA is one of the few places where I’ve seen date considerations clearly, or simply, outlined. We have a whitepaper, which covers these compliance points. Let’s see how we go about implementing a data masking strategy for date columns, which both ensures that our database deployment processes comply with these regulations, and that the resulting masked data set remains fit for development and testing purposes.
Implementing a data masking strategy for dates
Let’s return to our simple table of medical data.
Our obligation is to mask the date columns such that individuals cannot be identified. For the
Person_DOB column, for example, for compliance we’ll need to mask the month and day components for all cases, and for anyone over 89 years of age, we’ll also need to mask the year. The “Event” date columns will need to be masked that they cannot place an individual at a certain place and time.
Our challenge is to mask the dates in such a way that the relationship between them remains consistent. If we’re trying to develop a report that highlights a shortcoming in a service provision, within a specified timeline, then we’ll fail hopelessly if our start data content has been randomly replaced. Also, many tables have a column that timestamps each row (
MeasurementDate, and so on), and then we see queries with
WHERE OrderDate >= @ThresholdDate predicate. If your date distribution is “random” you’ll have no idea if query performance characteristics seen in development will match production.
It’s for this reason that Data Masker provides a range of date management techniques, which allow you to change the date values to satisfy regulatory requirements, but do it in a correlated or consistent manner, retaining the true date relationships.
You can see these techniques in the Data Masker replacement datasets list:
As a quick example, I’ll set a rule to randomly vary each of the columns in this table in a correlated manner, using a random date variance value between -53 days and +65 days. I’ve set the bounds to be consistent for each column and that means that if one column is varied by, for instance, “-5” then every other column in the row will move the same way.
And the result is:
All the dates have moved in that correlated (or consistent) manner. Just what we need to keep our data demographics intact!
Delivering masked databases downstream means ensuring that the resulting data is masked but still credible and therefore usable. All too often there’s a reluctance to use masked data since it’s perceived to be just X’s and N’s and therefore not useable. That’s not the case if you use Redgate’s Data Masker tool.