One question that frequently comes up when discussing the software development lifecycle for databases is whether encryption is sufficient to protect sensitive data in non-production environments.
While it’s appropriate to encrypt some data in your development environments, encrypted fields are only a subset of sensitive data. Other types of sensitive data require different handling in the SDLC.
In this post, I give a quick overview of encryption, de-identification / masking, and synthetic data, along with notes on the pros and cons of each approach.
What is encryption, and where is it useful?
A few kinds of encryption are popularly used across database environments. Since these methods provide different types of protection, multiple types of encryption are often used in combination with one another. Here’s a high-level overview of three popular types of database encryption:
| Type | What it does | Gotchas |
|---|---|---|
| Encryption “at rest” | Prevents someone with access to the file system from reading the data in plain text | Anyone with database access can still read the data; it is not encrypted at that level |
| Backup encryption | Prevents someone with access to database backup files from reading the data in plain text | Anyone with database access can still read the data; it is not encrypted at that level |
| Column-level encryption | Prevents someone with access to the column in the database from reading the data in plain text | Changes how the data may be accessed compared to non-encrypted columns; heavily impacts query behavior and performance |
Often, when people think of encrypting sensitive data in the SDLC, they are thinking of a form of column-level encryption. One risk with development environments in the SDLC is that often these environments grant access to a wide variety of employees and contractors, which can allow access to sensitive data.
Why not just encrypt it all?
The answer lies in the gotcha: column-level encryption changes the manner in which a database may access the data in that column.
We should still encrypt some data in development environments: data that is encrypted in production should also be encrypted in other environments, so that the databases in all environments perform similarly, allowing accurate testing and predictable deployments.
The types of data which commonly use column-level encryption include passwords and very sensitive identification numbers such as Social Security Numbers in the United States. These columns generally contain data which is a “secret” that is looked up by another value — such as using a login name to return the value for a password associated with that login. (In truth, a lot of these “secrets” are actually hashed in a database rather than encrypted, which is why you may get a notification to change your password from a vendor who has had a data breach even if they claim to have “encrypted” the data – see Troy Hunt’s excellent article for more detail.)
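To make the hashing point concrete, here is a minimal sketch of how a stored “secret” like a password is typically protected: the database never holds the password itself, only a salted, slow hash that is re-derived and compared at login time. The function names and iteration count here are illustrative choices, not a reference to any specific vendor’s implementation.

```python
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Derive a slow, salted hash; only the salt and digest are stored,
    never the password itself."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    """Re-derive the hash from the login attempt and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 200_000)
    return hmac.compare_digest(candidate, stored_digest)

salt, stored = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, stored))  # True
print(verify_password("wrong guess", salt, stored))                   # False
```

Because hashing is one-way, a breached table of digests cannot be decrypted back into passwords, which is exactly the property plain “encryption” does not guarantee if the key also leaks.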
There is additional sensitive data which is typically NOT encrypted in production environments because encryption would impact performance and limit the type of queries which could be run against these columns.
This sensitive data includes personally identifying information (PII), which typically is defined to include things like names, physical addresses, email addresses, financial information, and more. In production environments, this data is often protected with a combination of encryption at rest, backup protection, and carefully limiting who may access the data in the database (principle of least permissions).
In sandbox, dev, and test environments, broader access to the databases is required — so we need additional tools
We need to protect PII and other sensitive data which is not suitable for encryption in all our non-production environments. There are multiple reasons to protect this data:
- Regulations such as GDPR, HIPAA, the CCPA, and more require it
- Standards such as NIST 800-53 in the United States recommend it (and will soon require it for some US government entities)
- It protects your organization from brand-damaging data breaches
- It’s the right thing to do for your customers
What is de-identification / masking, and how is it different from encryption?
The terms “data masking” and “de-identification” are roughly equivalent. To mask or de-identify typically means to replace sensitive data with values that have similar characteristics. Replacement names, for example, may be drawn from a dataset entirely separate from the production database, and the replacement values may additionally be shuffled so that masked names end up paired with different addresses than the originals were.
The combination of using replacement values with additional techniques such as shuffling helps ensure that the masked dataset cannot be easily reverse engineered.
Generating masked data comes with a few technical challenges. The challenges which Brent Ozar outlined here are all still relevant. A high-level summary is that you need to:
- Maintain the storage profile / have a way to replace values with similar characteristics
- Maintain distribution patterns of the data (without making it easy to reverse engineer)
- Keep referential integrity across tables
- Develop a repeatable process to apply the masking which can run quickly
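One common way to meet the referential-integrity requirement above is deterministic masking: a keyed function maps each original value to the same replacement everywhere it appears, so joins across tables still line up. This is a minimal sketch of that idea; the key, the replacement list, and the function name are all hypothetical.

```python
import hashlib
import hmac

MASKING_KEY = b"non-production-only-secret"  # hypothetical key, kept out of production
REPLACEMENT_NAMES = ["Alex Rivera", "Sam Chen", "Priya Patel", "Jordan Lee"]  # separate dataset

def mask_name(original):
    """Map an original value to a replacement deterministically: the same
    input always masks to the same output, in every table that holds it."""
    digest = hmac.new(MASKING_KEY, original.encode(), hashlib.sha256).digest()
    index = int.from_bytes(digest[:4], "big") % len(REPLACEMENT_NAMES)
    return REPLACEMENT_NAMES[index]

# The same customer name masks identically in the customers and orders tables:
print(mask_name("Maria Garcia") == mask_name("Maria Garcia"))  # True
```

Because the mapping is keyed, someone without the key cannot rebuild the original-to-replacement table, which helps with the “don’t make it easy to reverse engineer” requirement; rotating the key produces a freshly shuffled dataset.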
What about those encrypted columns? We don’t need to mask those, right?
Well… actually you probably should, but maybe not in the order you are thinking of.
When data is sensitive enough to warrant column-level encryption in production, you generally want to replace the contents of those columns with masked data (say, a set of randomly generated passwords), and then encrypt the replacement values using the same encryption method as production, but with a separate encryption key that is specific to the non-production environment.
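As a sketch of that replace-then-protect step, the snippet below swaps every production password hash for the hash of a freshly generated random placeholder, applying the same kind of protection production uses but derived entirely from dev-only values. The row shape and function name are hypothetical.

```python
import hashlib
import os
import secrets

def mask_password_column(rows):
    """For each row, discard the production credential and store the salted
    hash of a random placeholder instead; non-secret columns are untouched."""
    masked = []
    for row in rows:
        placeholder = secrets.token_urlsafe(16)  # random value, never a real password
        salt = os.urandom(16)
        digest = hashlib.pbkdf2_hmac("sha256", placeholder.encode(), salt, 100_000)
        masked.append({**row, "salt": salt, "password_hash": digest})
    return masked

users = [{"login": "alice", "salt": b"", "password_hash": b""}]
print(mask_password_column(users)[0]["login"])  # alice
```

The result looks and behaves like the production column (same types, same storage profile), but no real credential ever reaches the non-production environment.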
In cases where some sensitive data is not needed for development, customers often choose to subset the data, meaning the data is not included in that environment at all: it is completely removed from the database copy. Subsetting is often used in combination with de-identification / masking.
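The tricky part of subsetting is that removing parent rows must also remove the child rows that reference them, or the dev copy ends up with dangling foreign keys. A minimal sketch of that cascade, with made-up table shapes:

```python
def subset(customers, orders, keep_ids):
    """Keep only the chosen parent rows, then drop child rows whose parent
    is gone, so every foreign key in the dev copy still resolves."""
    kept_customers = [c for c in customers if c["id"] in keep_ids]
    kept_ids = {c["id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in kept_ids]
    return kept_customers, kept_orders

customers = [{"id": 1}, {"id": 2}, {"id": 3}]
orders = [{"order_id": 10, "customer_id": 1}, {"order_id": 11, "customer_id": 3}]
kept_c, kept_o = subset(customers, orders, keep_ids={1, 2})
print(len(kept_c), len(kept_o))  # 2 1
```

Real subsetting tools do this across many levels of foreign keys, but the principle is the same: the kept set must be closed under referential integrity.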
Can you do research or analytics with de-identified data?
Yes, you can absolutely apply both subsetting and de-identification techniques to data which will be used by analysts.
The key here is to understand clearly which parts of the data are to be used for analysis and how they will be used. With that understanding, you can ensure that the masking techniques you apply maintain the distribution and content of the data in ways that allow accurate analysis, while still protecting it.
In this case, a close collaboration between the team doing the masking and team members who will be performing the analysis is required.
If this level of close collaboration isn’t possible, teams sometimes develop and test their research and analysis tooling against a de-identified dataset, and then a separate team with permission to access the raw data runs that tooling against the sensitive data.
What is synthetic data, and when should it be used?
In some organizations, production data is simply not allowed to be used in development and test environments. This policy is sometimes established in highly regulated industries for very sensitive data.
For these organizations, the only option available for development and test environments is to generate data from scratch, with no reference to the production dataset.
While automation and tooling may speed up the ability to generate this data, these organizations often face challenges in creating synthetic datasets which support quality testing, for reasons similar to the challenges of masking / de-identification. The synthetic data:
- Needs a similar storage profile to the production data (but must be produced without reference to it, which is very challenging)
- Needs similar distribution patterns to the production data (again, without being created with reference to that data)
- Needs similar relationships across tables to the production data
- Needs to change in scale and distribution over time as the production dataset changes
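A toy generator makes the difficulty above concrete: everything about the shape of the output (name pools, email format, the skew of the balance column) has to be guessed, because nothing can be read from production. All names and distributions below are invented for illustration.

```python
import random

random.seed(42)  # repeatable runs make the generated fixture stable

FIRST = ["Ava", "Liam", "Noah", "Mia"]    # made-up seed lists, not production values
LAST = ["Smith", "Jones", "Khan", "Lopez"]

def synthetic_customers(n):
    """Generate customer rows entirely from scratch; the shapes chosen here
    are guesses at what production looks like, which is the core difficulty."""
    rows = []
    for i in range(1, n + 1):
        first, last = random.choice(FIRST), random.choice(LAST)
        rows.append({
            "id": i,
            "name": f"{first} {last}",
            "email": f"{first.lower()}.{last.lower()}{i}@example.com",
            "balance": round(random.lognormvariate(3, 1), 2),  # skewed, like many real money columns
        })
    return rows

print(len(synthetic_customers(100)))  # 100
```

If the guessed distributions drift from reality (say, production balances grow tenfold), tests against this data stop predicting production behavior, which is exactly the risk described above.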
Because these goals are nearly impossible to achieve fully with synthetic data, it is much more difficult for developers in these environments to understand how changes will perform prior to deployment.
Where to go from here
If you’d like more information on data protection, we have a lot of ways to learn more:
- Watch the “Implementing Data Masking for NIST Compliance” webinar with Chris Unwin and me (on-demand)
- Read about “Notifiable Data Breaches — and how to avoid them“
- Check out more of our articles on Audit and Compliance