26 July 2017

2 Comments

26 July 2017

2 Comments

Removing sensitive data from a database

Guest post

This is a guest post from Coeo. Europe's most trusted analytics and data management expert, Coeo is the number one provider of database strategy in the Retail, Financial services and Gaming industries, and delivers technology strategy and support for businesses who need to get the most from their data.

The Coeo team hold more Microsoft certifications than any other data platform specialist in Europe and are passionate about sharing their knowledge and expertise to help customers become industry leaders.

Not too long ago I was assisting a client with an issue they were experiencing while running some compliance software to check for sensitive data. Sensitive data was picked up in the SQL Server data file but when the DBAs checked, the data was nowhere to be found in the database.

It’s fairly widely known that just because you delete data from a database, it doesn’t mean it’s gone from the data file itself, but the question was, how should they go about removing it? They had already tried shrinking the file and executing sp_clean_db_free_space but it had made no difference.

Deleting rows vs dropping columns

The procedure sp_clean_db_free_space removes ghost records; rows that have been marked as deleted but not physically removed. This process runs periodically anyway but executing this stored procedure will manually set off the clean-up. You can check how many ghost records you have by running the query below. If it’s high, be careful as there was a bug in earlier versions of SQL Server that meant the automatic clean-up was not working.

In this case, however, the ghost data was low (less than 100) and the amount of sensitive data was large compared to it. It transpired that the data was in a column that was dropped, and data removed by dropping a column behaves very differently than when a row is deleted. Because the data in a table is stored in rows, marking a whole row for removal is much easier than marking a column in every row. When you drop columns it can be quick, almost instant, because it is only marked for removal at the meta data level. As it’s done in this way, the data isn’t marked as a ghost record.

The options

If you have the Enterprise Edition of SQL Server, you can enable (and then disable if you wish) TDE, which is a good option to wipe the data out. Enabling TDE causes all the pages to be read and encrypted, which won’t account for data outside of this file but will do a pretty good job as far as the database is concerned.

If you don’t have Enterprise Edition, there are other options. The first would be to select all the data into a new table and then drop the old one. The new table could then be renamed back to the original one. This will again leave ghost records but these will be cleaned out.

A slightly less inconvenient option would be to rebuild the clustered index before running sp_clean_db_free_space. This appears to remove the data in most situations and also applies to rows that have been deleted that appeared in a non-clustered index. During the rebuild, all the pages are read/written and there is plenty of data movement, so it makes sense this would work. In the case of this particular client, the column had been dropped from a heap table so a clustered index had to be added and then dropped, but with the same outcome. Unfortunately, if you don’t know the original table this would mean rebuilding all of them before running the clean (unless you’re a whiz at reading MDFs).

While this method did work in this particular case, it’s not guaranteed and there is always a remote possibility a page is not moved or rewritten and the data stays in place. It does however highlight the way the internals are handled when dropping data. This also does not cover the rest of the disk the database sits on; if you were to drop the database or remove a file, there may be residual data here too and the only way to do it would be to scrub the device securely.

Example

Below is the script if you’d like to try these out for yourself. To look for the data in the mdf you need to take the database offline then copy the file somewhere to open it. This example was run in SQL Server 2016 but it should work in earlier versions too.

 

Coeo is a Microsoft Gold Partner for Data Platform, Data Analytics and Cloud Platform.
We work collaboratively to improve project outcomes and develop deeper knowledge
and appreciation of Microsoft technologies, improving our customers’ experience.
Find out more about us.

Guest post

This is a guest post from Coeo. Europe's most trusted analytics and data management expert, Coeo is the number one provider of database strategy in the Retail, Financial services and Gaming industries, and delivers technology strategy and support for businesses who need to get the most from their data.

The Coeo team hold more Microsoft certifications than any other data platform specialist in Europe and are passionate about sharing their knowledge and expertise to help customers become industry leaders.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

Related posts

Also in Blog

Down Tools Week 2017

Once a year, we hold Down Tools Week at Redgate. It’s our version of a hack week – a chance to put down the day job and spend a week working on something completely different. It’s an opportuni...

Also in Software development

How might classification and better documentation improve data safety?

In this post, we imagine how auto-classification of data can be used to build better documentation that helps you trust that your organization can use data without posing a risk or compromising regu...

  • –Jeff Moden

    You’re a Microsoft “Gold Partner”. With that, consider setting a good example when it comes to code. For example, your team should learn things like how to determine if a database already exists so that you can conditionally drop it to rebuild it, learn how to use EXEC instead of GO when it comes to executing stored procedures, stop using GO as if it were the period at the end of a sentence, understand the importance of using 2 part naming conventions, and why you should use semi-colons in all the right places at all times, to name a few of the more important ones.

  • Perry Whittle

    Why are adding a clustered index and then dropping it?
    From 2008 onwards sql server applies the rebuild clause to heap tables too