Data cataloging: A giraffe’s eye view

In the latest DBAle podcast episode, our hosts, Chris and Chris, tackled what they really mean by cataloging a database and why taking a ‘giraffe’s eye view’ approach to compliance is not enough.

That’s because the most common data concerns for managers and execs center around where your sensitive data is, the risk it poses and, if you’re migrating to the cloud, how much sensitive data is being stored on that database or that instance.

To answer these concerns, a lot of people take a giraffe’s eye view where you’re way, way, way up, and you’re looking down and the only thing you see is grass. And yes, that is all grass down there, but the problem is you don’t see the gaps.

Finding your data citizenship, as one of the major data classification tools calls it, is key. It’s about being able to respond to the question ‘Where is your data?’ by knowing everything about your data landscape. That’s not an easy thing to do because there is so much to find – everything from Excel spreadsheets through to emails and text files. Even that one time you took a screenshot and dumped it into an email to yourself.

There are a billion ways that these things can go wrong, so you need to have the right way of doing it. A common pitfall is that a lot of people worry about the stuff that is easily missed, like spreadsheets and documents, but rarely pay attention to the thing that’s staring them in the face.

Let’s talk about your databases

Back in the days before computers, you’d receive a document or letter, you’d classify it, and then store it in a filing cabinet as a physical record. In the digital world, data that comes in is stored in a database instead, with what used to be a record now appearing as a row in that database.

Yet a lot of the technologies for structured content still have a records management view of the world whereby you take everything, you classify it, and you keep it as a record, all of which defines how that record is kept. For databases, that just doesn’t work. You can’t take a row in a database as a record and maintain it in the same way as in the past, because you suddenly find that you’ve got a million records just from one database.

Rather than figuring out row by row which records are sensitive, it makes sense instead to look at it from a column point of view. Each column, like surname, zip code, etc., will contain the same type of data, so it’s fair to say that a date of birth column is going to be classified and tagged as sensitive, whichever record it belongs to.

Cataloging a database is specifically about tagging columns, which is far easier than trying to go row by row. However, there are still a lot of columns across tables, across databases, and across instances. It’s important to do this process in full because once you have all the columns classified and tagged, you can look at a database in a single glance and know approximately what percentage of the columns in the database are sensitive.

As you can see, what you need is both the giraffe’s eye view to help manage your wider estate and visibility into the database itself. And that’s about classifying and tagging columns, and then ensuring you keep it up to date. The good thing is that when it fits into organizational compliance processes, things look very different and people stop fearing compliance.

If a developer makes a change to a table in the development environment, there should be a process as part of the development work that allows them to easily flag that this new column will store sensitive information. If you have a side administrative process like updating an Excel spreadsheet, people forget or promise they’ll do it another time and it’s going to fall apart. But if your deployment of that work hinges on you completing the necessary processes including classifying any new columns, you hit a point that we call ongoing organizational compliance.
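One way to picture a deployment that hinges on classification is a small check that runs before release and fails if any column has no classification. The sketch below is an assumption-laden illustration: the `table.column` naming, the label values and both function names are hypothetical, and in a real pipeline the inputs would come from your schema and your catalog rather than hard-coded values.

```python
def unclassified_columns(schema_columns, classifications):
    """Return the columns that exist in the schema but have no classification."""
    return sorted(col for col in schema_columns if col not in classifications)


def deployment_gate(schema_columns, classifications):
    """Raise if any column is unclassified; otherwise allow the deployment."""
    missing = unclassified_columns(schema_columns, classifications)
    if missing:
        raise RuntimeError("Unclassified columns: " + ", ".join(missing))
    return True


# Hypothetical example: a developer has just added Customers.DateOfBirth.
schema = {"Customers.Surname", "Customers.DateOfBirth", "Orders.OrderTotal"}
labels = {
    "Customers.Surname": "sensitive",
    "Orders.OrderTotal": "not sensitive",
}

try:
    deployment_gate(schema, labels)
except RuntimeError as error:
    print(f"Deployment blocked. {error}")
```

Because the check sits in the deployment path rather than in a side spreadsheet, the new column cannot ship until someone has classified it.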

Redgate’s SQL Data Catalog enables you to achieve that kind of ongoing compliance. It’s a regulation-agnostic approach for accelerating the identification and classification of sensitive data that you can build into your development processes.

Until now, it’s been missing the ability to tell what kind of data is in a column by looking at the data itself, rather than just relying on the column name. Having the ability to natively examine data and predict that a column contains dates of birth, for example, would be a great piece of functionality. This would reduce the reliance on column names alone, which can sometimes have titles relating to how the database schema was developed, like ‘NewColumn36’, and ‘Feature4d’.

But data scanning is a very difficult problem to solve and throws up a lot of questions. Who do you surface this information to? How do you get people to confirm if data is sensitive? What algorithms do you use? How do you make sure people know their data is safe? Do you enable it by default?

To help out, the development team behind SQL Data Catalog has created a data scanning capability for the tool. When connected to a database, it will look at the data stored within your tables and try to match it up with a subset of different data types. It will then go ahead and predict the data each column contains. And it does all these predictions natively on your infrastructure, without storing any of the data used to make the predictions so that you know your data is safe.

This is a huge piece of functionality to get at this stage, and it paves the way for more configurability. This massive step forward will help people not only classify and catalog their data more quickly and easily, but also stay classified through ongoing organizational compliance.
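To give a feel for what predicting a column’s contents from the data itself can involve, here is a deliberately simple Python sketch that samples values and matches them against a few regular expressions. It is not the algorithm SQL Data Catalog uses, which isn’t described here; the patterns, the 80% threshold and the `predict_column_type` function are all assumptions for illustration.

```python
import re

# Hypothetical patterns for a handful of data types. Real scanners need far
# more robust rules (date formats alone vary widely).
PATTERNS = {
    "email": re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+"),
    "date": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "us_zip_code": re.compile(r"\d{5}(-\d{4})?"),
}


def predict_column_type(values, threshold=0.8):
    """Guess a column's data type from sampled values.

    Returns the label whose pattern matches the largest share of non-empty
    values, or None if no pattern reaches the threshold.
    """
    samples = [str(v).strip() for v in values if v is not None and str(v).strip()]
    if not samples:
        return None
    best_label, best_ratio = None, 0.0
    for label, pattern in PATTERNS.items():
        matches = sum(1 for s in samples if pattern.fullmatch(s))
        ratio = matches / len(samples)
        if ratio > best_ratio:
            best_label, best_ratio = label, ratio
    return best_label if best_ratio >= threshold else None


# A column with an unhelpful name like 'NewColumn36' can still be recognized.
new_column_36 = ["1984-03-12", "1990-11-02", None, "2001-07-30"]
print(predict_column_type(new_column_36))
```

A prediction like this is a suggestion, not a classification. A pattern match alone can’t tell a date of birth from an order date, so someone still has to confirm what the column holds before it is tagged.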

If you’d like a more in-depth look at data scanning using SQL Data Catalog, check out our article, Using data scanning to identify and classify sensitive columns.

If you’re new to SQL Data Catalog and would like to see how it accelerates the data classification process, you can also find out more and download a fully functional 28-day free trial.

 
