What is a Data Catalog?
A data catalog allows an organization to discover and record the facts about its data, where that data is held and how it is used. William Brewer explains the details.
“The two biggest challenges in data management are centered around data catalogs: finding and identifying data that delivers value, and supporting data governance and data security.” - Gartner Data Management Strategy Survey 2017
…The knowledge about which data exists and its ownership is not always obvious. As a result, data is hardly used beyond its original context, and many opportunities to create value from data remain unused. – Deloitte: Cataloguing Data 2018
A data catalog is a comprehensive inventory of all the data assets that are held by an organization. It is maintained through the discovery, description, and organization of distributed datasets. It is an important part of data governance because you need to know what data is being used, where and how, before you can ensure that it is all handled properly. A data catalog is required to check whether any of the data that is being held requires special precautions, and to determine all the places where the data is held. This knowledge is also a precondition for introducing the Master Data Management discipline throughout an organization.
All organizations must comply with legislation that requires that personal and sensitive data is held, managed and handled in a responsible way, and that it is retained for the correct period of time. Any organization is obliged to be able to demonstrate that it knows where and how data is held, and for how long. A data catalog also provides some of the raw evidence for any forensics after a data breach, and therefore its data must be held in a manner that is auditable and can be used as evidence.
A data catalog is also useful for anyone within the organization tasked with reporting business information about the organization and its activities, or with data management, searching, data inventory, and data evaluation. This will include data stewards, data/business analysts, data engineers and data scientists. A data catalog supports these business functions by providing the context required to find and understand relevant datasets, for the purpose of extracting business value from them.
The maintenance of data catalogs and data models has for several decades been part of the responsibilities of IT departments within commercial organizations. Even where a data model and data catalog are maintained as part of the IT function, there are good reasons, such as mergers, acquisitions, and reorganizations, that call for them to be redone or updated. The benefits of doing so are wide-ranging. It allows information analysts to search through all of an organization’s available data assets to locate the most appropriate data for their analytical or business purposes. It prevents the duplication of data collection by several departments or activities within the organization, and it prevents the errors and misunderstandings that arise when data is misinterpreted. It provides an inventory of distributed data assets, helps in managing data sprawl and provides mapping information for supply chains.
From data model to data catalog
A data model provides a high-level view of the way that data entities relate, and documents how and why they are used within an organization. It will be concerned with the flow of information and with the processing transformations that take place at various points. A data catalog, in contrast, keeps low-level information about the data: what it represents, its datatypes, data constraints, realistic ranges, and the units of the values that are stored. In the more structured data stores, this data about data is referred to as ‘Metadata’. However, a data catalog will also concern itself with data quality, location, and compliance. A catalog will, for example, use the information from the model to establish where data is held and whether its lifecycle complies with the current legislation on data retention. See also: The Data Catalog comes of Age
For a data model to be effective, it needs to know what data attributes are held by the data entities. A mobile phone number or a health record is an attribute that might be held for a customer, contact or employee entity. Each organization is unique in the way that it does this. The model also needs to know who, or more likely what roles, within the organization can access any attribute. It needs to know what data type is used to hold every attribute and what units of measurement are used for quantities. It needs to have visibility of this ‘metadata’ however the data is held, be it in spreadsheets, databases, or documents.
Having established an up-to-date, organization-wide catalog of the data, the organization can then combine this information with the data model to determine all the places where the data is held, which of them are copies or cached information, and where the master copy is located. For every location of the data, the catalog needs to record access controls and security. It is no good having data held encrypted in a secure data center if copies are also held in office-based directories in spreadsheets. The only valid statement about data security is one that describes the precautions given to the least secure copy of the data: the weakest link.
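The 'weakest link' rule can be illustrated with a minimal Python sketch. The protection levels, class names and sample locations below are all hypothetical and chosen only for illustration; a real catalog would record far richer security detail for each location.

```python
from dataclasses import dataclass
from enum import IntEnum


class Protection(IntEnum):
    """Hypothetical protection levels; a higher value means better protected."""
    NONE = 0
    ACCESS_CONTROLLED = 1
    ENCRYPTED = 2


@dataclass(frozen=True)
class DataLocation:
    """One recorded location of a data asset (illustrative fields only)."""
    name: str
    protection: Protection
    is_master: bool = False


def weakest_link(locations):
    """Return the least protected copy, which sets the effective security."""
    if not locations:
        raise ValueError("no recorded locations for this data asset")
    return min(locations, key=lambda loc: loc.protection)


# Invented example: an encrypted master copy plus an unprotected office copy.
customer_phone_locations = [
    DataLocation("secure data center database", Protection.ENCRYPTED, is_master=True),
    DataLocation("office share spreadsheet", Protection.NONE),
]

weakest = weakest_link(customer_phone_locations)
print(f"Effective protection: {weakest.protection.name} ({weakest.name})")
```

However well the master copy is protected, the function reports the spreadsheet copy, because that is the copy an attacker would choose.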
The functions of a data catalog
There are several reasons for identifying where data is held. Likewise, there are several reasons for being able to query the system to discover what is held in the data catalog. The catalog should provide a clear and comprehensive overview of all data assets, document the data quality, and allow users to contribute information about the data. It should make it easy for the organization to improve its use of data, and help to ensure that data is handled and processed in compliance with any relevant legislation.
Without dwelling on the obvious, it should be easy to provide information about the data assets of an organization to data analysts, and to provide information about the type and location of any sensitive and personal data, to auditors. It should be possible to extract sample data and summary statistics.
Determining the ownership of data
It must be possible to determine the ownership and source of every data asset. It helps to be able to indicate how its ‘owner’ is able to correct errors, and to record any relevant properties of the data that would affect the way it is used, such as its rate of obsolescence. There should be enough information in the catalog to be able to quickly appraise the lineage (where it comes from and how it is processed or changed within the organization) and impact (how it is used and its value to the organization). Ideally, there will be enough information to assess data quality.
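As a rough sketch of how lineage and impact could be derived from catalog records, the Python example below treats the catalog as a set of 'feeds' relationships between assets. The asset names and the `FEEDS` mapping are invented for illustration, and a breadth-first walk is just one reasonable way to follow the relationships.

```python
from collections import deque

# Hypothetical 'feeds' edges: each key is a data asset, and each value lists
# the assets that are derived from it.
FEEDS = {
    "web order form": ["orders database"],
    "orders database": ["sales warehouse"],
    "sales warehouse": ["monthly revenue report", "demand forecast model"],
}


def reachable(start, edges):
    """Breadth-first walk returning every asset reachable from start."""
    seen = set()
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in edges.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def invert(edges):
    """Reverse every edge so that the walk runs upstream instead of downstream."""
    inverted = {}
    for source, targets in edges.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def impact(asset):
    """Everything downstream: the assets affected if this one changes."""
    return reachable(asset, FEEDS)


def lineage(asset):
    """Everything upstream: the assets this one was derived from."""
    return reachable(asset, invert(FEEDS))


print(sorted(lineage("sales warehouse")))
print(sorted(impact("orders database")))
```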
Providing information about the data
The catalog must record datatypes and units of measurement. A machine-stored integer value, for example, is useless unless the unit of measurement is known and used consistently by every ‘customer’ of the data.
The catalog must also describe the data, in the exact business terms used by the organization. Recording a machine-oriented storage-type such as ‘Datetime’, for example, is insufficient. It must also be recorded as ‘retention date’, ‘modification date’, ‘purchase date and time’, or however it is described in the business process using it.
The categories of data sensitivity, as well as any other ways that data is categorized by the organization, such as ‘retention type’, need to be present, so that it is possible to get lists for any combination or filter of categories.
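The requirements in this section can be pictured as a single catalog record per attribute. The following Python sketch is only an illustration: the `CatalogEntry` fields, the category names and the sample entries are assumptions, not the schema of any particular catalog product.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CatalogEntry:
    """Hypothetical catalog record for one data attribute."""
    location: str                # where the attribute is stored
    storage_type: str            # machine-oriented type, such as 'datetime'
    business_term: str           # the name that the business process uses
    unit: Optional[str] = None   # unit of measurement, for quantities
    categories: frozenset = frozenset()  # sensitivity, retention type and so on


def with_categories(entries, required):
    """Return the entries that carry every one of the required categories."""
    wanted = frozenset(required)
    return [entry for entry in entries if wanted <= entry.categories]


# Invented sample entries.
entries = [
    CatalogEntry("crm.customer.mobile_phone", "varchar(20)", "customer mobile number",
                 categories=frozenset({"personal"})),
    CatalogEntry("hr.employee.salary", "integer", "annual salary",
                 unit="whole US dollars",
                 categories=frozenset({"personal", "sensitive"})),
    CatalogEntry("sales.order.placed_at", "datetime", "purchase date and time"),
]

for entry in with_categories(entries, {"personal", "sensitive"}):
    print(entry.location, "->", entry.business_term, f"({entry.unit})")
```

Storing the categories as a set means that any combination of them can be requested with the same subset test.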
Recording data location
It is easy to forget the very nature of curation and miss an obvious exploit. One famously secure datastore was breached because operations staff kept passwords pinned on a noticeboard within a secure data center. That was OK until they did a video interview for a news station. Similarly, records of prosecutions include data loss through staff absent-mindedly leaving laptops on trains, or abandoning filing cabinets full of sensitive records in basements.
The first port of call for anyone who is creating or updating a data catalog is usually a database. This is easily ascertained for SQL Server databases by using SQL Data Catalog. The advantage of a tool becomes obvious when the catalog must be kept updated and it is very handy for bought-in third-party databases.
However, in most organizations the database is probably their most secure data location. There is also structured (tabular) data to check, usually but not always in spreadsheets, as well as unstructured data such as documents, web pages, email, social media content, data feeds, mobile data, images, audio, and video. Data can be in reports and query results, data visualizations and dashboards or even machine learning models. If data is processed outside the organization, this requires separate reporting, and may involve other methods such as questionnaires.
Collaboration and governance
The data catalog should encourage multi-disciplinary teamwork and support the organization’s governance team. It may be that, when the organization’s data is catalogued, it is found to be perfectly conceived, with no confusion or redundancies. Real-life experience tells a different story, as when the acceleration figures for cars from two different brands of a single motor manufacturer were discovered to be in different units, explaining a long-standing misunderstanding about their relative performance.
A thorough data cataloguing exercise will come up with many oddities that will require collaboration, and probably some re-engineering of IT systems. It is common to find examples where two or more teams are expensively acquiring and preparing the same data. Even more worrying are the security vulnerabilities that will be uncovered, and which will require investigation by the governance team. These could include database roles being given inappropriate access rights, such as the receptionist being able to view the salaries of all the employees (true-life example), data being held in a location where the entire server can be removed by a passer-by, or filing cabinets holding aged data being stored in an unlocked basement. The Data Catalog holds the bare facts that several teams will need to work with, so it must make collaboration as easy as possible.
What information should be held?
It must be easy to find, harvest, and organize all the ‘data about data’ (metadata) associated with any data asset in the organization. It should also make it possible for data, security and compliance experts to curate and improve the scope of that metadata with tags, associations, ratings, annotations, and any other information and context that helps users find data faster and use it with confidence. This metadata comes in five flavors.
Database or document metadata
This is the structural metadata describing how the data is organized, indexed, and stored. It holds the structure of the data objects, such as tables or structured documents. It is concerned with columns, rows and indexes in relational databases, or the documents within a collection and their schema-validation in structured documents such as XML or JSON. Data may be in spreadsheets or more simply in dated text files or delimited lists such as CSV.
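To make the idea of harvesting structural metadata concrete, here is a small Python sketch that reads table, column and index information from a SQLite database using the standard `sqlite3` module. SQLite is used only because it needs no server; the `customer` table is invented for the example, and other database systems expose the same kind of information through their own system catalogs.

```python
import sqlite3


def harvest_structure(conn):
    """Collect table, column and index metadata from a SQLite connection."""
    catalog = {}
    tables = conn.execute(
        "SELECT name FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'"
    ).fetchall()
    for (table,) in tables:
        # Quote the identifier, because PRAGMA arguments cannot be bound.
        quoted = '"' + table.replace('"', '""') + '"'
        columns = [
            {
                "name": name,
                "type": col_type,
                "nullable": not notnull,  # as reported by SQLite
                "primary_key": bool(pk),
            }
            for _cid, name, col_type, notnull, _default, pk
            in conn.execute(f"PRAGMA table_info({quoted})")
        ]
        indexes = [row[1] for row in conn.execute(f"PRAGMA index_list({quoted})")]
        catalog[table] = {"columns": columns, "indexes": indexes}
    return catalog


# Invented example schema, held in memory.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id  INTEGER PRIMARY KEY,
        mobile_phone TEXT NOT NULL
    );
    CREATE INDEX ix_customer_phone ON customer (mobile_phone);
""")

structure = harvest_structure(conn)
for table, details in structure.items():
    print(table, [col["name"] for col in details["columns"]], details["indexes"])
```

This captures only the structure. The business meaning, sensitivity and ownership of each column still have to be added by the people who know the data.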
Control metadata
Control metadata describes the circumstances of the creation of the data asset, and records when, how, and by whom it has been, and can be, accessed, used, updated, or changed. It should describe who has permission to access and use the data, and how changes to the data are audited and encrypted. It should be recorded in sufficient detail to be able to provide clear evidence of what roles have access to data at all parts of the data’s lifecycle, within the organization.
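Once access permissions are recorded as control metadata, a simple comparison against an agreed policy will surface inappropriate grants. The sketch below is hypothetical throughout: the role names, attribute names and the two dictionaries stand in for whatever the catalog and the governance team actually hold.

```python
# Hypothetical control metadata: the roles that can read each attribute.
READ_GRANTS = {
    "hr.employee.salary": {"hr_manager", "payroll", "receptionist"},
    "crm.customer.mobile_phone": {"sales", "support"},
}

# Hypothetical policy agreed with the governance team for sensitive attributes.
APPROVED_READERS = {
    "hr.employee.salary": {"hr_manager", "payroll"},
}


def unapproved_access(grants, approved):
    """List the roles that can read an attribute without being approved for it."""
    findings = {}
    for attribute, allowed in approved.items():
        extra = grants.get(attribute, set()) - allowed
        if extra:
            findings[attribute] = sorted(extra)
    return findings


findings = unapproved_access(READ_GRANTS, APPROVED_READERS)
print(findings)
```

Attributes that have no entry in the policy are ignored here. Whether they should instead be flagged for review is a decision for the governance team.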
Process metadata
Process metadata records the data’s lineage and history, so that a judgement can be made on whether it is recent enough, reliable enough, and trustworthy. It can also be used to troubleshoot queries. Increasingly, process metadata is also mined for information about how the data is used, such as the software used to access the data, and the level of service that users experienced.
Business metadata
Business metadata, sometimes referred to as external metadata, describes the data in terms that the organization uses. It describes the usage and value it has to the organization, the legal constraints on it such as regulatory compliance, or retention times, its fitness for purpose, information about, and more.
Accessibility metadata
Accessibility metadata is worth including because of legislation in place, particularly in Europe, that requires that data can be made available easily for people with disabilities. It is sometimes argued that this is an application issue, but a data catalog can assist in checking compliance if accessibility measures are recorded in the catalog along with the data.
It is possible to ‘big-up’ the role of a data catalog until it grows, in the imagination, into a general-purpose data governance tool. All it should do is to discover and record the facts about how the data is held and used within an organization. It records the current reality, not the aspiration.
As well as providing data discovery, it must allow organizations to improve on the way that data is catalogued in line with changes in the activities of the organization using the data. Different organizations will have different ways of dealing with the information that a data catalog provides, so it is far more important to provide the facility to integrate with a range of tools that facilitate group processes than to attempt to provide a magical workflow system that is universally appropriate to any type of organization. A Data Catalog will need to interface with a variety of data governance tools, so connecting to them must be easy.