Building a SQL Server data dictionary

In this article, Edward Pollack explains the benefits of a SQL Server data dictionary and how to build one.

A data dictionary is a documentation tool that provides metadata, metrics, or details about a database and the data within it. Documenting data is critical to maintaining, sharing, and using it, making a data dictionary a tool that can save time, improve code quality, and improve communication.

In this article, we will explore the process of designing, architecting, and implementing a data dictionary. But wait, this is only the beginning! A good data dictionary does not need to be entirely manually maintained. Much of the metadata collection, the discovery of new data elements, and reporting/alerting on unusual changes can be built into a data dictionary. Automation allows a majority of the documentation to maintain itself, making it more accurate, less dependent on manual updates, and easier to use.

What is a data dictionary?

A data dictionary can be seen as a repository for information about a database. There are no industry standards that go into a data dictionary. It may be as simple as a list of tables with basic descriptions. Alternatively, it can be an extensive list of properties outlining precisely how data is structured, maintained, and used.

Software applications can be purchased and used to maintain a data dictionary. These applications can inspect a database and create metadata lists using internal database properties or their own proprietary data structures. Third-party applications are convenient and easy to use but may be limited in scope and are typically not free; however, they may have features that would be difficult or impossible for an individual to create.

For this discussion, a data dictionary will be built from the ground up using T-SQL. While this work will be done in SQL Server, a similar process could be written in other database systems to perform the same tasks. The benefits of building a data dictionary from scratch are:

  • No monetary costs/licensing/contracts
  • No need to share data with third-party organizations (if applicable)
  • Can customize as much (or as little) as wanted
  • Can use this data in other applications via APIs or reporting processes

No process is without downsides. Building a data dictionary from scratch (like anything homegrown) entails:

  • Time
  • Resources
  • Maintenance

The larger the project, the more work this will be, and it is quite likely that an exceptionally complex documentation project would benefit from a careful analysis of different approaches to ensure the best one is chosen. Conversely, if the documentation need is modest, then build-your-own will tend to be the most efficient solution.

Generally speaking, these costs will be lower than the time needed to maintain documentation via less organized means. A Word doc or Google Sheet does not represent adequate data documentation. Similarly, having no documentation is the worst-case scenario, and any organized data dictionary would be beneficial when compared to having nothing.

An additional note on lock-in: If a homegrown data dictionary outgrows its environment, then it is always possible to migrate data into another solution (homegrown or 3rd party).

Why a data dictionary? Why not system views?

All database systems include a variety of internal tables and views that allow developers to view object names, data types, and other built-in properties. These views are exceptionally useful for operations and the automation of administrative functions (such as index maintenance, backups, ETL, etc.).

The following query retrieves all columns and some related metadata within a database using system views. You can download all code for this article here.
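
The full query is included in the downloadable scripts; a representative sketch that joins sys.tables, sys.schemas, sys.columns, and sys.types for the current database looks like this:

```sql
SELECT
    DB_NAME() AS database_name,
    schemas.name AS schema_name,
    tables.name AS table_name,
    columns.name AS column_name,
    types.name AS data_type,
    columns.max_length,
    columns.precision,
    columns.scale,
    columns.is_nullable
FROM sys.tables
INNER JOIN sys.schemas
ON schemas.schema_id = tables.schema_id
INNER JOIN sys.columns
ON columns.object_id = tables.object_id
INNER JOIN sys.types
ON types.user_type_id = columns.user_type_id
ORDER BY schemas.name, tables.name, columns.column_id;
```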

The results are as follows:

These results are easy for an administrator or developer to consume but have limited use beyond that scope. In addition, there is no human-readable documentation included. Any data dictionary needs to be easily consumable by people in different roles within an organization.

How about extended properties?

Extended properties can provide a way to add metadata inline to objects, like this:
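
For example, the built-in sp_addextendedproperty procedure attaches a name/value pair to an object (the schema, table, and description text below are placeholder values):

```sql
-- Add a human-readable description to a table
-- (example object names; substitute your own schema and table).
EXEC sys.sp_addextendedproperty
    @name = N'Description',
    @value = N'Contains one row per product that the company stocks or sells.',
    @level0type = N'SCHEMA', @level0name = N'Production',
    @level1type = N'TABLE',  @level1name = N'Product';
```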

Once added, they can be viewed in the SQL Server Management Studio GUI:

Similarly, they can be queried via T-SQL:
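
A sketch of such a query against sys.extended_properties (the column aliases are illustrative):

```sql
SELECT
    OBJECT_SCHEMA_NAME(extended_properties.major_id) AS schema_name,
    OBJECT_NAME(extended_properties.major_id) AS object_name,
    extended_properties.name AS property_name,
    extended_properties.value AS property_value
FROM sys.extended_properties
WHERE extended_properties.class_desc = 'OBJECT_OR_COLUMN';
```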

The results of that query are as follows:

Extended properties are a convenient, built-in way to add descriptions to objects. Their documentation ability is limited, though, as you are restricted to a name and value. Using them in practice is similar to managing system views and requires database access or a report that accesses the objects for you.

For very simple and limited details, extended properties provide an easy and effective solution. For anything more extensive, though, they are inadequate and do not constitute the level of depth required of documentation for most organizations.

Building a data dictionary

Building a customized data dictionary is labor-intensive, but the most time-consuming step is not designing, architecting, or implementing code; it is collecting the details needed to give business meaning to this metadata.

The first step in this process is to build a table to store data. For this work, the result will be simplified to a single table and not normalized. The reasons for this are to keep the schema simple and easy to consume and to acknowledge that the table will not be large, and therefore complex dimensions are unlikely to be valuable.

What goes into this table? Anything really. If what is needed is not here, add it. If some columns are unnecessary, then remove them:
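
The definition below is one possible version of the table, using the columns described afterward; the data types and lengths are assumptions and can be adjusted as needed:

```sql
CREATE TABLE dbo.data_element_detail
(   data_element_id INT IDENTITY(1,1) NOT NULL
        CONSTRAINT PK_data_element_detail PRIMARY KEY CLUSTERED,
    data_element_name VARCHAR(128) NOT NULL,
    data_element_type VARCHAR(25) NOT NULL,
    data_type VARCHAR(128) NOT NULL,
    data_purpose VARCHAR(MAX) NOT NULL,
    entity_purpose VARCHAR(MAX) NOT NULL,
    database_server VARCHAR(128) NOT NULL,
    database_name VARCHAR(128) NOT NULL,
    schema_name VARCHAR(128) NOT NULL,
    table_name VARCHAR(128) NOT NULL,
    column_name VARCHAR(128) NOT NULL,
    original_data_source VARCHAR(MAX) NOT NULL,
    notes VARCHAR(MAX) NOT NULL,
    foreign_key_to VARCHAR(MAX) NOT NULL,
    row_count BIGINT NOT NULL,
    data_start_time DATETIME NULL,
    data_end_time DATETIME NULL,
    create_time DATETIME NOT NULL,
    last_update_time DATETIME NOT NULL,
    is_deleted BIT NOT NULL,
    is_nullable BIT NOT NULL
);
```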

Each column provides a basic detail about a column that lives in a database somewhere:

data_element_id: A numeric primary key for the table. It can be used to identify a row. The natural key of database_server, database_name, schema_name, table_name, and column_name can also be used as a primary or alternate key.

data_element_name: An informal name for the column. For database objects that are already well named, this may require little effort, but for any that are poorly named, a familiar/casual name can provide great value to anyone using this data.

data_element_type: For OLAP data, there is value in documenting whether a column represents a fact or a dimension. This detail may or may not be necessary, depending on the data that is being documented.

data_type: The data type as defined in SQL Server.

data_purpose: The overall purpose of the application that consumes this data. This is something generic to allow a user to understand the over-arching reason that this data exists. This will typically be the same for many tables (or an entire database) that share a common use-case.

entity_purpose: The purpose for this specific table. This will usually be the same for all columns in the table.

database_server, database_name, schema_name, table_name, column_name: These are all taken straight from SQL Server and together describe the unique location and name for the data element.

original_data_source: This is where you can document a data element’s origin if it is not the same as its current location. This column is useful for data that is copied, moved, or archived via ETL or reporting/analytics processes, as it records the original source before the data was moved.

notes: A free-form text field for any details about the column, its usage, and its meaning. While anything can go here, it’s worthwhile to maintain a consistent strategy for adding notes so that users of this data know what to expect as they are reading this.

foreign_key_to: If the column references a column in another table, it can be documented here. If a database is relational, then this column can be populated from sys.foreign_keys and sys.foreign_key_columns. If the database is not relational, this will involve some manual or semi-autonomous work to maintain.

row_count: The row count of the table. This value can be collected quickly and easily from sys.dm_db_partition_stats and allows the consumer of this data to understand its size, as a table with 1,000 rows will be used differently than a table with 1 billion rows.

data_start_time and data_end_time: If the data has a defined start/end time, it can be provided here. A column for the data retention policy could similarly replace this and provide functionality for columns that undergo regular cleanup or archiving. Essentially, these columns (or something similar) provide detail on when data began to be populated and when (or if) it ends. For a table where columns are added over time but not backfilled, this can be exceptionally useful.

create_time and last_update_time: This documents when a column was added to the data dictionary and when it was last updated.

is_deleted: If data element details will be retained for posterity for deleted columns, then flagging them as deleted is necessary. Additional information could be provided as well, such as the time of deletion and reason.

is_nullable: If the data element is a NULLable column, then this will be set to 1.

Once a table is defined, metadata can be inserted. In its most basic usage, data can be inserted manually. This effort would make the table about as efficient as a fully-manual wiki, document, or version-controlled source-code system. For testing, populating data like this is adequate, but long-term, automation is needed.

A fully manual insert would look like this:
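
A sketch of such an insert, using illustrative values for a single AdventureWorks column, might look like this:

```sql
INSERT INTO dbo.data_element_detail
    (data_element_name, data_element_type, data_type, data_purpose, entity_purpose,
     database_server, database_name, schema_name, table_name, column_name,
     original_data_source, notes, foreign_key_to, row_count,
     data_start_time, data_end_time, create_time, last_update_time, is_deleted, is_nullable)
VALUES
    ('Product ID', 'Dimension', 'INT',
     'Product catalog data supporting sales and production applications',
     'One row per product that is stocked or sold',
     'MySQLServer', 'AdventureWorks2017', 'Production', 'Product', 'ProductID',
     '', 'Primary key for the product table.', '', 504,
     NULL, NULL, GETUTCDATE(), GETUTCDATE(), 0, 0);
```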

Effective, but not terribly exciting. Doing this repeatedly for a large database would be torturously slow and error prone. The next step towards making this documentation faster and more reliable is automation.

Automation with a minimalist style

This effort aims to automate as much as possible but not to create a process that is so complex that it requires immense effort to maintain. The primary downside of any do-it-yourself approach is that whatever is created needs to be maintained (potentially forever).

When the DIY solution is 1,000 lines of PowerShell and 5,000 lines of T-SQL, the maintenance becomes very much non-trivial. Therefore, a target of a few hundred lines of T-SQL would be simple enough that maintaining the code, even by someone unfamiliar with it, would not be a complex task.

First, from the column list above, what bits of metadata do not need to be manually entered? These would include any that originate from system views or tables and can be automatically checked into the data dictionary, such as the database server, database name, column name, or row count.

Realistically, the only columns that require manual intervention are those intended to add human-readable detail, such as original data source, data purpose, and notes. Even the data element name can be automated if columns follow a relatively reliable naming convention. Therefore, automation aims to remove as much work as possible and place what work is necessary on a silver platter for review.

Naming data elements

Everyone has slightly different conventions they follow with regard to creating object or element names. There is value in pre-populating a data element name, even if there is a chance it will not be descriptive enough for permanent use. This can reduce future work, especially in databases that already follow naming conventions that are consistent and descriptive.

To accomplish this, a function can be created to take the column name and perform some string manipulation to turn it into a more friendly name. The resulting data element name may not always be the final or best name, but if it is accurate often enough, then it will be a time-saver for anyone entering or maintaining data dictionary data. The following is a simple function that can accomplish this task:
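
The following is a sketch of such a function; the name and the specific abbreviation handling are assumptions and can be tailored to local conventions:

```sql
CREATE FUNCTION dbo.fn_format_data_element_name
    (@column_name VARCHAR(128))
RETURNS VARCHAR(128)
AS
BEGIN
    DECLARE @formatted_name VARCHAR(128);
    DECLARE @position INT = 2;

    -- Replace underscores with spaces and capitalize the first character.
    SET @formatted_name = UPPER(LEFT(REPLACE(@column_name, '_', ' '), 1))
        + SUBSTRING(REPLACE(@column_name, '_', ' '), 2, 128);

    -- Iterate through the string, capitalizing the first letter of each word.
    WHILE @position < LEN(@formatted_name)
    BEGIN
        IF SUBSTRING(@formatted_name, @position, 1) = ' '
            SET @formatted_name = STUFF(@formatted_name, @position + 1, 1,
                UPPER(SUBSTRING(@formatted_name, @position + 1, 1)));
        SET @position = @position + 1;
    END

    -- Capitalize a handful of common abbreviations (extend as needed).
    IF @formatted_name LIKE '% Id' -- "Id" at the end of the name
        SET @formatted_name = LEFT(@formatted_name, LEN(@formatted_name) - 2) + 'ID';
    SET @formatted_name = REPLACE(@formatted_name, ' Id ', ' ID ');
    SET @formatted_name = REPLACE(@formatted_name, ' Sql ', ' SQL ');
    IF @formatted_name LIKE 'Sql %' -- "Sql" at the start of the name
        SET @formatted_name = 'SQL' + SUBSTRING(@formatted_name, 4, 128);

    RETURN @formatted_name;
END
```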

When an input string is passed into the function, underscores are replaced with spaces, each word is capitalized, and some common abbreviations are capitalized. These are a handful of examples of how text can be formatted to meet some of the naming needs that may arise. Feel free to customize and extend this to whatever other needs may arise!

Note that since this function contains iteration, it should be used on small sets of data or metadata where row-by-row calculations of this nature are acceptable. A data dictionary where data elements are named once and never again is a perfect use-case.

The following T-SQL will illustrate a few uses of this function:
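
For example, a few calls with typical snake_case column names:

```sql
SELECT dbo.fn_format_data_element_name('account_id') AS data_element_name;          -- Account ID
SELECT dbo.fn_format_data_element_name('last_update_time') AS data_element_name;    -- Last Update Time
SELECT dbo.fn_format_data_element_name('sql_server_version') AS data_element_name;  -- SQL Server Version
```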

The results show how the column names were formatted to follow a similar naming convention that can form the first step in naming data elements:

Ultimately, this is a process that is destined to be highly customizable. In some cases, it may be unnecessary. Either way, a scalar function is an easy way to encapsulate the logic needed to format names for use in a data dictionary.

Building the data dictionary update process

With some goals and guidelines in mind, the process of building a stored procedure to populate and maintain a data dictionary can begin.

This process has more value if it can poll a set of databases on a server, removing the need to execute it repeatedly for multiple databases. It also removes the need to modify the calling code whenever a new database is added.

To simplify this, a table type will be created to store database names:
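
A minimal version of the type might look like this (the type name is an assumption):

```sql
CREATE TYPE dbo.database_name_list AS TABLE
(   database_name SYSNAME NOT NULL PRIMARY KEY );
```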

The table type allows a set of databases to be passed into a stored procedure, which is programmatically easier to work with than a comma-separated list or some other text field. Whenever the data dictionary load process runs, a set of database names can be passed in that will be used to filter results to only target those databases. This helps filter out system databases, as well as user databases that are not of interest to the data dictionary.

The following summarizes the steps that this process will follow:

  1. Create a temporary table to hold database names, as well as whether they have been processed.
  2. Create a temporary table to hold table row count data.
  3. Create a temporary table to hold all other data element metadata.
  4. Iterate through the database list and:
    1. Collect row count data from sys.dm_db_partition_stats
    2. Collect data element metadata
    3. Flag completed databases so they are only processed once
  5. Merge the results captured above into data_element_detail

The goal is a stored procedure that accomplishes the basics of what we want and sticks to our ideal of simplicity by doing so in under 200 lines of T-SQL. Please see the attached Data Dictionary creation scripts, which include table creation, the function to format entity names, a table type for use as a table-valued parameter, and the stored procedure.

A few COLLATE clauses are included when comparing source/target data. Depending on how a given database environment is collated, they may not be necessary, but leaving them in causes no harm and does not reduce performance.

sys.dm_db_partition_stats is used as the source for row count data. The row counts it provides are approximations based on partition contents, but they are more than accurate enough for data dictionary purposes. The alternative of selecting counts directly from tables would be expensive and slow enough to be unacceptable for most applications.

Note that many data elements are automatically populated with empty strings or assumptions. These are starting points and can be re-evaluated based on the needs of a data dictionary. The goal is to reduce to an absolute minimum the changes that have to be made directly by a human being. The manual intervention will generally be to review/add/edit human-readable descriptions that otherwise cannot be generated easily from metadata.

Executing the stored procedure requires passing in a set of database names:
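
Assuming the stored procedure is named populate_data_dictionary (a placeholder name) and accepts the table type defined earlier, a call could look like this:

```sql
DECLARE @database_list dbo.database_name_list;

INSERT INTO @database_list (database_name)
VALUES ('AdventureWorks2017'), ('WideWorldImporters');

EXEC dbo.populate_data_dictionary @database_list = @database_list;
```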

Despite there potentially being many columns to document in these databases, the stored procedure executes in under a second. A quick check of the row count confirms how many data elements were found:
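
A simple count against the dictionary table is enough for this check:

```sql
SELECT COUNT(*) AS data_element_count
FROM dbo.data_element_detail;
```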

By selecting * from the table, the detailed data can be reviewed:
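
For example:

```sql
SELECT *
FROM dbo.data_element_detail
ORDER BY database_name, schema_name, table_name, column_name;
```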

While there are many blanks to potentially be filled in, this provides a great starting point for documenting data elements. Because the create and last update times are maintained, as well as whether a data element is deleted, it is easy to validate change over time and consider how to handle new additions as they arrive.

Data Element Type is populated based on a simplistic formula derived from the column name. If schemas or other conventions are used to determine fact vs. dimension, then adjust to account for it. This is provided as an example of how to handle this data but can be customized to become as useful as possible.

Ignoring data elements

There are often data elements that are not intended for documentation. They may be temporary, flagged for deletion, or purely internal. Either way, exceptions can easily be added as filters against any of the columns collected above. Most commonly, schema, table, or column names can be filtered to remove anything that should not be documented.

The filters may be managed via an external table that is used to collect metadata about what should not be documented or via filters built into the stored procedure. Which to choose would be based on the number of exceptions to process, how often they change, and who will maintain that list.

A metadata table is easier to maintain for anyone not directly involved in the process, whereas hard-coded exceptions are more permanent and easier to maintain if they are unlikely to ever change.

The manual component

Inevitably, no matter how much effort is put into automation, there will be some amount of data that requires manual data entry or updates. A semi-autonomous data dictionary’s goal is to reduce the volume of manual work as much as possible while making those updates easy and painless.

Of the many fields discussed in this article thus far, some will be seen as required and must be populated for this metadata to be useful. Other data is optional, and some may be unnecessary for some applications.

For example, the foreign_key_to column may not be relevant in tables that are denormalized or that do not use foreign keys. The notes column, on the other hand, is universally useful as an opportunity to put human-readable details into documentation.

There are two options for mass-updating the manual fields:

  1. Update directly in T-SQL
  2. Update via an application, report, or form

Which option is used depends on who is doing the updating and their comfort level with different tools. If a database developer is updating this data, then T-SQL (or a stored procedure or similar) is an acceptable option. If an analyst is maintaining this data, then a form-based app in SSRS, Tableau, Access, or something homegrown is likely a better option. If the person working with this data is non-technical, then an app or web interface would be ideal to avoid confusion as to how to update data.

Identifying data that needs to be updated is not challenging and could be done with a query similar to this:
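
As a sketch, assuming notes, data_purpose, and entity_purpose are the required fields:

```sql
SELECT
    data_element_id,
    database_name,
    schema_name,
    table_name,
    column_name
FROM dbo.data_element_detail
WHERE is_deleted = 0
AND (notes = '' OR notes IS NULL
  OR data_purpose = '' OR data_purpose IS NULL
  OR entity_purpose = '' OR entity_purpose IS NULL);
```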

Nothing fancy here. Simply determine which fields are required and check for any rows that contain blank/NULL where data is expected. The results can be fed into any application where they can be reviewed and updated. The example above is simplistic, but a real-world application of this would not involve much more code.

Frequently repeated data elements

Some applications have rigid naming conventions and therefore have column names that are unique and sacrosanct. In other words, the column account_id in a database like this always means the same thing and, therefore, always has the same definition. If this is the case, then there is an opportunity to simplify the manual update component of a data dictionary. A process to update metadata for columns that match the definitions of other columns that already have their metadata populated could be modelled in a fashion similar to this:

  1. Has a column been given metadata before?
    1. If no, then manually populate it.
    2. If yes, then proceed to step 2.
  2. Is the column able to copy metadata from a similarly named column that already is complete?
    1. If no, then populate it manually.
    2. If yes, then identify this column and automatically copy metadata for it in the future.

Generally, a data element will have a source-of-truth row that is the primary key of a table and other rows that foreign key back to it. If the source-of-truth is populated first by a process, then the same process can subsequently do the same for all rows that reference it.

The process created above did not include the ability to populate foreign key metadata automatically, but this could be done quickly and accurately. The only caveat is that a database needs to use foreign keys to track dependencies. To do this, it is time to upgrade the scripts above to automatically populate the foreign_key_to column with metadata from SQL Server.

First, a new temporary table is created that will store foreign key metadata:
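
One possible definition (the temp table and column names are assumptions; the key detail is capturing both sides of each foreign key along with the column position and total column count):

```sql
CREATE TABLE #foreign_key_detail
(   foreign_key_name SYSNAME NOT NULL,
    referencing_schema_name SYSNAME NOT NULL,
    referencing_table_name SYSNAME NOT NULL,
    referencing_column_name SYSNAME NOT NULL,
    referenced_schema_name SYSNAME NOT NULL,
    referenced_table_name SYSNAME NOT NULL,
    referenced_column_name SYSNAME NOT NULL,
    constraint_column_id INT NOT NULL,
    column_count_in_constraint INT NOT NULL );
```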

Next, within the dynamic SQL after row counts are collected, a new block of T-SQL is added to the stored procedure that collects all foreign key metadata from the current database:
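
In the stored procedure this query is built as dynamic SQL with the target database name prefixed onto each system view; stripped of that wrapper, the core query is roughly this:

```sql
-- Collect every foreign key column pair in the current database.
-- In the procedure, each sys.* view would be prefixed with the target database name.
INSERT INTO #foreign_key_detail
    (foreign_key_name, referencing_schema_name, referencing_table_name, referencing_column_name,
     referenced_schema_name, referenced_table_name, referenced_column_name,
     constraint_column_id, column_count_in_constraint)
SELECT
    foreign_keys.name AS foreign_key_name,
    parent_schemas.name AS referencing_schema_name,
    parent_tables.name AS referencing_table_name,
    parent_columns.name AS referencing_column_name,
    referenced_schemas.name AS referenced_schema_name,
    referenced_tables.name AS referenced_table_name,
    referenced_columns.name AS referenced_column_name,
    foreign_key_columns.constraint_column_id,
    COUNT(*) OVER (PARTITION BY foreign_keys.object_id) AS column_count_in_constraint
FROM sys.foreign_keys
INNER JOIN sys.foreign_key_columns
ON foreign_key_columns.constraint_object_id = foreign_keys.object_id
INNER JOIN sys.tables AS parent_tables
ON parent_tables.object_id = foreign_key_columns.parent_object_id
INNER JOIN sys.schemas AS parent_schemas
ON parent_schemas.schema_id = parent_tables.schema_id
INNER JOIN sys.columns AS parent_columns
ON parent_columns.object_id = foreign_key_columns.parent_object_id
AND parent_columns.column_id = foreign_key_columns.parent_column_id
INNER JOIN sys.tables AS referenced_tables
ON referenced_tables.object_id = foreign_key_columns.referenced_object_id
INNER JOIN sys.schemas AS referenced_schemas
ON referenced_schemas.schema_id = referenced_tables.schema_id
INNER JOIN sys.columns AS referenced_columns
ON referenced_columns.object_id = foreign_key_columns.referenced_object_id
AND referenced_columns.column_id = foreign_key_columns.referenced_column_id;
```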

This query joins a handful of system views to collect and organize details about every foreign key defined in the database, including column numbers, in case it is a multi-column foreign key. Once collected, this data can be joined to the existing data set that is compiled at the end of the dynamic SQL block by adjusting the existing INSERT into #data_element_list_to_process as follows:
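
The exact change is in the attached scripts; conceptually, the existing column list gains a LEFT JOIN to the foreign key temp table, and a formatted string is built for foreign_key_to. A standalone sketch of that expression (the output format here is an assumption) follows:

```sql
SELECT
    referencing_schema_name,
    referencing_table_name,
    referencing_column_name,
    referenced_schema_name + '.' + referenced_table_name + '.' + referenced_column_name +
        CASE WHEN column_count_in_constraint > 1
             THEN ' (column ' + CAST(constraint_column_id AS VARCHAR(10)) + ' of '
                  + CAST(column_count_in_constraint AS VARCHAR(10)) + ' in this foreign key)'
             ELSE '' END AS foreign_key_to
FROM #foreign_key_detail;
```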

Finally, the foreign_key_to column is added to the UPDATE section of the MERGE statement so that changes to it will be reflected in the data dictionary table:

Rerunning the stored procedure with these changes in place results in foreign key metadata being added to the table. The following is a sample of the newly populated data:

As a bonus, the foreign_key_to column includes notes for multi-column foreign keys, alerting how many columns are in the constraint and the column number for the current column.

With the ability to now automatically populate foreign key metadata, understanding which columns are a “source of truth” and which reference them can be determined programmatically. To test this, sample data will be added to a primary key column:
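
For this test, assuming the AdventureWorks sample database, notes are added to the dictionary row for Production.Product.ProductID:

```sql
UPDATE dbo.data_element_detail
    SET notes = 'Unique identifier for a product. Referenced by sales, purchasing, and inventory data.',
        last_update_time = GETUTCDATE()
WHERE database_name = 'AdventureWorks2017'
AND schema_name = 'Production'
AND table_name = 'Product'
AND column_name = 'ProductID';
```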

This results in one column being assigned notes describing its meaning.

If desired, a process could be added to the data dictionary population that joins columns that reference a primary key column and copies data elements that are typically manually entered. A list could be made of columns that this feature should be used for, or all columns could be acted on. A sample query that acts on all columns in this fashion would look like this:
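
The sketch below is one way to do this: it matches the schema.table.column prefix of foreign_key_to back to the referenced element and copies notes into any referencing element whose notes are still blank:

```sql
UPDATE child_elements
    SET notes = parent_elements.notes,
        last_update_time = GETUTCDATE()
FROM dbo.data_element_detail AS child_elements
INNER JOIN dbo.data_element_detail AS parent_elements
ON child_elements.database_name = parent_elements.database_name
AND child_elements.foreign_key_to LIKE
        parent_elements.schema_name + '.' + parent_elements.table_name + '.'
        + parent_elements.column_name + '%'
WHERE child_elements.foreign_key_to <> ''
AND (child_elements.notes = '' OR child_elements.notes IS NULL)
AND parent_elements.notes <> '';
```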

This T-SQL matches the foreign_key_to column back to its referenced data element and updates notes for all data elements that foreign key directly to a primary key column with notes entered. The following query can validate the results, returning all ProductID columns that now have notes entered in them:
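
For example:

```sql
SELECT
    database_name,
    schema_name,
    table_name,
    column_name,
    notes
FROM dbo.data_element_detail
WHERE column_name = 'ProductID'
AND notes <> '';
```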

The results show 12 data elements that have been automatically updated from the update query above:

This process is highly customizable and allows for data to systematically be updated in one place and propagated throughout the data dictionary automatically. This example only updated the notes, but the same process could update any other column for which the metadata would be similar or identical to its parent object. The ultimate goal here is to save time by reducing as much as possible the manual steps required to maintain the data dictionary.

Use in documentation

A data dictionary is not a fully-functional documentation system in itself. It contains details about data elements that can fuel documentation, but this is not a system that is friendly to non-engineers. If a sales manager wants to know what metrics could be useful for a new quarterly report, they have one of two options:

  1. Check the documentation.
  2. Ask you.

As a data dictionary nears completion, it is important to take time to determine how to integrate it into existing (or new) documentation systems. This is the step that makes this information readily available to an organization, both for technical and non-technical folks. Moreover, having people ask you day-in and day-out about documentation is tedious and a waste of your time 😊

Customization

This data dictionary is exceptionally flexible, and its user can choose to add, remove, or change the details of any data element. The processes described in this article can be reused to populate any number of columns, metrics, or metadata. The only limit is the imagination of the person implementing it.

Different systems will have different documentation needs. For example, the application name, service name, or software build in which a column was introduced could be added for analysts with development-leaning needs. Similarly, for operators interested in performance or server details, metrics such as space used, fragmentation, or reads could be included as well. If the question is asked, “Can I add X to the data dictionary?” the answer will always be yes, and the process to do so is likely no more complicated than what has been introduced in this article. Adjusting and adding to the code used here can allow for additions, modifications, or removals to be made with minimal effort or risk of breaking it.

Building a SQL Server data dictionary

A data dictionary is a versatile documentation tool that can save time, resources, and money. It can systematically reduce or remove the need for frequent, error-prone, and manual updates to extensive documentation systems. In addition, a data dictionary can be automated to further reduce the maintenance needed as applications grow and develop.

Because it is stored as a table in a database, this data dictionary can be connected to applications easily and consumed by analysts, documentation, or development teams in need of clarification on what a data element is, where it comes from, or how it works. In addition, an application can be used to update data in the data dictionary, further removing a database administrator, developer, or architect from the manual tasks of maintaining documentation and moving those tasks to the people closest to the documentation itself.

Processes that populate a data dictionary can be run regularly to maintain and update automatically populated fields, such as row counts, foreign key data, or auto-updating metadata.

While a homegrown solution is not right for every organization, it provides an easy starting point to realize the benefits of an organized documentation system and highlights the reasons why good database documentation is critical to any team that relies on databases for the success of its applications.

Customization allows for a data dictionary to be tailored to fit the needs of any application. The code presented in this article provides a starting point from which a data dictionary can evolve to be as efficient as possible. Feel free to take this code and produce the perfect documentation system for a given organization or application. If you add something new and interesting, please let me know all about it 😊