Product articles
SQL Data Catalog
Database Documentation
Documenting a SQL Server Data Catalog…

Documenting a SQL Server Data Catalog in HTML and Git Markdown

Dave Poole explains the need for high quality database documentation and then demonstrates how to document the SQL Server database for a data catalog, in both HTML and Git Markdown, using SQL Doc, SQL Data Catalog, PowerShell, and a few helper scripts to ensure consistency and correctness.

Documenting a data catalog repository

The degree to which a data catalog succeeds in making an organization’s data sources discoverable and usable will depend largely on how effectively it presents information about a rich and varied corporate data asset, to an equally rich and varied audience, with diverse needs.

The most successful catalogs will not only describe the data in each source and classify any sensitive data, but also encourage data sharing and collaboration within the organization and help people understand how to use and adapt databases to support new processes.

Although an information tool, such as SQL Data Catalog, should and does provide some built-in visualizations and dashboards, and the facility to export the catalog contents other formats, it is not practical or even possible to provide for all requirements. In addition, full documentation of the database that underpins the Data Catalog must be easily available to users both in HTML form, and in Git Markdown, for storage in a repository. By doing so, we enable the use of business intelligence tools and reporting tools, on the data catalog database, giving the widest possible audience access to information they need about their organization’s data sources.

For this task, Redgate SQL Doc is an excellent complement for generating SQL documentation for the Redgate Data Catalog product.

Why you need a metadata authoring tool

A software engineer’s enthusiasm for having high quality documentation varies inversely with their enthusiasm for being the person who provides it

Documentation is all about making useful information easily digestible. Collaboration with the readers and shared ownership of the documentation mean that the benefits of refactoring and continuous improvement apply every bit as much as they do for software code.

A system becomes “legacy” when people avoid changing it for fear of consequences that are unpredictable in both scope and nature. The term legacy system suggests poor coding structure and hygiene. Being afraid to change a system can also be driven by a lack of easily accessible and consumable information.

In an ideal world, the supply of meaningful descriptions for database objects should be part of standard development practice when objects are created or updated. However, reluctance on the part of the developer, combined with complexity of the SQL Server approach for attaching descriptions to objects means that documentation is less common that it should be.

For example, compare SQL Server’s syntax for documenting database objects, using sp_addextendedproperty and sp_updateextendedproperty, to the simple syntax provided by platforms based on PostGres such as AWS Redshift or Vertica:

This is where the use of Redgate SQL Doc as a documentation authoring tool becomes invaluable. Data stewards, business analysts and data subject matter experts can carry out the necessary software archeology to attach meaningful descriptions to database objects. By backfilling these descriptions into the source code repository, we can ensure they are not lost if objects are dropped and recreated.

However, to be effective, we need to ensure that SQL Doc produces consistent documentation, applying all the principles of Information Mapping correctly, when it produces database documentation in either HTML or Markdown.

Extending the capabilities of Redgate SQL Doc

Redgate SQL Doc can produce documentation in several formats:

  • MS Word
  • CHM (compiled HTML Help file)
  • PDF
  • HTML
  • Git Markdown

We are going to focus on HTML and Git Markdown, as web pages and Git-compatible source control systems are almost ubiquitous. We’re going to make the following enhancements, in line with Information Mapping principles:

Enhancement Information Mapping Principle Effect
Default Git markdown documents (README.md) Consistency. Unless a specific file is selected the README.md file is always displayed by default when navigating around a Git repository structure
Hyperlink support Accessible Details. Provides readers with the option to navigate to external supporting content.

An example would be to hyperlink from Redgate Data Catalog SQL Doc output to product documentation

Support for bulleted lists Chunking

Relevance.

When used wisely a bulleted list presents a discreet emphasis on a focused piece of information

For the purposes of testing our process for adapting Redgate SQL Doc’s output, to achieve the above, I used sp_addextendedproperty to add some table and column comments to the Redgate Data Catalog database.

The following comment adds a description of the TagCategories table, using a hyperlink in Git markdown format that will provide users with a reference for further details:

Here, I’m adding a description to the Name column of the AssetType table, using a bulleted list in Git Markdown format:

Documenting the SQL Data Catalog database

For broader details of my general approach to documenting a SQL Server database, you can refer to my SQLServerCentral.com article Re-oiling the gears for the data dictionary.

Here, I’m going to focus on our automated process for generating an initial documentation set, in HTML and Markdown, and then adapting the output so that the documentation always renders the correct files, correctly and consistently formatted. The diagram below illustrates our overall process:

Documenting a SQL Server database in Git Markdown and HTML

In short, we execute a PowerShell script that generates the initial documentation set for the SQL Data Catalog, in both formats (Step 1), and then calls each of a set of four PowerShell “helper scripts”. These scripts will ensure that SQL Doc always produce the correct default files for each format (Step 2), with correct and consistent formatting of items such as hyperlinks (Step 3) and bullets (Step 4), and always renders navigable content when browsing through Git folders (Step 5).

STEP 1: Generating the initial database documentation set

The foundations for the first step were laid in the article Getting started with SQLDoc and Powershell by Phil Factor. I used these techniques to create RunSQLDoc.ps1 file as shown below:

It’s important to agree a standard both for the path to the SQL Doc project, and for the location and folder to use when generating output. If you don’t specify the output path then, by default, SQL Doc places it in the user’s personal profile, and saves that location in the .sqldoc project file, and it will not run on another machine that does not have that profile.

The Get-ChildItem..|..Remove-Item construct is used instead of the SQLDoc /force option because we are going to rename the index.md files to readme.md. Doing so ensures that the documentation generation has the same starting state every time.

STEP 2: Converting index.md files to readme.md

On a webserver if we use a URL path without specifying a file it will try to render a default document. Unless configured differently this will usually be an index.html in that path. In a Git repository the equivalent of a default document is readme.md.

However, when asked to produce documentation in Git Markdown format SQL Doc simply renders an index.md file instead of the index.html file it would use for web documentation. Therefore, in this step, we need to rename any index.md files to readme.md.

If we rename the index.md files are renamed to readme.md then any links in the markdown documentation set that point to the original index.md files must also be updated. The approach taken was to create an index_to_readme.ps1 file within a ScriptHelpers subfolder, that does this job, as follows:

You’ll see a call to this script in the RunSQLDoc.ps1 file, described in Step 1:

When RunSQLDoc.ps1 is run the output will be as follows

  • HTML documentation set
  • Markdown documentation set
  • Default readme.md files
  • Documentation breadcrumb trails and links correctly using readme.md as targets

STEP 3: Rendering links embedded in descriptions

SQL Doc already creates hyperlinks between the database objects it is documenting. However, linking to external sources, such as the product documentation, presents us with a challenge. If we attempt to put HTML into object descriptions the product will html-encode it. A hyperlink such as the following:

Will be converted to:

Instead, we use markdown (MD) format for the hyperlinks, as follows:

This would render correctly when Redgate SQL Doc publishes Markdown files but not if we asked for HTML files.

Git markdown descriptions and SQL Prompt

A consideration for using Git markdown in descriptions of database objects is that Redgate SQL Prompt will render whatever is in the description as a tooltip. Inclusion of Git markdown may make these tooltips less readable. Consult your SQL Prompt users to check what impact this might have on them.

In any case, you can see that the link structures differ depending on the rendering format:

Format 1st Param 2nd Param 3rd Param
HTML Hyperlink Tooltip text Link text
MD Link text Hyperlink Tooltip text

The hyperlink and link text must both be present. The tooltip text is optional in either case. We need a PowerShell script to extract the three parameters from a Git markdown link and render the equivalent HTML.

For a given HTML file created by Redgate SQL Doc we will want to do the following.

  1. Read the content of the file
  2. Find all strings that match the Git markdown hyperlink pattern
  3. Put each match into an array
  4. For each unique instance of a link replace the MD link with the HTML equivalent
  5. Write the modified content back into the file

In our ScriptHelpers folder we create an html_link.ps1 file as follows.

The key to this script is to understand the RegEx expression held in the $md_markdown_pattern variable. This assumes that the URL for the hyperlink does not contain parameters.

The best way to gain an understanding of the RegEx is to post the expression into https://regex101.com/. This online tester provides a good explanation for this RegEx expression and any other RegEx expression you may wish to use.

In short, the expression is set to dissect a Git Markdown hyperlink into 4 parts known as groups

  • Group 0 = The entire matched string
  • Group 1 = Everything between a pair of square brackets including the brackets themselves
  • Group 2 = Everything after a left round bracket
  • Group 3 = The space between the Git markdown hyperlink and the end bracket
  • Group 4 = Everything between two double-quote characters

We also need a script file to feed in the HTML files into html_link.ps1

For this we create md_to_html_link.ps1 which is as follows.

Finally, we add the call to our md_to_html_link.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.

STEP 4: Rendering bullet lists in Git markdown files

When used appropriately, bulleted lists present information in a succinct and eye-catching manner. Although the current version of Redgate SQL Doc does not support bulleted lists directly we can simulate them using git markdown format. Using the dbo.AssetTypes table of the Redgate Data Catalog database as an example, the following description would render correctly in markdown files and acceptably in HTML output.

Describes the type of object within the data catalog that can be categorized
* SQL Server Instance
* Database
* Schema
* Table
* Column

However, the column descriptions are rendered in git markdown tables and these are fragile and will break for multi-line comments as shown in the “Columns” section of the screen snippet below:

poorly formatted bullets lists in database documentation

To get the description of the Name column to render correctly as Git markdown we would have to change it into the following:

This would allow the markdown to render correctly as follows:

Correctly formatted bullet lists in Git Markdown files

Where Steps 2 and 3 were simple replace statements, editing files using Get-Content/Set-Content, Step 4 requires us to understand when a number of things about our bullet list:

  • Whether it in the Description column of the database columns section of the document
  • Where it begins and ends
  • Whether it has any preceding content

I found the easiest way to implement the ruleset was to read the original file, write out a modified version to a temporary file and then copy the temporary file back over the original at the end of the process

The code to implement the various rules is as shown below (MarkdownList.ps1):


As the code above works on a single file, we need a script that will iterate over the markdown files feeding each one into this script. For this we create a md_to_bullet_list.ps1 list as follows.

Finally, we add the call to our md_to_bullet_list.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.

STEP 5: Default README.md for main objects

Redgate SQL Doc renders a file that provides a descriptive list of the contents of folder.

  • In the Tables folder there is a Tables.md file
  • In the Views folder there is a Views.md file
  • …etc.…

Simply copying the md file whose name patches the parent folder name to README.md will ensure that browsing git folders will always render navigable content

We can create an object_to_readme.ps1 file in our ScriptHelpers folder as follows.

Finally, we add the call to our object_to_readme.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.

Even if someone accesses a folder in the Git repository rather than navigating the documents using the Redgate SQL Doc generated hyperlinks they will still see the SQL Doc rendered list of objects relevant to that folder.

Closing thoughts

Amazon S3 storage, Google Storage and Azure storage all support static website hosting. This means that the HTML output from Redgate SQL Doc can be hosted at extremely low-cost while being secured to only internal company users.

Git-based PaaS products do much the same for the Git Markdown produced by the product.

Whether you use a Git based repository or website to host the output from Redgate SQL Doc, the simple changes described in this article will benefit software and data engineers, business and data analysts and any other company audience who need to understand the structure of the company databases.

By providing an easily accessible source of information about internal database systems you are mitigating a risk that manifests in a fear to change a system. In so doing you are helping to prevent a working system being prematurely classed as “legacy”.

 

Tools in this post

SQL Data Catalog

Accelerate identification and classification of sensitive data

Find out more

SQL Doc

Document your SQL Server databases automatically

Find out more

SQL Toolbelt Essentials

The industry-standard tools for SQL Server development and deployment.

Find out more