Documenting a SQL Server Data Catalog in HTML and Git Markdown
Dave Poole explains the need for high quality database documentation and then demonstrates how to document the SQL Server database for a data catalog, in both HTML and Git Markdown, using SQL Doc, SQL Data Catalog, PowerShell, and a few helper scripts to ensure consistency and correctness.
Documenting a data catalog repository
The degree to which a data catalog succeeds in making an organization’s data sources discoverable and usable will depend largely on how effectively it presents information about a rich and varied corporate data asset, to an equally rich and varied audience, with diverse needs.
The most successful catalogs will not only describe the data in each source and classify any sensitive data, but also encourage data sharing and collaboration within the organization and help people understand how to use and adapt databases to support new processes.
Although an information tool, such as SQL Data Catalog, should and does provide some built-in visualizations and dashboards, and the facility to export the catalog contents other formats, it is not practical or even possible to provide for all requirements. In addition, full documentation of the database that underpins the Data Catalog must be easily available to users both in HTML form, and in Git Markdown, for storage in a repository. By doing so, we enable the use of business intelligence tools and reporting tools, on the data catalog database, giving the widest possible audience access to information they need about their organization’s data sources.
For this task, Redgate SQL Doc is an excellent complement for generating SQL documentation for the Redgate Data Catalog product.
Why you need a metadata authoring tool
A software engineer’s enthusiasm for having high quality documentation varies inversely with their enthusiasm for being the person who provides it
Documentation is all about making useful information easily digestible. Collaboration with the readers and shared ownership of the documentation mean that the benefits of refactoring and continuous improvement apply every bit as much as they do for software code.
A system becomes “legacy” when people avoid changing it for fear of consequences that are unpredictable in both scope and nature. The term legacy system suggests poor coding structure and hygiene. Being afraid to change a system can also be driven by a lack of easily accessible and consumable information.
In an ideal world, the supply of meaningful descriptions for database objects should be part of standard development practice when objects are created or updated. However, reluctance on the part of the developer, combined with complexity of the SQL Server approach for attaching descriptions to objects means that documentation is less common that it should be.
For example, compare SQL Server’s syntax for documenting database objects, using sp_addextendedproperty and sp_updateextendedproperty, to the simple syntax provided by platforms based on PostGres such as AWS Redshift or Vertica:
1 |
COMMENT ON {object} IS '{meaningful description}'; |
This is where the use of Redgate SQL Doc as a documentation authoring tool becomes invaluable. Data stewards, business analysts and data subject matter experts can carry out the necessary software archeology to attach meaningful descriptions to database objects. By backfilling these descriptions into the source code repository, we can ensure they are not lost if objects are dropped and recreated.
However, to be effective, we need to ensure that SQL Doc produces consistent documentation, applying all the principles of Information Mapping correctly, when it produces database documentation in either HTML or Markdown.
Extending the capabilities of Redgate SQL Doc
Redgate SQL Doc can produce documentation in several formats:
- MS Word
- CHM (compiled HTML Help file)
- HTML
- Git Markdown
We are going to focus on HTML and Git Markdown, as web pages and Git-compatible source control systems are almost ubiquitous. We’re going to make the following enhancements, in line with Information Mapping principles:
Enhancement | Information Mapping Principle | Effect |
Default Git markdown documents (README.md) | Consistency. | Unless a specific file is selected the README.md file is always displayed by default when navigating around a Git repository structure |
Hyperlink support | Accessible Details. | Provides readers with the option to navigate to external supporting content.
An example would be to hyperlink from Redgate Data Catalog SQL Doc output to product documentation |
Support for bulleted lists | Chunking
Relevance. |
When used wisely a bulleted list presents a discreet emphasis on a focused piece of information |
For the purposes of testing our process for adapting Redgate SQL Doc’s output, to achieve the above, I used sp_addextendedproperty
to add some table and column comments to the Redgate Data Catalog database.
The following comment adds a description of the TagCategories
table, using a hyperlink in Git markdown format that will provide users with a reference for further details:
1 2 3 4 5 6 7 8 9 |
EXECUTE sp_addextendedproperty @level0type = N'SCHEMA', @level0name = N'dbo', @level1type = N'TABLE', @level1name = N'TagCategories, @level2type = NULL, @level2name = NULL @name = N'MS_Description', @value = N'A tag category is a grouping entity to which a number of tags can be associated. See [Open taxonomy](https://documentation.red-gate.com/sql-data-catalog/taxonomy/open-taxonomy "SQL Data Catalog taxonomy documentation")'; |
Here, I’m adding a description to the Name column of the AssetType
table, using a bulleted list in Git Markdown format:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
EXECUTE sp_addextendedproperty @level0type = N'SCHEMA', @level0name = N'dbo', @level1type = N'TABLE', @level1name = N'AssetType', @level2type = N'COLUMN', @level2name = N'Name' @name = N'MS_Description', @value = N' Describes the type of object within the data catalog that can be categorized * SQL Server Instance * Database * Schema * Table * Column'; |
Documenting the SQL Data Catalog database
For broader details of my general approach to documenting a SQL Server database, you can refer to my SQLServerCentral.com article Re-oiling the gears for the data dictionary.
Here, I’m going to focus on our automated process for generating an initial documentation set, in HTML and Markdown, and then adapting the output so that the documentation always renders the correct files, correctly and consistently formatted. The diagram below illustrates our overall process:
In short, we execute a PowerShell script that generates the initial documentation set for the SQL Data Catalog, in both formats (Step 1), and then calls each of a set of four PowerShell “helper scripts”. These scripts will ensure that SQL Doc always produce the correct default files for each format (Step 2), with correct and consistent formatting of items such as hyperlinks (Step 3) and bullets (Step 4), and always renders navigable content when browsing through Git folders (Step 5).
STEP 1: Generating the initial database documentation set
The foundations for the first step were laid in the article Getting started with SQLDoc and Powershell by Phil Factor. I used these techniques to create RunSQLDoc.ps1 file as shown below:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 |
Set-Alias sqldoc 'C:\Program Files (x86)\Red Gate\SQL Doc 5\SQLDoc.exe' -Scope Script # Location where RedGate SQLDoc projects are held $SQLDoc_Project = "C:\code\SQLDoc\Projects\RedgateDataCatalog.sqldoc" # Location where RedGate SQLDoc will be output $SQLDoc_Path = "C:\Documentation\SQLDoc\DEEPTHOUGHT_documentation" # Make sure we start with a clean folder if (Test-Path -PathType Container "$SQLDoc_Path") { Get-ChildItem -Path $SQLDoc_Path -Recurse|Remove-Item -Force -Recurse } # Use an array of associative arrays to provide parameters for # Folder creation and SQLDoc runs @( @{ "Path" = "$SQLDoc_Path\Markdown"; FileType = 'md' }, @{ "Path" = "$SQLDoc_Path\HTML"; FileType = 'html' } ) | foreach { $output_folder = $_.Path; $filetype = $_.FileType if (! (Test-Path -PathType Container $output_folder)) { # not there, so we create the report directory New-Item -ItemType Directory -Force -Path $output_folder; } # Generate the output sqldoc /server:DEEPTHOUGHT /database:Redgate_SqlDataCatalog /project:$SQLDoc_Project /outputfolder:"$output_folder" /fileType:"$filetype" if ( $filetype -ieq "md") { & ".\ScriptHelpers\index_to_readme.ps1" "$output_folder" & ".\ScriptHelpers\md_to_bullet_list.ps1" "$output_folder" & ".\ScriptHelpers\object_to_readme.ps1" "$output_folder" } else { & ".\ScriptHelpers\md_to_html_link.ps1" "$output_folder" } } |
It’s important to agree a standard both for the path to the SQL Doc project, and for the location and folder to use when generating output. If you don’t specify the output path then, by default, SQL Doc places it in the user’s personal profile, and saves that location in the .sqldoc project file, and it will not run on another machine that does not have that profile.
The Get-ChildItem..|..Remove-Item
construct is used instead of the SQLDoc /force
option because we are going to rename the index.md files to readme.md. Doing so ensures that the documentation generation has the same starting state every time.
STEP 2: Converting index.md files to readme.md
On a webserver if we use a URL path without specifying a file it will try to render a default document. Unless configured differently this will usually be an index.html in that path. In a Git repository the equivalent of a default document is readme.md.
However, when asked to produce documentation in Git Markdown format SQL Doc simply renders an index.md file instead of the index.html file it would use for web documentation. Therefore, in this step, we need to rename any index.md files to readme.md.
If we rename the index.md files are renamed to readme.md then any links in the markdown documentation set that point to the original index.md files must also be updated. The approach taken was to create an index_to_readme.ps1 file within a ScriptHelpers subfolder, that does this job, as follows:
1 2 3 4 5 6 7 8 9 10 |
[cmdletbinding()] param([string]$Foldername) # Git repos automatically render readme.md files so change the index.md files to readme.md files Get-ChildItem -Path $Foldername -Recurse -File -Filter "index.md"|Rename-Item -Force -NewName "readme.md" # As all references point to index.md change ALL references across all md files to them to point to readme.md Get-ChildItem -Path $Foldername -Recurse -File -Filter "*.md"|ForEach-Object{ (Get-Content -Path $_.FullName).Replace("index.md","readme.md")|Set-Content -Path $_.FullName } |
You’ll see a call to this script in the RunSQLDoc.ps1 file, described in Step 1:
1 |
& ".\ScriptHelpers\index_to_readme.ps1" "$documentation_path" |
When RunSQLDoc.ps1 is run the output will be as follows
- HTML documentation set
- Markdown documentation set
- Default readme.md files
- Documentation breadcrumb trails and links correctly using readme.md as targets
STEP 3: Rendering links embedded in descriptions
SQL Doc already creates hyperlinks between the database objects it is documenting. However, linking to external sources, such as the product documentation, presents us with a challenge. If we attempt to put HTML into object descriptions the product will html-encode it. A hyperlink such as the following:
1 |
<a href=<a href="https://www.red-gate.com/products/dba/sql-data-catalog/">https://www.red-gate.com/products/dba/sql-data-catalog/</a> title="Getting Started with the Redgate SQL Data Catalog">Data catalog</a> |
Will be converted to:
1 |
<a href=https://www.red-gate.com/products/dba/sql-data-catalog/ title="Getting Started with the Redgate SQL Data Catalog"<Data catalog</a> |
Instead, we use markdown (MD) format for the hyperlinks, as follows:
1 |
[SQL Data Catalog]( <a href="https://www.red-gate.com/products/dba/sql-data-catalog/">https://www.red-gate.com/products/dba/sql-data-catalog/</a> "Getting Started with the Redgate SQL Data Catalog">SQL Data Catalog") |
This would render correctly when Redgate SQL Doc publishes Markdown files but not if we asked for HTML files.
Git markdown descriptions and SQL Prompt
A consideration for using Git markdown in descriptions of database objects is that Redgate SQL Prompt will render whatever is in the description as a tooltip. Inclusion of Git markdown may make these tooltips less readable. Consult your SQL Prompt users to check what impact this might have on them.
In any case, you can see that the link structures differ depending on the rendering format:
Format | 1st Param | 2nd Param | 3rd Param |
HTML | Hyperlink | Tooltip text | Link text |
MD | Link text | Hyperlink | Tooltip text |
The hyperlink and link text must both be present. The tooltip text is optional in either case. We need a PowerShell script to extract the three parameters from a Git markdown link and render the equivalent HTML.
For a given HTML file created by Redgate SQL Doc we will want to do the following.
- Read the content of the file
- Find all strings that match the Git markdown hyperlink pattern
- Put each match into an array
- For each unique instance of a link replace the MD link with the HTML equivalent
- Write the modified content back into the file
In our ScriptHelpers folder we create an html_link.ps1 file as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 |
# The purpose of this script is to replace a git markdown link with an html link. [cmdletbinding()] param([string]$inputFileName) $fileContent = Get-Content $inputFileName $md_markdown_pattern='((?:\[)[\w+\s*]+(?:\]))\(([a-zA-Z0-9:\/_\.\?=-]+)(\s)?("[\w\s]* ")? *\)' $results=$fileContent|Select-String -Pattern $md_markdown_pattern -AllMatches # We only want unique strings because the replace command takes care of multiple instances foreach ($item in $results.Matches.Value|Sort-Object|Get-Unique) { $newItem=$item|Select-String -Pattern $md_markdown_pattern -AllMatches # This is the full markdown link to be replaced $markdown_link = $newItem.Matches.Groups[0] # Build the HTML hyperlink $html_link ='<a href="{0}" title="{2}">{1}' -f $newItem.Matches.Groups[2], $newItem.Matches.Groups[1].ToString().Replace("[","").Replace("]",""), $newItem.Matches.Groups[4].ToString().Replace("" ","") # Replace the MD hyperlink with the HTML hyperlink $fileContent=$fileContent.Replace($markdown_link,$html_link) } Set-Content -Path $inputFileName -Value $fileContent |
The key to this script is to understand the RegEx expression held in the $md_markdown_pattern
variable. This assumes that the URL for the hyperlink does not contain parameters.
The best way to gain an understanding of the RegEx is to post the expression into https://regex101.com/. This online tester provides a good explanation for this RegEx expression and any other RegEx expression you may wish to use.
In short, the expression is set to dissect a Git Markdown hyperlink into 4 parts known as groups
- Group 0 = The entire matched string
- Group 1 = Everything between a pair of square brackets including the brackets themselves
- Group 2 = Everything after a left round bracket
- Group 3 = The space between the Git markdown hyperlink and the end bracket
- Group 4 = Everything between two double-quote characters
We also need a script file to feed in the HTML files into html_link.ps1
For this we create md_to_html_link.ps1 which is as follows.
1 2 3 4 5 6 7 |
[cmdletbinding()] param([string]$Foldername) # Iterate over every HTMNL file in the folder. Get-ChildItem -Path $Foldername -Recurse -File -Filter "*.html"|ForEach-Object{ ./ScriptHelpers/html_link.ps1 $_.FullName } |
Finally, we add the call to our md_to_html_link.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.
1 |
& ".\ScriptHelpers\md_to_html_link.ps1" "$documentation_path" |
STEP 4: Rendering bullet lists in Git markdown files
When used appropriately, bulleted lists present information in a succinct and eye-catching manner. Although the current version of Redgate SQL Doc does not support bulleted lists directly we can simulate them using git markdown format. Using the dbo.AssetTypes
table of the Redgate Data Catalog database as an example, the following description would render correctly in markdown files and acceptably in HTML output.
Describes the type of object within the data catalog that can be categorized
* SQL Server Instance
* Database
* Schema
* Table
* Column
However, the column descriptions are rendered in git markdown tables and these are fragile and will break for multi-line comments as shown in the “Columns” section of the screen snippet below:
To get the description of the Name
column to render correctly as Git markdown we would have to change it into the following:
1 |
<ul><li>SQL Server Instance</li><li>Database</li><li>Schema</li><li>Table</li><li>Column</li></ul> |
This would allow the markdown to render correctly as follows:
Where Steps 2 and 3 were simple replace statements, editing files using Get-Content/Set-
Content, Step 4 requires us to understand when a number of things about our bullet list:
- Whether it in the
Description
column of the database columns section of the document - Where it begins and ends
- Whether it has any preceding content
I found the easiest way to implement the ruleset was to read the original file, write out a modified version to a temporary file and then copy the temporary file back over the original at the end of the process
The code to implement the various rules is as shown below (MarkdownList.ps1):
As the code above works on a single file, we need a script that will iterate over the markdown files feeding each one into this script. For this we create a md_to_bullet_list.ps1 list as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
[cmdletbinding()] param([string]$Foldername) # Iterate over every HTML file in the folder Get-ChildItem -Path $Foldername -Recurse -File -Filter "*.md" -Exclude "tmp_output.md"|ForEach-Object{ $temporary_file = Join-Path -Path $_.DirectoryName -ChildPath "tmp_output.md" ./ScriptHelpers/MarkDownList.ps1 $_.FullName > $temporary_file # Copy the tmp_output.md file back over the original file. Copy-Item -Path $temporary_file -Destination $_.FullName -Force # Clean up the temporary file Remove-Item -Path $temporary_file } |
Finally, we add the call to our md_to_bullet_list.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.
1 |
& ".\ScriptHelpers\md_to_bullet_list.ps1" "$documentation_path" |
STEP 5: Default README.md for main objects
Redgate SQL Doc renders a file that provides a descriptive list of the contents of folder.
- In the Tables folder there is a Tables.md file
- In the Views folder there is a Views.md file
- …etc.…
Simply copying the md file whose name patches the parent folder name to README.md will ensure that browsing git folders will always render navigable content
We can create an object_to_readme.ps1 file in our ScriptHelpers folder as follows.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
[cmdletbinding()] param([string]$Foldername) Clear-Host Get-ChildItem -Path $Foldername -Recurse -File -Filter "*.md"|ForEach-Object{ # We are only interested in the immediate parent folder and the file without any extension $ParentFolder = (Get-Item $_.Directory).BaseName $MarkDownFileWithoutExtension = $_.BaseName if($ParentFolder -eq $MarkDownFileWithoutExtension){ $readme_file = Join-Path -Path $_.DirectoryName -ChildPath "README.md" Copy-Item -Path $_.FullName -Destination $readme_file -Force } } |
Finally, we add the call to our object_to_readme.ps1 script to the bottom of the RunSQLDoc.ps1 file created in Step 1.
1 |
& ".\ScriptHelpers\object_to_readme.ps1" "$documentation_path" |
Even if someone accesses a folder in the Git repository rather than navigating the documents using the Redgate SQL Doc generated hyperlinks they will still see the SQL Doc rendered list of objects relevant to that folder.
Closing thoughts
Amazon S3 storage, Google Storage and Azure storage all support static website hosting. This means that the HTML output from Redgate SQL Doc can be hosted at extremely low-cost while being secured to only internal company users.
Git-based PaaS products do much the same for the Git Markdown produced by the product.
Whether you use a Git based repository or website to host the output from Redgate SQL Doc, the simple changes described in this article will benefit software and data engineers, business and data analysts and any other company audience who need to understand the structure of the company databases.
By providing an easily accessible source of information about internal database systems you are mitigating a risk that manifests in a fear to change a system. In so doing you are helping to prevent a working system being prematurely classed as “legacy”.