{"id":100890,"date":"2023-12-18T14:47:04","date_gmt":"2023-12-18T14:47:04","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=100890"},"modified":"2023-12-14T15:42:19","modified_gmt":"2023-12-14T15:42:19","slug":"using-a-sql-tokenizer","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/theory-and-design\/using-a-sql-tokenizer\/","title":{"rendered":"Using a SQL Tokenizer"},"content":{"rendered":"<p>Imagine this: you have several directories full of SQL script files, and you need to know where a certain table is used. You\u2019d rather like the context too, so you can check the whole SQL Expression and work out why it is running so slowly. Maybe, from that same daunting set of several directories, you need to search for a comment, either end of line, or block comment, perhaps structured. It could be that you just need to execute each query or statement in turn to <a href=\"https:\/\/www.red-gate.com\/hub\/product-learning\/flyway\/running-structured-database-tests-in-flyway\">check performance<\/a>.<\/p>\n<p>It\u2019s not unusual to want to search just within strings. Although simple searches can be effective, you will at some point need a tool that is able to recognise and return a collection of strings representing the SQL code, divided up into the respective components of the SQL language.<\/p>\n<p>For this article, we\u2019ll use a PowerShell cmdlet, called <a href=\"https:\/\/github.com\/Phil-Factor\/FlywayTeamwork-Pubs\/blob\/main\/Resources\/Tokenize-SQLString.ps1\">Tokenize-SQLString<\/a>, which is in my GitHub repository.<\/p>\n<div class=\"c-message_kit__gutter__right\" role=\"presentation\" data-qa=\"message_content\">\n<div class=\"c-message_kit__blocks c-message_kit__blocks--rich_text\">\n<div class=\"c-message__message_blocks c-message__message_blocks--rich_text\" data-qa=\"message-text\">\n<div class=\"p-block_kit_renderer\" data-qa=\"block-kit-renderer\">\n<div class=\"p-block_kit_renderer__block_wrapper p-block_kit_renderer__block_wrapper--first\">\n<div class=\"p-rich_text_block\" dir=\"auto\">\n<div class=\"p-rich_text_section\">Just so you can get the feel for what a tokenizer does, here is an example, where someone has embedded SQL statements that might mislead the unwary.<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<\/div>\n<pre class=\"lang:ps theme:powershell-ise\">@'\r\n\/* Select * from dbo.othertable *\/\r\ndrop view if exists dbo.Book_Purchases_By_Date; \r\n--drop the Book_Purchases_By_Date view\r\nSelect 'create table dbo.YetAnothertable'\r\n'@|Tokenize-SQLString|Out-GridView <\/pre>\n<p>This will provide the following stream of objects and finds the reference:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"610\" height=\"258\" class=\"wp-image-100891\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-screenshot-of-a-computer-description-automatica-9.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>You\u2019ll notice that it provides the location in the string (in the <code>Index<\/code> column of the output), and even the column and line in the query text. I&#8217;ve described elsewhere how to use the same technique to <a href=\"https:\/\/www.red-gate.com\/hub\/product-learning\/flyway\/finding-external-references-in-database-deployments\">find external (3- and 4-part) references<\/a>.<\/p>\n<p>To achieve all this, you might think that nothing, but a parser will do. Sure, it would be nice if you have unlimited time, but really all you need is a tokenizer. The end-product of a tokenizer is generally the input to a parser. The tokenizer splits the input stream into individual language tokens such as strings, identifiers, keywords, and punctuation. It doesn\u2019t need to be recursive, because, although elements of the SQL language are recursive, the input stream isn\u2019t recursive. I use a tokenizer for some routine tasks such as for transferring comments associated with a table\u2019s creation, and its columns, to the database metadata (assuming the RDBMS allows this).<\/p>\n<h2>Putting a tokenizer to work<\/h2>\n<p>As a playground, we can use for experiment a collection of SQL Expressions and DDL statements called the SQL Exerciser for Pubs (two flavours so far, <a href=\"https:\/\/github.com\/Phil-Factor\/FlywayTeamwork-Pubs\/blob\/main\/Pubs\/Tests\/TheSQLServerPubsExerciser.sql\">SQL Server<\/a> and <a href=\"https:\/\/github.com\/Phil-Factor\/FlywayTeamwork-Pubs\/blob\/main\/PubsPG\/Branches\/Develop\/Tests\/PGPubsExerciser.sql\">PostgreSQL<\/a>) It includes several different types of SQL Expression.<\/p>\n<h3>List different types of documentation<\/h3>\n<p>Here we list all the Javadoc documentation in the file (the fields can be formatted in Javadoc format or as YAML). <a href=\"https:\/\/www.baeldung.com\/javadoc\">JavaDocs<\/a> remains a very good structured way of documenting any database objects. They are easily extracted by a tokenizer and then used to generate documentation and to apply \u2018comments\u2019 to the objects in whatever way is supported by the RDBMS. (in the following code, the multiline comment is a JavaDoc due to the preceding <code>\/**<\/code> instead of the single asterisk character <code>\/*<\/code>.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql'\r\nTokenize-SQLString (Get-Content $TheFileName -raw)|\r\nwhere {$_.Name  -ieq  'JavaDoc'}<\/pre>\n<p>I\u2019ll just show the first returned object:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" class=\"wp-image-100892\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-blue-screen-with-white-text-description-automat.png\" alt=\"A blue screen with white text\n\nDescription automatically generated\" width=\"857\" height=\"297\" \/><\/p>\n<p>We can just as easily get the block comments. Having got these, it is very easy to search for strings within them.<\/p>\n<pre class=\"lang:ps theme:powershell-ise \">$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql'\r\nTokenize-SQLString (Get-Content $TheFileName -raw)|\r\n  where {$_.Name  -in ('BlockComment','EndOfLineComment')}|\r\n   foreach {[pscustomobject]@{\u2018line\u2019=$_.line;\u2019comment\u2019=$_.Value }}|\r\n    Out-GridView <\/pre>\n<p>This outputs the following list in a new window.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"669\" height=\"297\" class=\"wp-image-100893\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-screenshot-of-a-computer-description-automatica-10.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<p>I\u2019ve simply filtered for just those tokens that, in the first example, are Javadoc, and in the second example are either <code>BlockComment<\/code> or <code>EndOfLineComment<\/code>. Any of these token names can be used to filter with <code>JavaDoc<\/code><em> , <\/em><code>BlockComment<\/code><em>, <\/em><code>EndOfLineComment<\/code><em>, <\/em><code>String<\/code><em>, <\/em><code>number<\/code><em>, <\/em><code>identifier<\/code><em>, <\/em><code>Operator<\/code> and <code>Punctuation<\/code>.<\/p>\n<h3>Chopping up SQL expressions into an array of executable strings<\/h3>\n<p>For the next example, we\u2019ll chop up a file (or variable) full of SQL expressions and batches into a string array that you can then execute, record or squirrel away however you wish.<\/p>\n<pre class=\"lang:ps theme:powershell-ise \"># The path and filename of the sql file you want chopped up\r\n$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql' \r\n$Terminators =@('go',';') \r\n&lt;# The list of terminators for SQL Expressions or batches\r\nin SQL Server, you may just want the batches, in which as it would be\r\n- $Terminators =@('go') #&gt;\r\n#you'll need the contents in a variable \r\n$TheScriptContent = Get-Content $TheFileName -raw\r\n#This is the main pipeline that produces an array of SQL \r\n#Expressions in the original order \r\n$Expressions=$TheScriptContent|Tokenize-SQLString | where {\r\n    $_.value -in $Terminators\r\n  } | # because we know the position in the file from the   \r\n#tokenizer, that's all we need to chop the file up\r\n    foreach -begin { $StartIndex = 0 } {\r\n    if ($_.value -eq 'GO') {$TerminatorLength=2} else \r\n                                      {$TerminatorLength=0}\r\n    $EndIndex = $_.Index; \r\n#We chop up the string occuring before the terminator\r\n    \"$($TheScriptContent.Substring(\r\n            $StartIndex, $EndIndex - $StartIndex \u2013 \r\n                                         $TerminatorLength+1\r\n            ).Trim())\" # chopped up and trimmed\r\n        $StartIndex = $EndIndex + $TerminatorLength+1;\r\n  }|where { !([string]::IsNullOrWhiteSpace($_))} \r\n#no point in empty batches.<\/pre>\n<p>You can then use the following code to make a prettier, textual listing\u201d<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">1..$Expressions.count | foreach {\r\n  [pscustomobject] @{'No.'=$_;'Expression'=$Expressions[$_-1]}} <\/pre>\n<p>This will output the following listing:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"946\" height=\"689\" class=\"wp-image-100894\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-blue-screen-with-white-text-description-automat-1.png\" alt=\"A blue screen with white text\n\nDescription automatically generated\" \/><\/p>\n<p>This will, obviously, store block comments and Javadocs in the same string. There will be times, in SQL Server, when you will want to use the client-side batch delimiter \u2018GO&#8217; to run entire batches rather than simple expressions, and so I\u2019ve scripted this to allow for both types of statement terminators\/batch separator, and any other batch separator you might be using. It removes \u2018GO\u2019 as it is client-side only but leaves the semicolon.<\/p>\n<h3>Return all table references.<\/h3>\n<p>Next, we\u2019ll list all the permanent tables that are referenced in a file full of SQL Statements. It\u2019s a bit more complicated, but manageable.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql'\r\n$Terminators =@('go',';') &lt;# The list of terminators for SQL \r\n#Expressions or batches\r\n in SQL Server, you may just want the \r\n#batches, in which as it would be  - $Terminators =@(\u2018go\u2019) #&gt;\r\n# The path and filename of the sql file you want chopped up\r\n# you'll need the contents in a variable \r\n$TheScriptContent = Get-Content $TheFileName -raw\r\nGet-Content $TheFileName -raw|Tokenize-SQLString|foreach {\r\n    if (($_.Type -like '*Part Dotted Reference') `\r\n                   -and ($_.Name -eq 'identifier'))\r\n            { write-output $_.Value; }\r\n         }|  Sort-Object -Unique <\/pre>\n<p>This will return a list that looks similar to the following:<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"949\" height=\"242\" class=\"wp-image-100895\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-blue-rectangle-with-white-lines-description-aut.png\" alt=\"A blue rectangle with white lines\n\nDescription automatically generated\" \/><\/p>\n<p>It will include every table reference in the file, but not the local table references as you will get with CTEs.<\/p>\n<h3>Return table references separately for every SQL expression.<\/h3>\n<p>In a previous example, we separated each SQL Statement. For the next example, we want to get all the table references in every SQL Expression separately, including those in SQL Functions and views. This comes in handy where you cannot get dependencies from the live database, if you are using a database system that doesn\u2019t provide them.<\/p>\n<p><em>Note: To execute the following code you will need to install Powershell YAML that is located here: <a href=\"https:\/\/github.com\/cloudbase\/powershell-yaml\">https:\/\/github.com\/cloudbase\/powershell-yaml<\/a> using the following call::<\/em><\/p>\n<pre class=\"lang:ps theme:powershell-ise\">Install-Module powershell-yaml<\/pre>\n<p>Executing the following block of code:<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql'\r\n$TheScriptContent = Get-Content $TheFileName -raw\r\n$Terminator = ';'\r\n$TableSources = Tokenize-SQLString $TheScriptContent |\r\nforeach -begin { $StartIndex = 0; $TableSource = @() } {\r\n   if ($_.value -eq $Terminator)\r\n   {\r\n       $EndIndex = $_.Index + $_.Length;\r\n       if ($TableSource.count -gt 0)\r\n       {\r\n            #If we found a tablesource being used\r\n            @{\r\n              'TableSources' = ($TableSource | `\r\n                                   Sort-Object -Unique);\r\n                   'SQLExpression' = ` \r\n                             $TheScriptContent.Substring(\r\n                     $StartIndex, $EndIndex - $StartIndex);\r\n              }\r\n       }\r\n       #move past the terminator\r\n       $StartIndex = $EndIndex + $_.Length; \r\n       $TableSource = @(); #Zero out the list of table sources\r\n     }\r\n     if (($_.Type -like '*Part Dotted Reference'))\r\n            { $TableSource += $_.Value; \r\n         \r\n      }\r\n}\r\n$TableSources | foreach{ $_ | convertto-YAML }<\/pre>\n<p>This results in the following partial output (Includes the final expression):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"946\" height=\"549\" class=\"wp-image-100896\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-screenshot-of-a-computer-description-automatica-11.png\" alt=\"A screenshot of a computer\n\nDescription automatically generated\" \/><\/p>\n<h3>Colourising SQL Code<\/h3>\n<p>One can easily use the tokeniser to convert a SQL script into an HTML document (string) suitably colourised just like the IDE you use. This is useful for documentation and for quickly scanning code. At the moment, it is set for SSMS but this can be changed. It is easy to tweak it for the colourisation scheme that the developers are used to. The result can, of course, be viewed in a browser or browser-based IDE. The Title, the HTML header and the footer can be specified. This is a bit long to include with the article but the source is <a href=\"https:\/\/github.com\/Phil-Factor\/FlywayTeamwork-Pubs\/blob\/main\/Resources\/Convert-SQLtoHTML.ps1\">here as Convert-SQLtoHTML in Github<\/a>.<\/p>\n<p>It isn\u2019t hard to use.<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">Convert-SQLtoHTML -SQLScript 'Select * from The_Table' <\/pre>\n<p>If you look at the output, you can see that it has broken down each of the items, and then added font colour tags:<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">&lt;!DOCTYPE html&gt;\r\n&lt;html&gt;\r\n&lt;head&gt;\r\n    &lt;title&gt;The SQL Code&lt;\/title&gt;\r\n&lt;\/head&gt;\r\n&lt;body&gt;\r\n    &lt;pre&gt;&lt;font color=\"Blue\"&gt;Select&lt;\/font&gt;&lt;font color=\"black\"&gt; *&lt;\/font&gt;&lt;font color=\"Blue\"&gt; from&lt;\/font&gt;&lt;font color=\"black\"&gt; The_Table&lt;\/font&gt;&lt;\/pre&gt;\r\n&lt;\/body&gt;\r\n&lt;\/html&gt;<\/pre>\n<p>You can feed it a file as well. We can try it out on our sample PostgreSQL code that we used in the previous examples:<\/p>\n<pre class=\"lang:ps theme:powershell-ise\">$TheFileName = '&lt;MyPathTo&gt;\\PGPubsExerciser.sql'\r\n$TheScriptContent = Get-Content $TheFileName -raw\r\nConvert-SQLtoHTML -SQLScript $TheScriptContent &gt; \".\\colorised.html\"\r\nStart-Process -FilePath \".\\colorised.html\" <\/pre>\n<p>Which (depending on your configuration,) will display in your browser like this \u2026<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"718\" height=\"747\" class=\"wp-image-100897\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/12\/a-screenshot-of-a-computer-program-description-au-1.png\" alt=\"A screenshot of a computer program\n\nDescription automatically generated\" \/><\/p>\n<p>I find this very handy for object-level scripts, where each object is converted into its HTML form. It makes a very simple documentation of the state of a database that is easy to navigate. I tend to save the tokeniser output to make it easier to do subsequent searches for specific strings such as table references.<\/p>\n<h2>Conclusions<\/h2>\n<p>I use the tokenizer mainly for test purposes, for exercising individual queries from a longer SQL file that I can edit in a SQL Editor, and getting timings from them. I also use it for working out what tables are, and are not, being used within views and functions. However, it is also handy for a range of tasks such as formatting SQL or syntax highlighting it. I use it for getting and preserving the comments around <code>CREATE<\/code> statements so that the live database objects and columns can have comments attached to them. I suspect that there are other uses. It is an open-ended tool for the database developer who is a PowerShell user .<\/p>\n<p>Note: For an additional practical application of the tokenizer, check out the Redgate Product Learning article: \u00a0<a class=\"c-link\" href=\"https:\/\/www.red-gate.com\/hub\/product-learning\/flyway\/code-visibility-browsing-through-flyway-migration-files\" target=\"_blank\" rel=\"noopener noreferrer\" data-stringify-link=\"https:\/\/www.red-gate.com\/hub\/product-learning\/flyway\/code-visibility-browsing-through-flyway-migration-files\" data-sk=\"tooltip_parent\">Code Visibility: Browsing through Flyway Migration Files<\/a>. It shows how to use the tokenizer to colorize a set of Flyway migration files (in HTML, such as for web documentation).<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Imagine this: you have several directories full of SQL script files, and you need to know where a certain table is used. You\u2019d rather like the context too, so you can check the whole SQL Expression and work out why it is running so slowly. Maybe, from that same daunting set of several directories, you&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143539],"tags":[],"coauthors":[6813],"class_list":["post-100890","post","type-post","status-publish","format-standard","hentry","category-featured","category-theory-and-design"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/100890","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=100890"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/100890\/revisions"}],"predecessor-version":[{"id":100904,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/100890\/revisions\/100904"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=100890"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=100890"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=100890"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=100890"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}