Product articles Redgate Flyway Database Testing and Quality
Creating Custom Regex Rules for Code…

Creating Custom Regex Rules for Code Analysis in Flyway

Every development team has its own coding standards. While Flyway provides built-in rules focused on security and data protection, teams quickly find they need additional checks to maintain SQL code quality and consistency across their projects. This article will walk through the ways of defining and testing custom rules, provides a starter pack of custom rules for common code smells, and then explains how to manage your rules efficiently as your rule library grows.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Code smells are the mistakes, omissions, and vulnerabilities in the SQL code that makes it harder for a team to subsequently modify and extend the work and can also lead to unpredictable behavior and performance. They are one measure of the extent of the ‘technical debt’ in existing code. If you remove these issues before they enter the main migration chain, you’ll experience fewer deployment issues and, over time, reduce maintenance costs.

Flyway will help you detect these SQL code smells by running static code analysis on the database and the Flyway migration scripts, using the check -code command. This command uses a SQL linter called SQL Fluff for parser-based analysis and regex rules that allow teams to provide custom checks to analyze SQL migration scripts. It produces a JSON report of any rule violations.

This article is about the regex rules. My previous article described the basics or running the rules, and this one is about writing custom rules, and how to grow your regex rule library in a structured and manageable way. Regex search is a relatively crude instrument because cannot parse SQL syntax in the way a SQL linter does. They treat SQL as plain text, meaning they can’t distinguish between keywords, identifiers, literals, comments or operators. A string may be perfectly legal as a literal but not as an operator; the type of token is relevant. However, the GREP-style Regex search is an easily customized way of indicating parts of SQL Code that need ‘checking out’.

What code smells do we need to find?

Flyway provides, as examples, a set of built-in rules, focused mostly on security and data protection, such as detecting DROP TABLE, overly permissive GRANT statements, or missing WHERE clauses in UPDATE or DELETE commands.

The built-in rules provide a foundation, but every team has its own standards for ensuring SQL is clean, efficient, and consistent across projects. They will quickly want to extend these rules to enforce and maintain their standards for:

  • Code style and maintainability – such as naming conventions, code headers and comment, avoiding SELECT *
  • Performance best practices – e.g. discouraging non-SARGable queries
  • SQL hygiene – e.g. catching forgotten PRINT statements or TODO comments
  • RDBMS-specific standards – SQL Server teams banning SET FMTONLY ON, PostgreSQL teams flagging USING in joins

It is important to note, though, that SQL is not a procedural language. Unlike procedural languages, which define a sequence of operations to be executed, SQL focuses on declaring what data to retrieve or manipulate and leaves the optimizer to determine the most efficient way to perform that operation. This means that team standards should not get too involved in code formatting as such.

Crafting custom regex rules

Writing effective SQL regex for is not always straightforward. SQL is a structured language, but regex treats it as plain text, meaning false positives can occur if you’re not careful. There are also subtle differences between various types of regex engine that can trip you up.

Flyway uses a Java regex engine, and there are a few nuances that might catch out any developers coming from .NET regex. For example, although Java’s regex engine works with single \ just like .NET, Java string literals require escaping \ as \\ before sending the string to the regex engine.

Consider the regex that Flyway uses to detect use of SELECT *:

When reading the rule from a TOML rule file, you need to escape the backslash because although Java regex works with a single \, but Java requires the double \\ for string literals, so Flyway (being written in Java) requires the double \\, to read in the TOML values correctly:

If you’re storing rules in a JSON file then generating TOML files dynamically (as I demo later) then you need to do double-escaping, because the backslash is a JSON escape character:

If you’re working with basic regex patterns, Java and .NET are nearly identical, but for complex parsing tasks, .NET is better due to its advanced lookbehind, balancing groups, and inline conditionals. Regardless of regex engine, there are also certain regex patterns to be aware of, such as use of greedy quantifiers like * (wildcard match), that in large scripts can cause excessive backtracking and performance issues.

Developing and testing your rules

I try out all my regexes using a Regex IDE (RegexBuddy), because that allows me to program in the Java dialect of Regex. I can develop the regex in the top section and use the SQL code in the bottom section to test ideas with some sample queries.

As an example, a development team might decide to create a Flyway regex rule to verify the presence of the structured headers in SQL migration files. These headers would detail the purpose of the migration script, the author, and the creation date. A sample regex pattern to check for these headers, formatted as block comments (/**/) at the beginning of a SQL script, could look like this:

Or, if the team prefer single-line comments:

This GREP-style regex search can be customized further to meet specific team requirements and help maintain consistent documentation across all migration files.

Here we are testing out this rule in Regex Buddy:

RegexBuddy

I have plenty of bad SQL, because I do a lot of trial work where I need to try out a database design very quickly. I don’t scrap the code because it is handy for this sort of task.

Building your custom regex library

It is a slippery slope to try to enforce too many rules, because nobody ever agrees on what standards are required. However, there are a few sensible regex rules you can enforce to ferret out some of the common code smell offenders, as well as to help maintain consistency across scripts, making it easier for teams to review and troubleshoot them.

To get you started, and to delve into the nuances or creating and maintaining new rules, I’ve created custom regex rules for a section of code smells from my SQL Code Smells book (there is also a GitHub site containing a GitHub version of the book). The following table summarizes those that we can sniff out with custom regexes. These are just examples of the sort of smells you can catch; they work but one or two (like “Non-SARGable Queries“) use wildcard (*) searches that may not perform well  on large SQL scripts. You’ll need to test and potentially refine these examples, based on the size and types of scripts you need to analyze.

You can find the individual rules files (.toml) in my Flyway GitHub project. However, I prefer to maintain the rules in a single JSON file and then auto-generate the individual TOML files using PowerShell (more on this shortly):

Name Dialects Rules (Regex) Description
SELECT Star All (?i)SELECT\s+\* Avoid using SELECT *; specify columns explicitly.
Implicit Joins All (?i)FROM\s+\w+(\s*,\s*\w+)+ Use explicit JOIN syntax instead of comma joins.
Deprecated outer join syntax Oracle, TSQL \(\+\) Old-Style Outer Joins – Proprietary syntax (e.g., Oracle’s (+)).
Unqualified Column Names in Joins All (?i)SELECT\s+?([]\[\w]+[,\s]+)+FROM\s+?([]\[\w]+\s+?)+JOIN If a query involves a join, it is safer to specify the table origin of a column
Be Explicit About JOIN Type All (?i)\bJOIN\b Use INNER JOIN, LEFT JOIN, RIGHT JOIN explicitly.
Avoid RIGHT JOIN All (?i)RIGHT\s+JOIN Prefer LEFT JOIN for better query readability.
Non-SARGable Queries All (?i)WHERE\s+.*(upper|lower|substring|replace|coalesce|mod|round|nullif) Non-SARGable Queries
UNION instead of UNION ALL All (?i)UNION(?!\s+ALL) Use UNION ALL unless explicitly removing duplicates.
No WHERE Clause on DELETE or UPDATE All (?i)(DELETE|UPDATE)\s+(?!.*\bWHERE\b) No WHERE Clause on DELETE or UPDATE
Primary Key Naming (PK_) All (?i)\bCONSTRAINT\s+(?!PK_)[a-zA-Z0-9_]+ Primary key constraints should start with ‘PK_’.
Foreign Key Naming (FK_) All (?i)\bCONSTRAINT\s+(?!FK_)[a-zA-Z0-9_]+ Foreign key constraints should start with ‘FK_’.
Index Naming (IX_) All (?i)CREATE\s+INDEX\s+(?!IX_)[a-zA-Z0-9_]+ Index names should start with ‘IX_’.
Avoid Short Table Aliases All (?i)FROM\s+\w+\s+[a-zA-Z]$ Use meaningful table aliases instead of single letters.
ORDER BY without LIMIT/OFFSET All (?i)ORDER\s+BY\s+[^;]*$ Inefficient for large datasets without limits.
SET FMTONLY ON (SQL Server) TSQL (?i)SET\s+FMTONLY\s+ON Deprecated in favor of sp_describe_first_result_set.
small VARCHAR sizes All (?i)VARCHAR\s*\(\s*[1-3]\d?\s*\) flag small VARCHAR sizes (suggest CHAR for small fixed-size strings)
ToDo_Test All (?i)toDo|toTest|tearDown possible unfinished alteration or Test artefact
End-of-line comments without space All –\\S end of line comments without a leading space (Style not error)

There are several ways where one might usefully expand this even further. We can, for example, also use regexes to search for deprecated features, or superseded features such as RAISERROR (we should use THROW) or CONNECT BY (rather than WITH RECURSIVE). It can be used with features that the team would like to discourage, such as ENUM and SET types.

Maintaining your regex library

Flyway doesn’t currently support TOML files that contain a collection of rules so each rule must be in a single TOML file. At the time of writing, anyone working on Windows or using PowerShell had to write rules explicitly in the Windows-1252 text encoding (or else UTF-8 without BOM) to avoid a complex Java error. This problem is fixed in Flyway 11.4.1 and later.

It is no fun to actively maintain a large set of rules when each rule requires its own file. Regex rules don’t just “stay written”. You have an infrequent but necessary maintenance role whenever you hit a shortcoming with a rule. To make this as easy as possible, I prefer to have a single place where I can store, edit, develop and export Regex rules, such as in a JSON document, or in an application that can output in JSON:

I keep all my custom regexes in a JSON document and generate these as single TOML rule files, within in a script, when I need them. I don’t add the built-in rules because those might be changed in an updated version. It is safer to load in the ones you want and check for any changes when you update.

Auto-generating rule files from JSON

Generating rule files from JSON

Instead of maintaining dozens of individual TOML files, we store all rules in a single JSON file and generate TOML files dynamically when needed. In my Flyway Teamwork GitHub project, I’ve provided a single Rules.json file, containing a selection of custom regex rules from the previous table.

We store this single JSON store of rules within the Flyway project, in a subdirectory of the main branch. I use ‘Scripts‘ but its name will be defined in the $ProjectLevelResources variable. Don’t put it in your rules directory because Flyway tries to read any file, not just a TOML file, and then ‘errors out’ because it’s JSON not TOML.

The following PowerShell script will read the single JSON file and write each rule as an individual TOML file to the Rules directory (ensuring Flyway-friendly encoding in Windows-1252). It deletes any existing version before saving the new one.

Adding and editing rules

When we want to add a new rule to the archive or edit an existing rule, we do it in JSON:

Editing Rules in JSON

We then use the previous PowerShell to regenerate the individual files. Each backslash character in the regex has to be repeated four times to get past the Java import routines, as I described earlier.

If Flyway progresses in later releases to allow all the rules in one TOML file, we can easily generate the TOML from our JSON file of rules, like this:

I’m using the convertTo-TOML from the Flyway Teamwork framework in GitHub.

Testing out our rules

First copy all the built-in Flyway rules from the install directory to a Rules subdirectory in the Flyway project. This must be accessible from all branches of your project. If you’ve added new rules to your JSON rules document, or edited existing rules, you must remember to run the PowerShell script to regenerate TOML files, as described above. To test the rules, I’m using the same SQL assault course that I provided in the previous article. It is geared for SQL Server. Just save it to the migrations folder, with a version number in the name that is higher than the current version.

We’re now ready to run Flyway’s check -code command and inspect results.  Flyway needs to know what dialect of SQL you’re using, so we need to create an environment (the connection details for your database). Flyway will get the dialect from the URL of the current environment to work out which dialect of SQL you’re using. Please refer to my previous article or to the documentation for further details.

I’m saving the connection details and credentials for each branch database in individual TOML filers in the secure user area in Windows. If you are using TOML environments, then instead you’ll just need to define your database ‘environment’ in the flyway.toml file, indicate which environment Flyway should use as a parameter or environment variable, and use resolvers on credentials or connection details within the flyway.toml file.

So, we run the dummy SQL files and get the following violations reporting in the code smells report.

Code smells report

You’ll see that the $SQLSmells array can be checked by any pipeline process, or fed into an alerting system before the migration is run. There is also an HTML report generated for Flyway Desktop, if prefer that for visual checks. I prefer to be notified via the iPhone, so I’ll settle for the ‘SQL Code smells’ output in CSV or JSON format that can be read by a notification system.

Conclusion

I hope that I’ve given you a flavour of my views about a useful tool that replaces all the GREP clutter that we seasoned database programmers tend to accumulate. What’s there to like? Well, you get a checklist of SQL code that might just need a bit of tidying up before your careless code gets enshrined in a migration. Because the checklist is in a structured document (JSON) it can be read accurately by other participating CLI applications in the release pipeline.

Unlike the SQL Fluff code analysis, a Regex check can be configured exactly to the requirements and styles agreed by the team doing the development. This will include security and operational concerns as well as development issues. Sure, this iteration of the tool has a few rough edges, such as the strange requirement for the text encoding of the TOML rules, but there is a place for a simple and configurable system for doing regex searches for coding issues.

Tools in this post

Redgate Flyway

DevOps for the Database

Find out more