SQL Code Layout and Beautification

William Brewer takes a look at the whole topic of SQL Code layout and beautification, an important aspect to SQL programming style. He concludes that once you are tired of laying SQL out by hand, you had better choose a tool with plenty of knobs to twiddle, because nobody seems to agree on the best way of doing it

The problem

Badly-formatted SQL is the bane of my life. That is because my life is spent staring at SQL Code. Other peoples SQL code often looks ridiculous to me. As there is no common standard for laying out TSQL, even for SQL Server, we all have strongly-held opinions on how to lay out code. I’ve yet to find two database developers who agree in detail on how it should be done. I’ve seen entire projects written in lower case, with Javadocs headers, others done entirely in upper case with hardly any line-breaks.

The reason that every database developer has such strongly held views is that he gets used to a particular format. I find that I can read SQL code that is laid out to my taste far faster than SQL code written inconsistently or to one of the wilder fringe-formats. Actually, it is worse than that, going back to look at old code that I’ve written can be a painful experience if the old code is written to a style that I’ve since discarded.

The worst experience a database developer can suffer is to be obliged to support a SQL application that has been written by someone else. It is difficult to look a programmer in the eye after you’ve had to maintain his SQL code. A SQL Code Beautifier is the only way to stay sane and keep friendships.

Some solutions

I was brought up with CB, the C beautifier, and BCPP the C++ beautifier. Almost every language has a beautifier. Some text editors such as Visual Slickedit even have beautifiers built-into them. I’ve never found a quick and simple way of beautifying SQL code. Just in case you don’t know, a code beautifier just makes the best of what you produce. It doesn’t criticize. A ‘Lint’ will tell you what’s bad about your code, telling you about things like deprecated syntax, unused or uninitialised variables, errors, dangerous techniques and so on.

There are a couple of unique problems with beautifying SQL.

It isn’t easy to lay out SQL code since the language was designed to be easy for mere mortals rather than conformant to the standards of procedural languages: an example of this is that certain keywords mean different things in different contexts. The only way for a utility to understand SQL well-enough to lay it out to reflect its meaning is to parse it. A TSQL parser isn’t the sort of thing you knock off on a wet Friday.

A lot of shops, especially in the larger enterprises, have existing standards for SQL code, and any beautifier has to be configurable to comply with these standards. Besides Joe Celko’s brave attempt at publishing a standard for the layout of SQL code, you’ll find that code that meets the standards of one company is no dice with another one.

What tools are there?

SSMS/Query analyzer add-ins

I challenge you to find a better TSQL code beautifier than Red-Gate’s strangely-named SQL Refactor. This is a fairly easy challenge because there are very few others on the market, and many of them seem to die in mysterious ways, crash the system, or have stupid restrictions on their evaluation versions. One recent arrival is SQL Enlight, which is surprisingly like parts of  SQL Refactor’s beautifier, It needs work before it is serviceable, although it is improving rapidly,   Its worst fault is that it makes silly mistakes such as trying to layout strings and comments with tabbification! It has the start of a Lint facility as well as a beautification, but it is not as advanced as SQLTAC’s.

I suppose I’m easily confused, but I originally came across the Beautifier in SQL Refactor almost by accident. I’d been using SQL Prompt as a sort of primitive beautifier in the Query Analyser  when I programmed in SQL Server 2000.  (It will do simple things like putting keywords in upper-case). SQL Refactor wouldn’t work in the Query Analyser whereas SQL Prompt worked with everything, so I didn’t then pay SQL Refactor the attention it deserved. When I got to SSMS and SQL Server 2005, I discovered the delights of SQL Refactor, and so I stopped using SQL Prompt. SQL Refactor did everything I wanted.

Red-Gate have, at long last, released a version of SQL Prompt, SQL Prompt 3.8 Pro, that has SQL Refactor’s excellent Beautifier. It will now lay existing SQL code, as well as the code you are typing in. This is more exciting than you might think, because SQL Prompt will work in Visual Studio and Query Analyser as well as SSMS, so all those poor folks stuck in those corporates that still haven’t moved to SQL Server 2005 will, at last, have a decent SQL Code formatter!

Online tools

Wangz Instant SQL Formatter. This performs the function of Beautification and prettification. It has the great advantage of being free. SQLInformm is another one which has interesting features.

There is some confusion between a Beautifier and a Prettifier. A Prettifier, such as  Simple-talk’s Prettifier, is a program that renders a block of code so it can be published. (printed, put in a website, or intranet, or into PDF). It doesn’t attempt to rearrange the code but merely render it as the programmer intended.

The problem with online Beautification is that it is more of an effort for the programmer. There is a lot of difference between merely clicking a menu item within your chosen IDE, and having to cut and paste into a website. The worst problem, though, is that they generally don’t remember your detailed preferences between sessions.

The task

So what does a beautifier do?

Any SQL Code beautifier must parse the code to work out what each token means in its context. (AND , or a bracket, for example) It then lays out SQL code as follows:

  • To maintain a consistency in the way that space is laid out between statements.
  • To group together distinct parts within the statement. Simple statements are generally moved onto a single line and complex statements are divided up logically so that each part of the statement is separated with a line. Any white space preceding a semicolon that ends a statement is generally removed.
  • To enforce consistent indentation
  • To enforce whatever rules you decide for the capitalisation of tokens

The problem with attempting any general rules about layout is that SQL plays a range of roles, such as DML and DDL. Not only that but it allows complex expressions and supports both blocks and conditional statements. The rules you use for laying things out will be different for different purposes. Not only that, but because a single Select statement can go on for pages or could be no more than a handful of tokens, your rules need to be different for different types of Select statement.

A good example of this problem is the way that parentheses are used. In SQL, parentheses are used in a number of different contexts, such as mathematical expressions, schema statements, function parameters, derived tables and subqueries. For the purposes of layout, it is important to distinguish the types of parentheses. Where, for example, parentheses are used in expressions to enforce the order of evaluation, or for the parameters of mathematical/logical functions, you might wish to treat them differently from the parentheses that delimit a subquery. If you try to create general rules for parentheses, then your code is bound to look odd.

Fortunately, there are a few general rules about beatification that we can deal with.

General rules

Commas

Commas should be used in exactly the same way as in English grammar, there should be no space before the comma, and one space after the comma to separate it from the next token.


SELECT Orders.ShippedDate, Orders.OrderID, “Order Subtotals” Subtotal,
        
DATENAME(yy,ShippedDate) AS YEAR
  
FROM Orders

Commas should never be at the start of lines. Surprisingly, there will be those that disagree, and so it is possible to have a space before the comma, start lines with commas and so on. You can even have the commas lining up with the start of the select statement if the mood takes you. I’ve come across SQL Layout conventions that require that the comma is left aligned with the statement start or that the comma is placed close to the column. It should be possible to specify the minimum number of items that the list should contain before it is split into several lines, or the maximum number of columns before a line break is inserted after, or before, the comma.

Indentation

When formatting multiple-line statements, the first line is normally not indented as far as the subsequent lines, so it is easy to see the start of the statement..


SELECT Employees.Country,
      
Employees.LastName,
      
Employees.FirstName,
      
Orders.ShippedDate,
      
Orders.OrderID,
      
“Order Subtotals”Subtotal AS SaleAmount
  
FROM Employees INNER JOIN
    
(Orders INNER JOIN “Order Subtotals”
            
ON Orders.OrderID = “Order Subtotals”OrderID
        
)
  
ON Employees.EmployeeID = Orders.EmployeeID
  
WHERE Orders.ShippedDate BETWEEN @Beginning_Date AND @Ending_Date

Statements that contain a block of SQL code to be executed (such as IF statements) are indented to group the block.

You can set up the number of spaces to use for a tab stop, though conventional wisdom suggests that three spaces is the best setting. Some people like to indent statements by the number of the characters in the initial keyword plus one character, though this seems to me to spread code too widely across the page without adding much to its clarity. It just panders to the mystical urge to make things line up.

Wrapping

Just like the text on the page, SQL code is likely to need to be wrapped. I generally like to wrap SQL code at around the 80 column point, just so it is visible on the printed page when printed in 12 point Courier

Capitalisation rules

The tokens in SQL Server can be categorized as follows

  • Reserved words (e.g. SELECT, DECLARE, MESSAGE_LENGTH, ROLLBACK)
  • Scalars (column names, parameters, variables)
  • Schema object names (names of tables, views and stored procedures)

Although you will find fanatics who will insist that everything should be in lower case, it is more likely that the three different types of token will need to be handled differently. A Beautifier should allow the user to specify the way that you apply rules about use of capitals and lower-case.

I’d like to be able to choose from the following options

  1. Make it lower-case
  2. Make it upper-case
  3. Make it lower-case but for an initial capital
  4. Take the style from the initial declaration or existing database schema.
  5. Use the style from the schema, but enforce an initial capital and a capital after a _ character

You’ll note that I don’t mention CamelCase because there is no way of enforcing it. If, however, your definition of a stored procedure goes like:

ALTER PROCEDURE [dbo].[SalesByYear]
  @Beginning_Date
DATETIME, @Ending_Date DATETIME AS
—-do something clever here

Then every time your code mentions the procedure, the capitalisation should be changed to the way it is initially done in your definition, along with its variables. (SQL Prompt is helpful for doing this)

So that…

EXECUTE salesbyyear @beginning_DATE=‘1 jan 2000’,
              
@end_DATE=‘1 jan 2000’

…Gets changed to…

EXECUTE SalesbyYear @Beginning_Date=‘1 jan 2000’,
              
@Ending_Date=‘1 jan 2000’

The same should go for any other DDL

My take on the rules is as follows.

Reserved words.
These should be in upper case so as to make it easier to pick them out from a large block of SQL..
Scalars.
These look better in Lower case, unless they represent an abbreviation
Schema Object names
I prefer CamelCase for object names, with an initial capital, because they ought to be one word and one can pack a lot more readable description into CamelCase whilst making it obvious that it is one token.

Beautifying expressions.

You may want to reformat string, or numerical, expressions in DML statements. It helps to distinguish this sort of expression from the conditions in a ON, WHERE, HAVING clause because of the particular keywords such as BETWEEN, and the slightly different syntax.

Operator placement

Some styles demand that a line be broken when a particular operator token (e.g. AND, OR or NOT) is used. If so, then the break should be either before, or after, the token, depending on the style chosen

Operator spacing

Ideally, we should be able to decide on separate rules for arithmetic, comparison, unary, assignment and logical operators. These should, some believe, have a space before and after them. It should be possible to leave them alone, to enforce spaces or to remove them. Should there be a line break before or after them, and should the line break only happen at a particular nesting level in the expression?. What should be the minimum number of logical operators in a logical expression before the line break is inserted, and which type of logical operator should trigger a line-break?

Parentheses

Opening Parentheses within expressions should, in some layouts, be on a new line. However, it could be that the contents of parenthases should be on a new line, and suitably indented by placing the line break after the opening parenthesis The closing parentheses can be handled in a number of ways. The closing parenthesis is placed on a new line, or the line break is added after the closing parenthesis. There should be an option to leave the formatting of parentheses alone.

If parentheses are given a line-break, then how should the contents be indented?

Now this isn’t particularly straightforward. How should you format the following SQL Statement which is basically a mathematical expression? I doubt if you will find two programmers to agree, so the rules have to accommodate this.

— Calculate the Correlation between two normally-distributed
— variables contained in two columns in a table
SELECT Correlation=
    
(
  
SUM((FirstVariable*1.0 AVG(FirstVariable*1.0))
            * (
SecondVariable*1.0AVG(SecondVariable*1.0)))
       /
COUNT(*)
  )
    / (
    
STDEVP(FirstVariable*1.0)*STDEVP(SecondVariable* 1.0)
    )
FROM MyTable

Beautifying Schema (DDL) Statements

There should be a number of options that act on data statements such as SELECT, INSERT, DELETE, and UPDATE

The length of a data statement will vary enormously, and the way that you will want to format it will be different according to its length. There will, at least, be a length below which the statement should be in one line.

Parentheses

How should parentheses and definitions in CREATE and ALTER statements be formatted? Should opening parentheses be on a new line? Should the first definition be placed on the line below the opening parentheses. If you believe that either is correct then should you indent the parenthesis and the definition, the contents of stored procedures, functions, DDL triggers, or DML triggers, or should you align them with the start of the statement?.

You should be able to opt to pad the inside of parentheses with spaces, or remove any existing padding. It is better not to add padding inside parentheses, around function parameters, column alias lists, and data types.

Column placement

It is often the practice to place the first column of a list on a new line, and many who will argue that every subsequent column in the list should be on a new line. I would only add a new line if, by leaving the column list item where it was, it exceeded the column width, in much the same way that text is wrapped. Some people argue that one should line up the columns on the end of the Select keyword. I find this a strange practice quite different from general programming practice, but any Beautifier has to accommodate the practice.

There is also the issue of how commas are treated. I prefer to handle the same way as in written English, but some people like to insert a space before, as well as after, a column so as to make each column definition stand out.

Join placement

The keywords that form the various parts of a multi-line DML expression should line up if they are on separate lines so it is easy to read the structure of the statement. In particular, the JOIN keyword and the FROM keyword should align, and the FROM should generally be on a new line. The JOIN conditions should probably be on a new line indented from the JOIN keyword,

You may want to specify whether the Join condition is placed on a new line or on the same line as the join. And whether it is indented. Should it be aligned with the statement. Or the table sources in the join list. How are you going to tackle the indentation of nested joins?

Beautifying DDL (Definition) statements

Schema statement layout rules control a set of options that can modify the way schema statements are written. Actually, I would only include tables and views in this as these are the only cases where special formatting is required.

Column definitions

The main formatting issue that is special to DDL is the list of columns. How should they be indented? Because they are never nested, this is a case where I prefer to indent rather more generously than in a routine, where there is likely to be far more levels of indentation.

With table definitions, it it is customary to always enforce a line break after each column definition . However, this is not always the case.

Parentheses

How should parentheses and definitions in CREATE and ALTER statements be formatted? Should the opening parentheses be on a new line? Should the first definition be placed on the line below the opening parentheses? If you believe that either is correct then should you indent the parenthesis and the definition, the contents of stored procedures, functions, DDL triggers, or DML triggers, or should you align them with the start of the statement?.

You should be able to opt to pad the inside of parentheses with spaces, or remove any existing padding. It is better not to add padding inside parentheses, around function parameters, column alias lists, and data types?

Here is how I’d choose to format a simple table definition. I doubt if you’d get even a majority to agree that this is the right way

CREATE TABLE dbo . Categories
  
(
  
CategoryID INT IDENTITY(1,1) NOT NULL,
  
CategoryName NVARCHAR (15) NOT NULL,
  
Description NTEXT NULL,
  
Picture IMAGE NULL,
  
CONSTRAINT PK_Categories PRIMARY KEY CLUSTERED
    
(
      
CategoryID ASC
    
)
  
WITH
    
(
    
PAD_INDEX = OFF,
    
STATISTICS_NORECOMPUTE = OFF,
    
IGNORE_DUP_KEY = OFF,
    
ALLOW_ROW_LOCKS = ON,
    
ALLOW_PAGE_LOCKS = ON
    
)
  
ON PRIMARY
  
)
ON PRIMARY TEXTIMAGE_ON PRIMARY

This is why it is so important that any SQL Beautifier should be able to have lots of knobs to twiddle to allow you to get it how you want it.

Conclusions

I can’t see any chance in a general consensus on how we should format SQL. There isn’t even a real detailed consensus in languages like Java. In the meantime, SQL Beautifiers that lay out SQL in a particular style, and to particular rules, should allow the programmer to change the rules to suit him/herself and to come with a whole range of ready-made styles. I’d suggest that the same is true of a SQL Lint, when anyone is ever brave enough to come out with one.