Formatting SQL Code – Part the Second

When you're formatting SQL Code, your objective is to make the code as easy to read with understanding as is possible, in a way that errors stand out. The extra time it takes to write code in an accessible way is far less than the time saved by the poor soul in the future, possibly yourself, when maintaining or enhancing the code. There isn't a single 'best practice, but the general principles, such as being consistent, are well-established. Joe Celko gives his take on a controversial topic.

A few decades ago, I wrote SQL columns for the two trade press in-newsstand database magazines that were current at the time. The two titles were “Database Programming and Design” and “DBMS”, from Miller Freeman and CMP publishers respectively. In 1999, Miller Freeman merged with CMP Media (a division of United Business Media plc) and the combined publication became “Intelligent Enterprise”, which ceased publication in February 2007.

At one point, Chris Date and I had columns in the two magazines, opposite each other. We were sort of the RDBMS version of the 1937 mock radio feud between Fred Allen and fellow comedian Jack Benny.

It was at this point that we needed a standard SQL code format that would be used as a standard across the publications, and then later could be used in books. Chris and I, with the help of the editors, set to work to create a standard for code that would make it easier for our readers to understand what we were arguing about. The exercise was quite a learning-experience, even though I did some work at AIRMICS (Army Institute for Research in Management, Information and Computer Science) and already had a fair amount of research material on code readability, debugging, and the other bits and pieces of software engineering (I actually started writing columns on software engineering in the trade press before I became a SQL guru).

You might have thought that it would be easy to come up with a standard for laying out code so readers could ‘take it in’ easily. We couldn’t just clone existing standards from another language because SQL is unusual for several reasons: SQL was designed to have an unambiguous grammar, allow for multi-word ‘keywords’ (i.e., “INSERT INTO”, “GROUP BY”, “ORDER BY”, etc.) and to allow unary operators in math (leading negative and positive signs in expressions). We wanted a computer language that was close to the English language you’d naturally use for set-based operations, yet one that was easy to parse unambiguously. To keep us on the right track, one of the members on the original ANSI X3H2 committee had access to a parser generator, a really good one. Anytime we, in the committee, would pass a paper, Steve would run it through his tool at work and bring back the results to the next committee meeting. Our goal was explicitly to create an LALR(1) grammar. This would make it possible to build a simpler parser. Yes, we actually worried about stuff like this, but that is why SQL is such a great language.

If you would like some of the details, then read this article: https://en.wikipedia.org/wiki/LALR_parser

If you would like something with more details, then look at this article: http://web.cs.dal.ca/~sjackson/lalr1.html

If you really do not care about writing compilers, then the ‘Management-Overview’ version is that this grammar leads to certain typographical conventions that would not be otherwise necessary. The multi-word ‘keywords’ are unusual in programming languages. We are also the only language that has embedded-in keywords within a parameter list. If you are writing T-SQL, you will use “CAST(<expression> AS <data type>)”, but will not use the ANSI/ISO standard “SUBSTRING (<string expression> FROM <start position> FOR <length>)” and other options that do not match the old Sybase syntax.

All these factors mean that you can assume that you can transfer your ‘best-practices’ unchanged from other languages. SQL is unusual.

Here are my rules for formatting SQL code, with comments that link back to the principals involved that I outlined in the first part of this series.

Mother Celko’s SQL Formatting Rules

  1. Do not use color to display the code (Stroop effect). Besides making the code harder to read, you just cannot go throwing around eight or so different colored inks in a publication!
  2. Use a monospace font. This will let you create rivers to make the code easy to read vertically. I will get back to this later.
  3. The worst thing to do is to be inconsistent. There are many ways to be inconsistent. A data element should have one and only one name in the schema and preferably in the universe at large. This is why we build data dictionaries. Otherwise, your schema will fill with crap like “account_number”, “account_nbr”, “acct_nbr”, “acct_num”, and so forth.
  4. In general, avoid using abbreviations and defaults. The full keywords are easier to read and we do not have to save space on a punch card anymore. For example, write out DECIMAL(s,p) instead of DEC(s,p); INTEGER instead of INT; “CASE.. ELSE NULL END” instead of letting it default to an implied else clause.
  5. Uppercase all of the reserved words in SQL (Bouma principle) so they can be clearly seen. This is surprisingly easy because the ANSI/ISO SQL standards actually have a list of all the reserved words that you can cut-and-paste. You probably wanted to add vendor specific reserved words and you have to get those from your vendor’s documentation.

    There is another advantage to this. When someone uses a Pretty Printer with this list, and they have used a reserved word as a column or data element name of some kind, it will be flagged in the formatting by being shifted to uppercase. In particular, the newbie will see “DATE”, “VALUE” and other generic data element names sticking out of his code like a sore thumb. He can then go back and fix his bad design by following ISO 11179 rules.

  6. Lowercase all scalar values and column names. This goes back to the newspaper readability rules about lowercase type. When I see “SELECT distinct”, it looks like “distinct” is a column name and not a reserved word.

    You might notice the use of the underscore. ISO likes this because it is part of a required subset in Unicode of punctuation marks for representing ISO standards. Typographers like it because it looks like the baseline that we been writing on in our loose leaf binders and parchment manuscripts for centuries. Latin alphabet users will flow over the underscore, but camelCase and Pascal case, which use an uppercase letter to separate pieces of a name, make the eye jerk and stop. We found that this sort of thing can add 8 to 12% more reading time to debug code.

    The ISO-11179 rules use a “[<role>_]<attribute>_<attribute property>” syntax in which the <role>

    tells us how multiple copies of the same data element are used within the same scope. The <attribute> is the name of the data element involved, and the <attribute property> is drawn from a small vocabulary that classifies what the attribute is doing. For example, if I have a table called “Tree”, this would tell us that I have only one tree and it is so unique, so well-defined that I do not need to specify anything else about it. Clearly this is stupid; what we wanted for the table name was a plural or collective noun.

    The name “Trees” is weak, and I probably wanted the collective noun “Forest” instead. Within that table, I might have columns named “tree_id”, “tree_diameter”, “tree_species” and so forth.

    Please notice that an attribute property cannot be used as an attribute. This is why names like “code_id” are a code smell. This names are simply too generic to have any meaning (the Law of Identity from formal logic). Years ago in an article I made fun of this sort of thing by stringing together three or four attribute properties, as a reducto ad absurdum. The next year. I found someone had actually done this in real code.

  7. Capitalize the names of schema objects that are constructs, such as tables, views, sequences, stored procedures and so forth. Remember that users of Latin alphabets read capitalized words as proper nouns or in some way special. They will not confuse them with the scalar values in SQL.

    This convention also does some other work for you. If I have a table named “Employee” and a column named “employee”, this is legal SQL. But it is really, really bad coding. We should have made multiple name spaces when we were writing the standards for the language back in the 1980’s, but we did not.

  8. In general, avoid using abbreviations. The full keywords are easier to read and we do not have to save space on the punch card anymore. For example, write out DECIMAL(s,p) instead of DEC(s,p); INTEGER instead of INT; “CASE.. ELSE NULL END” instead of letting it default to an else clause.

    Consistency is very important. Do not mix abbreviations with full words in your code. For example, if you use LEFT OUTER JOIN, then do not use LEFT JOIN anywhere in the program text. Likewise, if you use LEFT OUTER JOIN, then use RIGHT OUTER JOIN and FULL OUTER JOIN to match the style.

  9. Do not indent more than three spaces within a block (Law of Proximity). The readers I will get lost if he has to make huge jumps across the page. Likewise, do not set text in long lines. This is why newspaper columns are made up of relatively short lines of text. The rule of thumb in typography was that a newspaper or magazine column should be no more than 2 ½ times the length of the alphabet sets in the type font being used.

    If you have a long line of code, then split it on multiple lines. Sometimes a small indent of one to two spaces on the second and following lines will help. But be sure to end the line where the split occurs with language token that lets the reader know that there is more to come. This is why putting a comma at the front of the line is a bad practice; when you see a comma you know there is something coming after it. Likewise, when you see a semi-colon, you know it is a terminator. That lets the reader mentally close up that unit of code, “clear his buffers” and parse the next statement.

An Example

We’re going to start with code that hasn’t been formatted well, and improve it in stages. I want to use an example that is well understood, so I picked relational division. Relational division is one of the eight basic operations in Codd’s relational algebra. The idea is that a divisor table is used to partition a dividend table and produce a quotient or results table. The quotient table is made up of those values of one column for which a second column had all of the values in the divisor.

We have a table of pilots and the planes they can fly (dividend); we have a table of planes in the hangar (divisor); we want the names of the pilots who can fly every plane_name (quotient) in the hangar. To get this result, we divide the Pilot_Skills table by the planes in the hangar.

Here is the worst DDL and DML I can imagine:

It is in the proportional font, in all uppercase, crammed together as a continuous unreadable blob of text. There are needless parentheses, too. If I forgot an error, it is only because I was getting sick while doing this. Please don’t think I’m exaggerating: I have actually seen stuff like this posted on forums and in production.

Change Font, Add Line Spacing and Capitalization

Let us make a couple of basic changes. First, use Courier New for the text. Next, lower case column names and scalar variables, but keep reserved words in uppercase. Let us actually space out the text; the space around the equal signs, a space after commas, etc. Basically, follow the rules you would for English text; that is how people have been trained to read Latin alphabet text.

Next, we capitalize the table names. This will make them be seen as “special units of text” and not scalar or column names. Notice that the open parentheses in a data declaration are placed next to the keyword of the data type. This is the Bouma principle; they are actually part of unit. Another thing about open parentheses is that they announce subclauses in SQL. For example, a subquery is always enclosed in parentheses.

Overuse of parentheses is often a sign that the programmer came from a different language, with different parsing rules. Or that he has crammed his code text so close together, he needs to have a physical separator rather than being able to easily read it.

Languages which require a lot of parentheses are usually low level systems languages (LISP is the exception). Overusing parentheses results in a really awful debugging session. You will need to get a highlighter, and highlight matching pairs of parentheses. A good SQL programmer will use spacing to aid the debugger (remember someday you will be the poor bastard who has to go back over sloppy code; a professional writes code that is merciful to those that come after him).

This is clearer, but we’re not finished yet.

Add the Rivers and Indents

We come to the tricky part in formatting SQL code. We need to add rivers and indentation because SQL code is two-dimensional, not one-dimensional like text. Look at the Pilot_Skills table; it has two columns shown on one line each. The definition of a simple column should be written on one line because they are easily read as a Bouma. The PRIMARY KEY constraint is indented just one space to show that it is a separate unit of code. The close parentheses and semi-colon can be put together because they are terminators that tell the reader where the end of a unit of code is.

There is a line space between the table declarations to give clear separation. Again, the simple single column can get all of its information in one line. If we had CHECK() constraints, REFERENCES, DEFAULT or other DDL clauses, we would have gone to a new line and indented a single space.

Finally, the DML query needs explaining. The keywords SELECT and DISTINCT are together because they are actually a single unit in the SQL grammar. The outermost SELECT-FROM-WHERE construct needs a river to “anchor” those three clauses vertically

When we drop down into the level of the outermost WHERE clause, we have a NOT EXISTS() predicate. This is one of the few predicates in SQL that has a query as its parameter. Subqueries are always enclosed in parentheses – – we made sure of that when we were defining the ANSI standards. So to be consistent, the parameter must be formatted with the same convention of a river on the SELECT-FROM-WHERE construct (the consistency principle). Repeat this logic for the next nesting of an EXISTS() predicate.

Please notice how the “AND” appears at the left-hand shore of each river in its subclause. If you put it at the end of the line, the reader would expect to see something immediately after it. This is a word that announces something to follow, rather than terminating something.

Notice how the keyword GROUP is on the left-hand shore of the river, but the other part of the syntax, BY, goes on the right shore. These extra words are a peculiarity of SQL grammar and do not always add meaning to the code. I like to think that they add to the readability of the code.

Here is the final formatting:

I think you will agree that this is considerably easier to read, and therefore to understand, than the original solid blob of SQL.