When you learn a new language, you tend to keep the grammar of your old language and just change the words. Consider German versus English word order; “Bitte mich morden nicht”, which is literally, “Please me murder not”, but better expressed as “Please don’t murder me” in English. English really likes a <subject><verb><object> structure to its sentence. German tends to use <subject><object><verb> structure in its sentence.
In Esperanto, we have “Bonvole, ne murdas min” or “Bonvole, min ne murdas” since the parts of speech are shown by suffixes (-n = object, -e = adverb, -as = present tense verb). When you see the first version, the word order gives you a strong hint that this Esperanto student was an English speaker; the second form implies a German speaker.
The same principle applies to SQL students. They make systematic errors based on their prior languages. Random errors are rare.
Procedural Control Flow versus Declarative Code
Classic Structured Programming is based on three constructs; IF-THEN-ELSE, WHILE-DO and BEGIN-END. If you are a math major, you know these constructs match to operators in primitive recursive function theory. You can compute certain classes of algorithms with primitive recursive functions. If you are not a math major, then you don’t care about all of that; you are happy that these constructs let you write neat, clean programs that are easy to write, easy to optimize and easy to maintain (if you do it right).
Structured Programming is the only way most programmers have coded for the last 40 years, so they don’t give up that mindset easily. Just Google any SQL forum for the phrase “CASE Statement” and you will lots of hits. There is no such thing in SQL; we have a CASE expression instead. An expression returns a scalar value and has nothing to do with control flow. Here is a typical example of one of these postings.
1 2 3 4 5 6 7 8 9 |
SELECT author_name, book_title, book_price FROM Bookstore_Inventory AS I WHERE CASE WHEN @in_shop_category = 'Mysteries' THEN I.author_name -- wrong! IN (SELECT S.author_name FROM Best_Sellers AS S WHERE S.shop_category = @in_shop_category) ) ELSE I.shop_category = @in_shop_category END; |
The CASE was supposed to change the logic in the WHERE clause. The intent was probably something like this:
1 2 3 4 5 6 7 8 9 10 |
IF @in_shop_category = 'Mysteries' SELECT author_name, book_title, book_price FROM Bookstore_Inventory AS I WHERE I.author_name IN (SELECT S.author_name FROM Best_Sellers AS S WHERE S.shop_category = @in_shop_category) ELSE SELECT author_name, book_title, book_price FROM Bookstore_Inventory AS I WHERE I.shop_category = @in_shop_category; |
You will also see IF and IFF() dropped into SQL in the hope that the compiler somehow knows languages that use those constructs.
Lasagne versus Spaghetti Code
This is based on a posting on a forum. The problem was to read in raw data. The records came in three types (header, body and terminator) which have to be grouped together, then loaded as a complete fact in a single row of twenty columns. The gimmick is that the rules for assigning values to the columns are tricky and depend on each other.
The approach the poster used was classic COBOL. He declared twenty local variables for each output column, split the input into that working storage, and then wrote IF-THEN-ELSE logic with assignment statements. When the row was ready, he wrote a loop to insert it into the output table.
Here is a skeleton of some of the code. Do not worry about the particulars; look at the basic approach.
1 2 3 4 5 6 7 8 9 10 |
/* statement to edit field 3*/ IF @local_col_1 = 'F3' BEGIN IF B.rec_nbr = 1 AND B.msg_type = 3 SET @local_col_3 = 2 ELSE IF B.rec_nbr > 1 AND B.msg_type = 3 SET @local_col_3 = 3 ELSE IF B.rec_nbr > 1 AND B.msg_type = 4 SET @local_col_3 = 4 END; |
The most obvious fix is to mechanically re-write the IFs to CASE WHEN and add an END for each CASE. A fairly simple parser could this for you.
1 2 3 4 5 6 7 8 9 10 11 12 13 |
/* statement to edit field 3*/ SET @local_col_3 = CASE WHEN @local_col_1 = 'F3' THEN CASE WHEN B.rec_nbr = 1 AND B.msg_type = 3 THEN 2 ELSE CASE WHEN B.rec_nbr > 1 AND B.msg_type = 3 THEN 3 ELSE CASE WHEN B.rec_nbr > 1 AND B.msg_type = 4 THEN 4 END END END END; |
Oops! That will not work. Each of those CASE clauses has an implicit “ELSE NULL” clause before the “END” terminator. Try again, with a “do nothing” clause for safety.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
SET @local_col_3 = CASE WHEN @local_col_1 = 'F3' THEN CASE WHEN B.rec_nbr = 1 AND B.msg_type = 3 THEN 2 ELSE CASE WHEN B.rec_nbr > 1 AND B.msg_type = 3 THEN 3 ELSE CASE WHEN B.rec_nbr > 1 AND B.msg_type = 4 THEN 4 ELSE @local_col_3 END EN END ELSE @local_col_3 END; |
In 1982, at the start of the Structured Programming Revolution, I coined the term “Lasagne code” as opposed to “Spaghetti code” that had appeared in the literature.
Spaghetti code was defined as code with so many GOTO statements in it, you could not follow the execution path. You had no idea what the state of the program was at any particular time. Lasagne code was IF-THEN-ELSE, switch and other control flow GOTO-less structures nested inside each other that you also lost track of the the state of the program.
You can un-layer lasagne code to make it more readable. The layered code was probably faster, but not that much faster in the majority of cases. But see how much easier to read and maintain this version is:
1 2 3 4 5 6 7 8 9 10 11 |
SET @local_col_3 = CASE WHEN @local_col_1 <> 'F3' THEN @local_col_3 -- do nothing, this is not field #3 data WHEN @local_col_1 = 'F3' AND B.rec_nbr = 1 AND B.msg_type = 3 THEN 2 WHEN @local_col_1 = 'F3' AND B.rec_nbr > 1 AND B.msg_type = 3 THEN 3 WHEN @local_col_1 = 'F3' AND B.rec_nbr > 1 AND B.msg_type = 4 THEN 4 ELSE @local_col_3 -- do nothing for safety END; |
In this particular example, we can squeeze the code a bit more.
1 2 3 4 5 6 7 8 9 |
SET @local_col_3 = CASE WHEN @local_col_1 <> 'F3' THEN @local_col_3 WHEN @local_col_1 = 'F3' AND B.rec_nbr = 1 AND B.msg_type = 3 THEN 2 WHEN @local_col_1 = 'F3' AND B.rec_nbr > 1 AND B.msg_type IN (3,4) THEN B.msg_type ELSE @local_col_3 END; |
The point is that you can read each WHEN clause and see what the business rule is without having to move your eyes back to a containing CASE expression. As the optimizer gets smarter, it will handle the CASE expressions for you, just as procedural optimizers learned to improve loops, find dead code and do other good things for us.
CASE Expressions
It allows the programmer to pick a value based on a logical expression in his code. ANSI stole the idea and the syntax from the now-defunct ADA programming language. Here is the syntax for a <case specification> :
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 |
<case specification> ::= <simple case> | <searched case> <simple case> ::= CASE <case operand> <simple when clause>... [<else clause>] END <searched case> ::= CASE <searched when clause>... [<else clause>] END <simple when clause> ::= WHEN <when operand> THEN <result> <searched when clause> ::= WHEN <search condition> THEN <result> <else clause> ::= ELSE <result> <case operand> ::= <value expression> <when operand> ::= <value expression> <result> ::= <result expression> | NULL <result expression> ::= <value expression> |
The searched CASE expression is probably the most used version of the expression. First the expression is given a data type by seeing what the highest data type in its THEN clauses is. The WHEN … THEN … clauses are executed in left-to-right order. The first WHEN clause that tests TRUE returns the value given in its THEN clause.
And, yes, you can nest CASE expressions inside each other. If no explicit ELSE clause is given for the CASE expression, then the database will insert an implicit ELSE NULL clause before the END keyword. If you wish to return a NULL from a THEN, however, you should use a CAST (NULL AS <data type>) expression to establish the data type for the compiler.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 |
-this works CASE WHEN 1 = 1 THEN NULL ELSE CAST(NULL AS INTEGER) END -this works CASE WHEN 1 = 1 THEN CAST(NULL AS INTEGER) ELSE NULL END -this does not work; no <result> to establish a data type CASE WHEN 1 = 1 THEN NULL ELSE NULL END - might or might not work in your SQL CAST (CASE WHEN 1 = 1 THEN NULL ELSE NULL END AS INTEGER) |
I recommend always writing an explicit ELSE clause, so that you can change it later when you find a value to return. I would also recommend that you explicitly cast a NULL in the CASE expression THEN clauses to the desired data type.
The <simple case expression> is defined as a searched CASE expression in which all the WHEN clauses are made into equality comparisons against the <case operand> . For example,
1 2 3 4 5 6 |
CASE iso_sex_code WHEN 0 THEN 'Unknown' WHEN 1 THEN 'Male' WHEN 2 THEN 'Female' WHEN 9 THEN 'N/A' ELSE NULL END |
could also be written as:
1 2 3 4 5 6 |
CASE WHEN iso_sex_code = 0 THEN 'Unknown' WHEN iso_sex_code = 1 THEN 'Male' WHEN iso_sex_code = 2 THEN 'Female' WHEN iso_sex_code = 9 THEN 'N/A' ELSE NULL END |
There is a gimmick in this definition, however. The expression,
1 2 3 4 |
CASE foo WHEN 1 THEN 'bar' WHEN NULL THEN 'no bar' END |
becomes
1 2 3 |
CASE WHEN foo = 1 THEN 'bar' WHEN foo = NULL THEN 'no_bar' -- problem! ELSE NULL END |
The “WHEN foo = NULL” clause is always UNKNOWN.
The COALESCE() and NULLIF() Functions
The SQL-92 Standard defines other functions in terms of the CASE expression, which makes the language a bit more compact and easier to implement. For example, the COALESCE() function can be defined for one or two expressions by
- COALESCE(<value exp #1>) is equivalent to (<value exp #1>)
- COALESCE(<value exp #1>, <value exp #2>) is equivalent to
123CASE WHEN <value exp #1> IS NOT NULLTHEN <value exp #1>ELSE <value exp #2> END
then we can recursively define it for (n) expressions, where
1 |
COALESCE(<value exp #1>, <value exp #2>, .., n), |
as equivalent to:
1 2 3 4 |
CASE WHEN <value exp #1> IS NOT NULL THEN <value exp #1> ELSE COALESCE(<value exp #2>, .., n) END; |
Programmers trying to UN-learn their old T-SQL dialect confuse the COALESCE() and NULLIF() functions. The data type proprietary IFNULL(<expression 1>, <expression 2>) is determined by <expression 1>. Run this code to see what happens.
1 2 3 4 5 6 7 8 9 |
BEGIN DECLARE expr_1 INTEGER; DECLARE expr_2 DECIMAL (5,3); SET expr_1 = NULL; SET expr_2 = 99.999; SELECT IFNULL(expr_1, expr_2); --99, integer SELECT COALESCE(expr_1, expr_2); --99.999, decimal END; |
It is even messier when <expression 1> is a character data type.
Likewise, NULLIF (<value exp #1>, <value exp #2>) is equivalent to:
1 2 3 |
CASE WHEN <value exp #1> = <value exp #2> THEN NULL ELSE <value exp #1> END |
The THEN NULL clause should cast to the data type of <value exp #1>, since that is all we have to use as a template. This also makes sense because you don’t want to replace a value with a different data type in different rows in the same column of a table.
Before your get too comfortable with this, there are more problems in T-SQL dialect. Run these statements a few times. The ISNULL() will produce either 666 or 999. The COALESCE() will produce 666,999 or NULL.
1 2 |
SELECT COALESCE((SELECT CASE WHEN RAND() <= 0.5 THEN 666 END), 999); SELECT ISNULL((SELECT CASE WHEN RAND() <= 0.5 THEN 666 END), 999); |
This has been reported as a bug by Hugo Kornelis here. In fact there is a general problem with subquery expressions in a COALESCE() because it is expanded as per the CASE statement and just gave. If the <value exp #1> is not deterministic. T-SQL will re-compute each invocation in the expression and can get different results.
The final 999 value in both expressions should ensure the overall result is never NULL, but Microsoft has it wrong. I have no idea where it comes from. The Standard SQL/PSM language has an option to declare a procedure to be DETERMINISTIC or not to aid the compiler. As I read the Standards, a non-deterministic function call should be evaluated once in each statement where it appears and that value used in the statement for each invocation. Its next invocation in a separate statement is then re-computed. The work-around is to load the non-deterministic function value into a local variable or scalar subquery expression then use that in the COALESCE
Load comments