Procedural, Semi-Procedural and Declarative Programming in SQL

A lot of the time, the key to making SQL databases perform well is to take a break from the keyboard and rethink the way of approaching the problem; and rethinking in terms of a set-based declarative approach. Joe takes a simple discussion abut a problem with a UDF to illustrate the point that ingrained procedural reflexes can often prevent us from seeing simpler set-based techniques.

I have spent many years trying to persuade people to use declarative rather than procedural code in SQL. One of my books is THINKING IN SETS, with the oversized subtitle “Auxiliary, Temporal, and Virtual Tables in SQL” to explain what topics are covered. Most programmers discover that it is too big a big leap from a procedural mindset to a declarative mindset, and so don’t quite make the transition all at once. Instead they evolve from a procedural paradigm to a variety of semi-procedural programming styles.

This is just the way that we learn; a step at a time, not a leap of insight all at once. The first motion pictures, for example, were shot with a fixed position camera aimed at a stage. That is how people had seen stage plays for several thousand years. W. D. Griffith was to movies as Dr. Codd was to databases. Griffith made the first two-reeler in 1910; nobody had thought about more than one reel before that. Then in 1911 and 1912 he and his cameraman, Billy Bitzer, started moving the camera and lens while filming. He gave us new camera angles such as the close-up and soft focus.

In 2010 February, I came across an example, in a Newsgroup discussion, of  programmers making the steps, but not the leap. The code details aren’t relevant to the point I’m making so I am going to gloss over them.

The thread starts with a posting about a User-Defined Function (UDF) that is not working. His opening paragraph was:

“I have the code below to take a set of INTEGER values and return a VARCHAR based on the combination of inputs. However, I’m getting an error on line 6, which is the first line where the word CASE pops up. Plenty of CASE statements have passed through my hands before, so I’m lost of why this one is wrong.”

What followed was a CASE expression with BETWEENs and ORs and CASE within CASE constructs. It took pairs of (x, y) and produced an answer from a set of three values, call them {‘a’, ‘b’, ‘c’}. Again, the coding details are not my point. The body of the function could just as well have been a complicated mathematical expression.

Two repliers pointed out that CASE is an expression and not a statement in SQL. They also pointed out that he was  returning a VARCHAR(1) instead of a CHAR(1).  The CASE function can be confusing to anyone who has been initially trained with a procedural language which has IF-THEN-ELSE statement constructs.

His code looked like this skeleton:

The clean up and quick fix was:

Someone else then asked if he had considered pre-calculating the CASE expression results and populating a table with them. This was good advice, since the number of (x, y) pairs involved came to a few thousand cases. There is no point in dismissing this solution when the look-up table is as small as this one. Read-only tables this size tend to be in main storage or cache, so they can shared among many sessions, and you aren’t going to save much on memory by choosing a different method.

But the person who made this suggestion went on to add “You can use the table with your UDF or you could use it without the UDF.”, but he did not explain what the differences are. They are important. Putting the data in the read-only tables this size will tend to keep it in main storage or cache, If you are really that tight for primary and/or secondary storage that you cannot fit a ~5K row table in your hardware, buy some chips and disks. They are so cheap today. Now the data be shared among many sessions. The table and its indexes can be used by the optimizer. In SQL Server you can include the single column foobar in the index to get a covering index and performance improvement.

But if you choose to lock the data inside the procedural code of a UDF, can it be shared? Do computations get repeated with each invocation? What about indexes? Ouch! A UDF pretty-much locks things inside. Standard SQL/PSM has a [NOT] DETERMINISTIC option in its procedure declarations. This tells the compiler whether the procedure or function is always going to return the same answer for the same arguments.

 note about Standard SQL terms: a parameter is the formal place holder in the parameter list of a declaration and an argument is the value passed in the invocation of the procedure

 A non-deterministic function has to be computed over and over again, every time the UDF is called; if the Query Optimiser doesn’t know for certain whether a a procedure or function is deterministic, it has to assume it isn’t and go the long route.

Here is the skeleton of what was posted.

This is a nice trick, but it is worth noting that it is highly proprietary. No other SQL has ever allowed SELECT without a FROM. Standard SQL requires a RECURSIVE keyword, and you cannot use CTEs with any statement but a SELECT. It is easy enough to re-write this into portable Standard SQL, using a table of integers called Series (I used to call it Sequence, but that is now a reserved word in Standard SQL. If you have a better name, please tell me). This is the most common SQL auxiliary table; experienced SQL programmers create it and then a Calendar table at the start of almost new project.

Recursion, a procedural tool, is expensive. But that is not my point. The first thought was to use a procedural tool and not a data driven approach to get that CROSS JOIN. See what I mean a mindset? This is the semi-procedural guy going back to what he knows. He almost got to a declarative mindset.

Now let’s go on with the rest of the skeleton code for the function:

The reason for COALESCE() is that ‘A’ is a default value in the outer CASE expression, but also a valid result in various THEN and ELSE clauses inside inner CASE expressions. The scalar query will return a NULL if it cannot find an (x, y, foobar) row in the table. If we know that the query covers the entire (x, y) universe, then we did not need the COALESCE() and could have avoided a UDF completely.

Now, let’s think about declarative programming. In SQL that means constraints in the table declaration in the DDL. This skeleton has none except the PRIMARY KEY. Aiee! Here is a problem that you find with magazine articles and newsgroup postings; It is so easy to skip over the constraints when you provide a skeleton table . You did not need them when you declared a file, do you? What one can forget is that the three SQL sub-languages (DDL, DML and DCL) work together. In particular, the DDL constraints are used by the DML optimizer to provide a better execution strategy..

The << horrible CASE expression >> implied expectations for x and y. We were given lower limits (100 and 1), but the upper limits were open after a small range of (x, y) pairs. I think we can assume that the original poster expected the vast majority of cases (or all of them) to fall in that small range and wanted to handle anything else as an error. In the real world, there is usually what Jerry Weinberg called “reasonableness checking” in data. The principle is also known as Zipf’s Law or the “look for a horse and not a zebra” principle in medicine.

The simple first shot would be to assume we always know the limits and can simply use:

The DEFAULT ‘A’ subclause will take care of situation where we did not have an explicit value for foobar. This avoids the COALESCE(). But what if one of the parameters can be anything? That is easy; drop the CHECK() and add a comment. What if one of the parameters is half open or has a huge but sparse space? That is, we know a lower (upper) limit, but not the matching upper (lower) limit. Just use a simple comparison, such as CHECK (y >= 1), instead of a BETWEEN.

A common situation, which was done with nested CASE expression in the original, is that you know a range for a parameter and what the results for the other parameter within that range are. That might be easier to see with code. Here is a CASE expression for some of the possible (x,y) pairs:

This is the DML version of a constraint. It lives only in the INSERT. UPDATE, INSERT or SELECT statement where it appears. What we really want is constraints in the DDL so that all statements, present and future, use it. The trick is to create the table with low and high values for each parameter range; a single value is shown with the low and high values equal to each other.

CASE expression now becomes this table:

As a safety device, put the default ‘A’ in ranges outside the rest of the table. I used -9999 and 9999 for the least and greatest limits, but you get the idea.

The query has to use BETWEENs on the high and low limits:

Is this always going to be the best way to do something? Who knows? Test it.