I have spent many years trying to persuade people to use declarative rather than procedural code in SQL. One of my books is THINKING IN SETS, with the oversized subtitle “Auxiliary, Temporal, and Virtual Tables in SQL” to explain what topics are covered. Most programmers discover that it is too big a leap from a procedural mindset to a declarative mindset, and so don’t quite make the transition all at once. Instead they evolve from a procedural paradigm to a variety of semi-procedural programming styles.
This is just the way that we learn; a step at a time, not a leap of insight all at once. The first motion pictures, for example, were shot with a fixed-position camera aimed at a stage. That is how people had seen stage plays for several thousand years. D. W. Griffith was to movies as Dr. Codd was to databases. Griffith made the first two-reeler in 1910; nobody had thought about more than one reel before that. Then in 1911 and 1912 he and his cameraman, Billy Bitzer, started moving the camera and lens while filming. He gave us new camera techniques such as the close-up and soft focus.
In February 2010, I came across an example, in a Newsgroup discussion, of programmers making the steps, but not the leap. The code details aren’t relevant to the point I’m making, so I am going to gloss over them.
The thread starts with a posting about a User-Defined Function (UDF) that is not working. His opening paragraph was:
“I have the code below to take a set of INTEGER values and return a VARCHAR based on the combination of inputs. However, I’m getting an error on line 6, which is the first line where the word CASE pops up. Plenty of CASE statements have passed through my hands before, so I’m lost of why this one is wrong.”
What followed was a CASE expression with BETWEENs and ORs and CASE within CASE constructs. It took pairs of (x, y) and produced an answer from a set of three values, call them {‘a’, ‘b’, ‘c’}. Again, the coding details are not my point. The body of the function could just as well have been a complicated mathematical expression.
Two repliers pointed out that CASE is an expression and not a statement in SQL. They also pointed out that he was returning a VARCHAR(1) instead of a CHAR(1). The CASE expression can be confusing to anyone who was initially trained in a procedural language that has IF-THEN-ELSE statement constructs.
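To make the distinction concrete, here is a minimal sketch of my own, not the poster's code, assuming local variables @x and @result have been declared:

-- Procedural mindset: IF is a flow-of-control statement
IF @x < 100
   SET @result = 'A';
ELSE
   SET @result = 'B';

-- Declarative mindset: CASE is an expression that yields a value inside a statement
SET @result = CASE WHEN @x < 100 THEN 'A' ELSE 'B' END;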
His code looked like this skeleton:
CREATE FUNCTION FnFindFoobar (@x INTEGER, @y INTEGER)
RETURNS VARCHAR
WITH EXECUTE AS CALLER
AS
BEGIN
<< horrible CASE expression with x and y >>;
END;
The clean-up and quick fix was:
CREATE FUNCTION Find_Foobar (@in_x INTEGER, @in_y INTEGER)
RETURNS CHAR(1)
WITH EXECUTE AS CALLER
AS
BEGIN
RETURN (<< horrible CASE expression with x and y >>);
END;
Someone else then asked if he had considered pre-calculating the CASE expression results and populating a table with them. This was good advice, since the number of (x, y) pairs involved came to a few thousand cases. There is no point in dismissing this solution when the look-up table is as small as this one. Read-only tables this size tend to be in main storage or cache, so they can be shared among many sessions, and you aren’t going to save much on memory by choosing a different method.
But the person who made this suggestion went on to add “You can use the table with your UDF or you could use it without the UDF”, without explaining what the differences are. They are important. Putting the data in a read-only table this size will tend to keep it in main storage or cache. If you are really so tight on primary and/or secondary storage that you cannot fit a ~5K row table in your hardware, buy some chips and disks; they are cheap today. Now the data can be shared among many sessions. The table and its indexes can be used by the optimizer. In SQL Server, you can include the single column foobar in the index to get a covering index and a performance improvement.
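As a rough sketch, assuming the lookup table is named FooLookup as in the later examples, that covering index might look like this:

-- SQL Server syntax: foobar is carried in the leaf level of the index,
-- so the lookup query never has to touch the base table rows
CREATE INDEX FooLookup_xy_covering
    ON FooLookup (x, y)
       INCLUDE (foobar);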
But if you choose to lock the data inside the procedural code of a UDF, can it be shared? Do computations get repeated with each invocation? What about indexes? Ouch! A UDF pretty-much locks things inside. Standard SQL/PSM has a [NOT] DETERMINISTIC option in its procedure declarations. This tells the compiler whether the procedure or function is always going to return the same answer for the same arguments.
(A note about Standard SQL terms: a parameter is the formal placeholder in the parameter list of a declaration, and an argument is the value passed in the invocation of the procedure.)
A non-deterministic function has to be computed over and over again, every time the UDF is called; if the Query Optimizer doesn’t know for certain whether a procedure or function is deterministic, it has to assume it isn’t and go the long route.
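Here is a minimal SQL/PSM-style sketch of that option. T-SQL has no DETERMINISTIC keyword (SQL Server infers the property instead), so treat this as illustrative only:

CREATE FUNCTION Find_Foobar (in_x INTEGER, in_y INTEGER)
RETURNS CHAR(1)
LANGUAGE SQL
DETERMINISTIC   -- the same arguments always return the same value
RETURN (<< horrible CASE expression with in_x and in_y >>);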
Here is the skeleton of what was posted.
-- Create table
CREATE TABLE Foobar
(x INTEGER NOT NULL,
 y INTEGER NOT NULL,
 foobar CHAR(1) NOT NULL,
 PRIMARY KEY CLUSTERED (x, y));

-- Populate table with recursive CTEs and proprietary syntax
WITH X_CTE AS
(SELECT 100 AS x
 UNION ALL
 SELECT x + 1 FROM X_CTE WHERE x < 300),
Y_CTE AS
(SELECT 1 AS y
 UNION ALL
 SELECT y + 1 FROM Y_CTE WHERE y < 100)
INSERT INTO Foobar (x, y, foobar)
SELECT x, y,
       << horrible CASE expression with x and y >> AS foobar
  FROM X_CTE CROSS JOIN Y_CTE;
This is a nice trick, but it is worth noting that it is highly proprietary. No other SQL has ever allowed SELECT without a FROM. Standard SQL requires a RECURSIVE keyword, and you cannot use CTEs with any statement but a SELECT. It is easy enough to re-write this into portable Standard SQL, using a table of integers called Series (I used to call it Sequence, but that is now a reserved word in Standard SQL; if you have a better name, please tell me). This is the most common SQL auxiliary table; experienced SQL programmers create it, and then a Calendar table, at the start of almost every new project.
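If you do not already have a Series table, a minimal sketch might look like this; the upper bound of 10,000 is an arbitrary assumption:

CREATE TABLE Series
(seq INTEGER NOT NULL PRIMARY KEY
     CHECK (seq > 0));
-- Populate it once with the integers 1 through 10000,
-- by whatever bulk method your dialect offers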
INSERT INTO FooLookup (x, y, foobar)
SELECT X_CTE.x, Y_CTE.y,
       << horrible CASE expression >> AS foobar
  FROM (SELECT seq FROM Series
         WHERE seq BETWEEN 100 AND 300) AS X_CTE(x)
       CROSS JOIN
       (SELECT seq FROM Series
         WHERE seq BETWEEN 1 AND 100) AS Y_CTE(y);
Recursion, a procedural tool, is expensive. But that is not my point. The first thought was to use a procedural tool, and not a data-driven approach, to get that CROSS JOIN. See what I mean about a mindset? This is the semi-procedural guy going back to what he knows. He almost got to a declarative mindset.
Now let’s go on with the rest of the skeleton code for the function:
CREATE FUNCTION Find_Foobar (@in_x INTEGER, @in_y INTEGER)
RETURNS CHAR(1)
WITH EXECUTE AS CALLER
AS
BEGIN
RETURN COALESCE
       ((SELECT foobar
           FROM FooLookup
          WHERE x = @in_x
            AND y = @in_y), 'A');
END;
The reason for COALESCE() is that ‘A’ is a default value in the outer CASE expression, but also a valid result in various THEN and ELSE clauses inside the inner CASE expressions. The scalar query will return a NULL if it cannot find an (x, y, foobar) row in the table. If we know that the query covers the entire (x, y) universe, then we do not need the COALESCE() and can avoid a UDF completely.
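In that case, a plain join against the lookup table does the job. This sketch assumes a hypothetical SomeData table holding the (x, y) pairs to be classified:

SELECT D.x, D.y, F.foobar
  FROM SomeData AS D
       INNER JOIN
       FooLookup AS F
       ON F.x = D.x
      AND F.y = D.y;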
Now, let’s think about declarative programming. In SQL that means constraints in the table declaration in the DDL. This skeleton has none except the PRIMARY KEY. Aiee! Here is a problem that you find with magazine articles and newsgroup postings: it is so easy to skip over the constraints when you provide a skeleton table. You did not need them when you declared a file, did you? What one can forget is that the three SQL sub-languages (DDL, DML and DCL) work together. In particular, the DDL constraints are used by the DML optimizer to provide a better execution strategy.
The << horrible CASE expression >> implied expectations for x and y. We were given lower limits (100 and 1), but the upper limits were open after a small range of (x, y) pairs. I think we can assume that the original poster expected the vast majority of cases (or all of them) to fall in that small range and wanted to handle anything else as an error. In the real world, there is usually what Jerry Weinberg called “reasonableness checking” in data. The principle is also known as Zipf’s Law or the “look for a horse and not a zebra” principle in medicine.
The simple first shot would be to assume we always know the limits and can simply use:
CREATE TABLE FooLookup
(x INTEGER NOT NULL
   CHECK (x BETWEEN 100 AND 300),
 y INTEGER NOT NULL
   CHECK (y BETWEEN 1 AND 100),
 foobar CHAR(1) DEFAULT 'A' NOT NULL
   CHECK (foobar IN ('A', 'B', 'C')),
 PRIMARY KEY (x, y));
The DEFAULT ‘A’ subclause will take care of the situation where we do not have an explicit value for foobar. This avoids the COALESCE(). But what if one of the parameters can be anything? That is easy; drop the CHECK() and add a comment. What if one of the parameters is half-open or has a huge but sparse space? That is, we know a lower (upper) limit, but not the matching upper (lower) limit. Just use a simple comparison, such as CHECK (y >= 1), instead of a BETWEEN.
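For example, here is a sketch of the same skeleton with x left unconstrained and y half-open:

CREATE TABLE FooLookup
(x INTEGER NOT NULL,                     -- no CHECK(); x can be anything
 y INTEGER NOT NULL
   CHECK (y >= 1),                       -- lower limit known, upper limit open
 foobar CHAR(1) DEFAULT 'A' NOT NULL
   CHECK (foobar IN ('A', 'B', 'C')),
 PRIMARY KEY (x, y));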
A common situation, which was handled with nested CASE expressions in the original, is that you know a range for one parameter and the results for the other parameter within that range. That might be easier to see with code. Here is a CASE expression for some of the possible (x, y) pairs:
CASE WHEN x BETWEEN 100 AND 200
     THEN CASE WHEN y IN (2, 4, 6, 8) THEN 'B'
               WHEN y IN (1, 3, 5, 7, 9) THEN 'C'
          END
     WHEN x BETWEEN 201 AND 300
     THEN CASE WHEN y IN (2, 4, 6, 8, 99) THEN 'C'
               WHEN y IN (3, 5, 7, 9, 100) THEN 'B'
          END
     ELSE 'A' END
This is the DML version of a constraint. It lives only in the INSERT, UPDATE, or SELECT statement where it appears. What we really want is constraints in the DDL so that all statements, present and future, use them. The trick is to create the table with low and high values for each parameter range; a single value is shown with the low and high values equal to each other.
CREATE TABLE FooLookup
(low_x INTEGER NOT NULL,
 high_x INTEGER NOT NULL,
 CHECK (low_x <= high_x),
 low_y INTEGER NOT NULL,
 high_y INTEGER NOT NULL,
 CHECK (low_y <= high_y),
 foobar CHAR(1) NOT NULL
   CHECK (foobar IN ('A', 'B', 'C')),
 PRIMARY KEY (low_x, low_y));
The CASE expression now becomes this table:
low_x  high_x  low_y  high_y  foobar
====================================
  100     200      2       2  'B'
  100     200      4       4  'B'
  100     200      6       6  'B'
  100     200      8       8  'B'
  100     200      1       1  'C'
  100     200      3       3  'C'
  100     200      5       5  'C'
  100     200      7       7  'C'
  100     200      9       9  'C'
  201     300      2       2  'C'
  201     300      4       4  'C'
  201     300      6       6  'C'
  201     300      8       8  'C'
  201     300     99      99  'C'
  201     300      3       3  'B'
  201     300      5       5  'B'
  201     300      7       7  'B'
  201     300      9       9  'B'
  201     300    100     100  'B'
  301    9999    101    9999  'A'
-9999      99  -9999       0  'A'
As a safety device, put the default ‘A’ in ranges outside the rest of the table. I used -9999 and 9999 for the least and greatest limits, but you get the idea.
The query has to use BETWEENs on the high and low limits:
SELECT F.foobar, ..
  FROM FooLookup AS F, ..
 WHERE my_x BETWEEN F.low_x AND F.high_x
   AND my_y BETWEEN F.low_y AND F.high_y
   AND ..;
Is this always going to be the best way to do something? Who knows? Test it.