VALUES() and Long Parameter Lists

To make progress as a relational database programmer, you have to think in terms of sets, rather than lists, arrays or sequential data. Until that point, you'll feel the need to pass lists, arrays and the like to functions and procedures. Joe suggests some unusual ways around the difficulty and broods on the power of the VALUES constructor.

The original idea of SQL and RDBMS was that all you need to model data is the table. This is a set-oriented approach to data which falls naturally into declarative programming. Sets are defined by intension or by extension. An intensional definition gives you a rule (a characteristic function) with which to test membership; such definitions are handy for sets of large or unknown size. For example, the set of even numbers is the set of integers that leave a remainder of zero when divided by two.

A set, defined by extension, is a finite collection of values. For example, Donald Duck’s nephews are {Huey, Louie and Dewey}; well, actually, they are Huebert, Deuteronomy and Louis but you have to be a real comic book geek to know that.

What you do not have in RDBMS are arrays, linked lists, sequential files and other data structures that have to be navigated. This drives procedural programmers nuts. About once a month, one of them will post a request to a newsgroup asking how to pass a CSV (comma separated values) list in a string as a parameter to a stored procedure. Every time, links to the same several tricks are posted over and over, with the definitive articles being those at Erland Sommarskog’s website.

The approaches to this problem can be classified as (1) iterative algorithms, (2) auxiliary tables, and (3) non-SQL tools.

The iterative approaches explain themselves. Just imagine that you are back in your freshman C or Assembly language programming class and are working at the lowest level you can. Read the CSV from left to right. When you get to a comma, convert the substring to the left of the comma into the target data type. Drop the comma; repeat the WHILE loop until you have consumed the string.
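A minimal sketch of such a loop in T-SQL, assuming the list holds integers (the procedure and variable names are just for illustration):

    CREATE PROCEDURE ParseList_Loop (@inputstring VARCHAR(8000))
    AS
    BEGIN
    DECLARE @comma_pos INTEGER;
    DECLARE @parmlist TABLE (param INTEGER NOT NULL);

    SET @inputstring = @inputstring + ','; -- sentinel comma on the end

    SET @comma_pos = CHARINDEX(',', @inputstring);
    WHILE @comma_pos > 0
       BEGIN
       -- cast the substring to the left of the comma, then drop it
       INSERT INTO @parmlist (param)
       VALUES (CAST(LEFT(@inputstring, @comma_pos - 1) AS INTEGER));
       SET @inputstring = SUBSTRING(@inputstring, @comma_pos + 1, 8000);
       SET @comma_pos = CHARINDEX(',', @inputstring);
       END;

    SELECT param FROM @parmlist;
    END;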

A more recent variation is to use a Recursive CTE, but it is essentially the same thing.
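Here is a sketch of the recursive version, under the same integer-list assumption:

    DECLARE @inputstring VARCHAR(8000);
    SET @inputstring = '1,2,3,4,5';

    WITH Splitter (param, rest) -- peel off one element per level of recursion
    AS (SELECT LEFT(@inputstring + ',', CHARINDEX(',', @inputstring + ',') - 1),
               SUBSTRING(@inputstring + ',', CHARINDEX(',', @inputstring + ',') + 1, 8000)
        UNION ALL
        SELECT LEFT(rest, CHARINDEX(',', rest) - 1),
               SUBSTRING(rest, CHARINDEX(',', rest) + 1, 8000)
          FROM Splitter
         WHERE CHARINDEX(',', rest) > 0)
    SELECT CAST(param AS INTEGER) AS param
      FROM Splitter
    OPTION (MAXRECURSION 0); -- lists longer than 100 elements need this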

Another classic approach is to use a Series table (a list of numbers from 1 to n). First put a comma on both ends of the input string. All of the substrings we want to cast are bracketed by commas, so we extract them as a set. Here is one version of these procedures.
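This sketch assumes a Series auxiliary table with a single integer column, i, holding the numbers 1 through n:

    CREATE PROCEDURE ParseList_Series (@inputstring VARCHAR(8000))
    AS
    BEGIN
    DECLARE @padded VARCHAR(8000);
    SET @padded = ',' + @inputstring + ','; -- bracket every substring with commas

    SELECT CAST(SUBSTRING(@padded, n1.i + 1, n2.i - n1.i - 1) AS INTEGER) AS param
      FROM Series AS n1, Series AS n2
     WHERE SUBSTRING(@padded, n1.i, 1) = ','
       AND SUBSTRING(@padded, n2.i, 1) = ','
       AND n1.i < n2.i
       AND NOT EXISTS -- no third comma between the pair, so they are adjacent
           (SELECT *
              FROM Series AS n3
             WHERE n3.i BETWEEN n1.i + 1 AND n2.i - 1
               AND SUBSTRING(@padded, n3.i, 1) = ',')
       AND n1.i <= LEN(@inputstring) + 2
       AND n2.i <= LEN(@inputstring) + 2;
    END;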

Or maybe you would prefer the CTE variant.
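A sketch of that variant, generating the sequence numbers on the fly instead of storing them in a base table:

    DECLARE @inputstring VARCHAR(8000);
    DECLARE @padded VARCHAR(8000);
    SET @inputstring = '1,2,3';
    SET @padded = ',' + @inputstring + ',';

    WITH Series (i) -- generate the sequence numbers instead of storing them
    AS (SELECT 1
        UNION ALL
        SELECT i + 1
          FROM Series
         WHERE i < LEN(@inputstring) + 2)
    SELECT CAST(SUBSTRING(@padded, n1.i + 1, n2.i - n1.i - 1) AS INTEGER) AS param
      FROM Series AS n1, Series AS n2
     WHERE SUBSTRING(@padded, n1.i, 1) = ','
       AND SUBSTRING(@padded, n2.i, 1) = ','
       AND n1.i < n2.i
       AND NOT EXISTS
           (SELECT *
              FROM Series AS n3
             WHERE n3.i BETWEEN n1.i + 1 AND n2.i - 1
               AND SUBSTRING(@padded, n3.i, 1) = ',')
    OPTION (MAXRECURSION 0);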

To be honest, you would probably want to trim blanks and perhaps do other tests on the string, such as checking that LOWER(@inputstring) = UPPER(@inputstring) to catch stray alphabetic characters, and so forth.
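For example, a crude guard at the top of the procedure might look like this sketch; note that the LOWER/UPPER comparison only catches letters under a case-sensitive or binary collation:

    DECLARE @inputstring VARCHAR(8000);
    SET @inputstring = ' 12, 34, 5A';

    SET @inputstring = REPLACE(@inputstring, ' ', ''); -- strip the blanks
    -- the binary collation makes the test case-sensitive, so any letter fails it
    IF LOWER(@inputstring) <> UPPER(@inputstring) COLLATE Latin1_General_BIN
       RAISERROR ('Alphabetic characters in the input list', 16, 1);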

The integer substrings are located between the (i)-th and (i+1)-th comma pairs. In effect, the Series table replaces the loop counter. The Series table has to have enough numbers to cover the entire string, but unless you really like to type in long parameter lists, this should not be a problem. The last two predicates, n1.i <= LEN(@inputstring)+2 and n2.i <= LEN(@inputstring)+2, in the first version are there to avoid a Cartesian product with the Series table.

Finally, the external methods involve a CLR function or XML processing. The most fundamental complaint against them is the cost of maintaining the code. Your CLR routine will be written in one of the dozens of CLR languages, and you are assuming that the next guy will also be fluent in it. The same complaint about extra maintenance work can be made against XML. The SQL optimizer is never going to do anything with this external code, and the overhead of passing data through the various APIs will never go away.

The simplest answer is to use a long parameter list to construct lists and derived tables inside the procedure body. SQL Server can handle up to 2,100 parameters, which should be more than enough for practical purposes. SQL Server is actually a wimp in this regard; DB2 can pass 32K parameters, and Oracle can have 64K.

Those database engines use such extreme limits for certain system functions that users will never see. Nobody expects a human being to type in thousands of lines of CSV text with any of these techniques. In practice, I have seen lists of 64 (chessboard), 81 (Sudoku solver) and 361 (Go board) simple parameters. Most of the time, 100 parameters is a very safe upper limit. If I need to pass more than that, I want to look at an ETL tool or something else.

The main advantages of the long parameter lists are:

  1. The code is highly portable to any Standard SQL implementation. One-product bigots need to get over the idea that they are never going to see more than one SQL engine in their careers.
  2. The code is pure native SQL and not an external procedural language. The people who maintain it can be SQL programmers. There will be no need to send your C# programmer to F# classes next year to keep up.
  3. The optimizer will treat them like any other parameter. They can be sniffed. You can use a RECOMPILE option and get the best performance possible each time the procedure runs.
  4. The compiler will treat them like any other parameter. You get the expected error messages and conversions. Frankly, I have never seen anyone who used one of the other techniques write the RAISERROR() calls needed to return the same error messages as the compiler.
  5. It is easy to generate the code with a text editor. Think about how hard it is to write in a CLR language you don’t know, or to write a loop in T-SQL. I just cut & paste a skeleton with 250 parameters and then trim it down to what I need. I am going to show a list of five parameters in my skeleton code to save space.

The simplest example of the long parameter list technique is just to use the parameters directly: pass what you need and let the rest of the list default to NULLs.
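A sketch, using five parameters and a hypothetical Personnel table:

    CREATE PROCEDURE FindPersonnel
    (@emp_id_1 INTEGER = NULL,
     @emp_id_2 INTEGER = NULL,
     @emp_id_3 INTEGER = NULL,
     @emp_id_4 INTEGER = NULL,
     @emp_id_5 INTEGER = NULL)
    AS
    SELECT emp_id, emp_name
      FROM Personnel
     WHERE emp_id IN (@emp_id_1, @emp_id_2, @emp_id_3, @emp_id_4, @emp_id_5);

    -- callers pass only what they need; the unused slots default to NULL,
    -- and a NULL in an IN() list never matches anything:
    -- EXEC FindPersonnel @emp_id_1 = 42, @emp_id_3 = 99;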

If you had a long list, there would be a lot of NULLs in the IN() predicate list. It would be a good idea to clean them out. But you cannot do that with a simple IN() predicate list. We need to get the data into a column in a derived table first. This is another simple text edit problem.
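Again with the hypothetical Personnel table, a sketch that filters the NULLs out in a derived table:

    CREATE PROCEDURE FindPersonnel
    (@emp_id_1 INTEGER = NULL,
     @emp_id_2 INTEGER = NULL,
     @emp_id_3 INTEGER = NULL,
     @emp_id_4 INTEGER = NULL,
     @emp_id_5 INTEGER = NULL)
    AS
    SELECT emp_id, emp_name
      FROM Personnel
     WHERE emp_id IN
           (SELECT emp_id
              FROM (VALUES (@emp_id_1), (@emp_id_2), (@emp_id_3),
                           (@emp_id_4), (@emp_id_5)) AS ParmList (emp_id)
             WHERE emp_id IS NOT NULL); -- clean out the unused NULL slots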

The use of the VALUES() constructor is new to SQL Server programmers, but other products have had it for a while now. Here is a simple way to keep a list of constants in a way that prevents them from being accidentally updated.
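A sketch, using the first few primes as the hypothetical constant list; since the view sits on a VALUES constructor rather than a base table, there is nothing for an UPDATE to get hold of:

    CREATE VIEW Primes (p)
    AS
    SELECT p
      FROM (VALUES (2), (3), (5), (7), (11),
                   (13), (17), (19), (23), (29)) AS PrimeList (p);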

This is not quite complete. If you go to the VALUES list, you can explicitly CAST() the first row in the constructor, and the rest of the table will follow it.
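Revisiting the sketch above with that fix:

    CREATE VIEW Primes (p)
    AS
    SELECT p
      FROM (VALUES (CAST(2 AS INTEGER)), -- the first row sets the column's data type
                   (3), (5), (7), (11),
                   (13), (17), (19), (23), (29)) AS PrimeList (p);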

We could use a view for a similar list of constants, such as the NATO (international radiotelephony) spelling alphabet.
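A sketch of such a view; the code words are the standard NATO ones, but the view and column names are my own:

    CREATE VIEW NATO_Alphabet (letter, code_word)
    AS
    SELECT letter, code_word
      FROM (VALUES (CAST('A' AS CHAR(1)), CAST('Alfa' AS VARCHAR(10))),
                   ('B', 'Bravo'), ('C', 'Charlie'), ('D', 'Delta'),
                   ('E', 'Echo'), ('F', 'Foxtrot'), ('G', 'Golf'),
                   ('H', 'Hotel'), ('I', 'India'), ('J', 'Juliett'),
                   ('K', 'Kilo'), ('L', 'Lima'), ('M', 'Mike'),
                   ('N', 'November'), ('O', 'Oscar'), ('P', 'Papa'),
                   ('Q', 'Quebec'), ('R', 'Romeo'), ('S', 'Sierra'),
                   ('T', 'Tango'), ('U', 'Uniform'), ('V', 'Victor'),
                   ('W', 'Whiskey'), ('X', 'Xray'), ('Y', 'Yankee'),
                   ('Z', 'Zulu')) AS NATO (letter, code_word);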

VALUES is a row constructor, so it can do more than just parse a CSV string into scalars. Let’s say I want to sell off the animals in my zoo. I can put the order header data in the front of the parameter list, then use the rest of the list for order details.

A skeleton for this insertion will use the price list I set, joining it to the animals' SKU codes on the order form I construct from the parameter list.
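A sketch, with hypothetical ZooOrders, ZooOrderDetails and PriceList tables, and five SKU/quantity pairs following the two header columns:

    CREATE PROCEDURE SellAnimals
    (@order_nbr INTEGER,
     @customer_name VARCHAR(35),
     @sku_1 INTEGER = NULL, @qty_1 INTEGER = NULL,
     @sku_2 INTEGER = NULL, @qty_2 INTEGER = NULL,
     @sku_3 INTEGER = NULL, @qty_3 INTEGER = NULL,
     @sku_4 INTEGER = NULL, @qty_4 INTEGER = NULL,
     @sku_5 INTEGER = NULL, @qty_5 INTEGER = NULL)
    AS
    BEGIN
    -- the order header comes from the front of the parameter list
    INSERT INTO ZooOrders (order_nbr, customer_name)
    VALUES (@order_nbr, @customer_name);

    -- the order form is built from the rest of the list,
    -- then joined to the price list on the SKU codes
    INSERT INTO ZooOrderDetails (order_nbr, sku, qty, unit_price)
    SELECT @order_nbr, OrderForm.sku, OrderForm.qty, PriceList.unit_price
      FROM (VALUES (@sku_1, @qty_1),
                   (@sku_2, @qty_2),
                   (@sku_3, @qty_3),
                   (@sku_4, @qty_4),
                   (@sku_5, @qty_5)) AS OrderForm (sku, qty)
           INNER JOIN PriceList
           ON PriceList.sku = OrderForm.sku
     WHERE OrderForm.sku IS NOT NULL;
    END;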

Do you see the idea? An entire program is being built from the long parameter list in a single declarative statement that uses no special features.

Think about what you can do with CASE expressions and computations inside the VALUES() constructor. Most procedural control of flow can be avoided. Likewise, a MERGE statement is quite handy, since it can take a table constructor in its USING clause. What we are doing is writing SQL like a functional programming language.
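For example, a sketch of such a MERGE against the hypothetical PriceList table from the zoo example:

    MERGE INTO PriceList AS Target
    USING (VALUES (101, 500.00),
                  (102, 750.00),
                  (103, 225.00)) AS Source (sku, unit_price)
       ON Target.sku = Source.sku
     WHEN MATCHED THEN
          UPDATE SET unit_price = Source.unit_price
     WHEN NOT MATCHED THEN
          INSERT (sku, unit_price)
          VALUES (Source.sku, Source.unit_price);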

But isn’t the VALUES() constructor like a table variable? Table variables behave like tables, but with some restrictions. You cannot use a table variable as an input or an output parameter; it is scoped just like any local variable you create with a DECLARE statement, and it will no longer exist after the procedure exits. That much is like a parameter. But a table variable has to be declared, allocated in storage, and then used with INSERT, UPDATE and DELETE statements to get data in and out of it. There are also other restrictions on indexes and ALTER statements, but ignore those for now.

The long parameter list, on the other hand, is made up of plain old parameters, which means they can be used for outputs as well as inputs. The whole construction is also contained in a single statement for the optimizer to work with.

Do I think these techniques will always be the best solution? Of course not. There is no such thing in SQL. But it is worth looking at when you get an appropriate problem.