Avoiding the EAV of Destruction

A forum posting, from someone who wanted a better solution to the common problem of handling global settings in a database, leads Joe Celko into a fascinating discussion of the ramifications of the various solutions.

This interesting problem appeared on the SQL Server programming newsgroup in early February 2009. Let me just paste in the quote that started the thread.

Hello, in a database we deploy in our customer’s company, we need to use a lot of parameters. For example, how many invoices should we print by default. These parameters are used in a lot of queries, stored procedures, etc … Do you know a best design for this? Now, the parameters are stored in a table with a name, a type and several columns corresponding to each type..

Before talking about the possible solutions, I want to talk about the mindset that leads to each solution. It is not enough to have an answer; you need to know what the process was that led to a solution. Math majors appreciate that it is not enough to just come up with a final answer. You have to show your work and give a proof that this is a good answer.

We have all used some kind of configuration file in a software package. A lot of us have probably written one. The C language has a preprocessor that can look at flags in the command line and do conditional compilations. Since C is a low-level systems language, most of the conditional compilations were aimed at the hardware – what kind of disk drives and how many of them were in the system. The preprocessor code was often more complex than the final C program you compiled when all the macros were expanded. You were writing device drivers and hardware dependent code on the fly!

As we moved to higher level languages for applications, we used libraries to interface to hardware. The configuration parameters became more involved with the application features than the hardware. And we have support for constants in the application languages.

Most procedural languages have something like Pascal's CONST declaration syntax for named constants.

Or us older guys remember the PARAMETER statement in FORTRAN.

There was also a DATA statement in FORTRAN. It is important to distinguish between the DATA and PARAMETER statements. The DATA statement merely specifies an initial value for a variable or array which may be altered during the course of execution. The PARAMETER statement, which came later, specifies values for constants which cannot be changed without recompiling the program.

The idea is to gather all of the constants needed in one place, tell the optimizer that they will not change so that it could do its magic when it compiled the program. But let me share a classic quote:

The primary purpose of the DATA statement is to give names to constants; instead of referring to pi as 3.141592653589793 at every appearance, the variable pi can be given that value with a DATA statement and used instead of the longer form of the constant. This also simplifies modifying the program, should the value of pi change.

— FORTRAN manual for Xerox Computers —

You cannot make up stuff like that.

What the SQL Server programming newsgroup post described is called an EAV ("Entity-Attribute-Value") table, and it is a very common schema design error for programmers who started with an OO or a loosely typed programming language.

The table looks something like this:
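A minimal sketch of such a table; the column and parameter names here are my guesses rather than the poster's actual schema, but the shape is typical:

```sql
CREATE TABLE Parameters
(parameter_name VARCHAR(50) NOT NULL PRIMARY KEY,
 parameter_type VARCHAR(10) NOT NULL,
 parameter_value VARCHAR(255) NOT NULL);

INSERT INTO Parameters (parameter_name, parameter_type, parameter_value)
VALUES ('default_invoice_cnt', 'integer', '3'),
       ('print_buffer_cnt', 'integer', '10'),
       ('last_backup_date', 'date', '2009-02-01');
```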

Did you notice that everything has to be a string? Now we need to do a lot of CAST() expressions to use the parameter_value when it is a numeric or a temporal parameter.

Try to write a single CHECK() constraint that works for all the parameters. It can be done! You need a CASE expression with WHEN clauses that include every parameter_name as part of the test. For example:
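Here is a sketch of such a constraint, using parameter names from the original problem. Every new parameter means another WHEN clause, and the CAST() calls can still blow up at run time on bad data:

```sql
ALTER TABLE Parameters
  ADD CONSTRAINT valid_parameter_values
CHECK (CASE
        WHEN parameter_name = 'default_invoice_cnt'
             AND CAST(parameter_value AS INTEGER) > 0
        THEN 'T'
        WHEN parameter_name = 'print_buffer_cnt'
             AND CAST(parameter_value AS INTEGER) BETWEEN 1 AND 1000
        THEN 'T'
        ELSE 'F' END = 'T');
```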

Next, try to write a CHECK() constraint among two or more parameters. Keep it simple; start with something like "print_buffer_cnt >= 3 * default_invoice_cnt" and then add more complex configuration rules. That CASE expression is a horrible mess, isn't it? Oh, and remember that the WHEN clauses are tested in the order they were written. If you aren't very careful, you might have them in the wrong order and some of them will never execute.

Try to write a single DEFAULT clause for all the values crammed into one parameter_value column. Impossible! DEFAULT is a scalar value and not a CASE expression. But even if a CASE expression were allowed, it would look as bad as the monolithic CASE expression used in the magical, universal, generic CHECK() constraint.

While not usually required for a parameter list, try to set up DRI among the entities. Much like not having a magical, universal DEFAULT, you cannot have a magical, universal DRI action that will CASCADE, SET NULL, RESTRICT and/or SET DEFAULT as needed. That work has to be done with a trigger.

If you thought the WHEN clauses in the single CASE expression were unmaintainable, wait until you see the “TRIGGER from Hell” code. Now maintain it. It will not optimize either! Triggers are procedural and not declarative.

A really good horror story about EAV disasters is here on Simple-Talk; it is called "Bad Carma".

Let’s get back to the original poster, whom you probably forgot about by now. Continuing on, he says:

I do not like this so much and I was thinking about creating one very simple scalar function for each parameter. Easy to modify at deployment time or with our interface application, and, more important, better for performances I guess. Something like:
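Sketched as a T-SQL scalar function, his idea would look something like this; the body is my guess at the simplest case:

```sql
CREATE FUNCTION dbo.invoice_print_cnt()
RETURNS INTEGER
AS
BEGIN
 -- hard-wired at deployment time; regenerate the function to change it
 RETURN 3;
END;
```

Queries then reference dbo.invoice_print_cnt() wherever the parameter is needed.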

This has some advantages over the EAV model. Only the required functions are invoked, rather than loading the whole EAV table. You can be sure of the proper data types, and you can add some logic with a RETURN (CASE WHEN ..) expression.

But the downside is that functions do not pass information to the optimizer. If we have a constraint like "print_buffer_cnt >= 3 * default_invoice_cnt", then we need a query or other logic inside the function to catch it. But that means a call inside the invoice_print_cnt() function to print_buffer_cnt(), or the other way around, or maybe a co-routine call between both functions. Wow! That could get tricky and perhaps lead to endless looping.

The practical solution is to write a program to do all of the inter-parameter constraint checking and then generate simple “return a value” functions like our invoice_print_cnt() example. But if you change a parameter, then you need a re-compile of the running system. Hopefully, installation options will rarely, if ever, change.

A proprietary method would be to write SQLCMD scripts with lots of IF-THEN-ELSE logic using $ install-time variables in the script. But the result cannot be modified once it is installed, except by running another script that will ALTER the procedures.
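A sketch of that approach; `:setvar` defines an install-time scripting variable and `$(..)` is expanded by the SQLCMD utility before the batch ever reaches the server:

```sql
:setvar default_invoice_cnt 3
IF $(default_invoice_cnt) > 0
   PRINT 'Installing with default_invoice_cnt = $(default_invoice_cnt)';
-- the IF-THEN-ELSE logic that generates the parameter functions goes here
```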

My suggestion is that you could create a table with all the system parameters each in its own column. The skeleton for the table would look like this:
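Something along these lines; the parameter columns and their rules are illustrative, taken from the original problem:

```sql
CREATE TABLE Configuration
(one_row_lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
   CHECK (one_row_lock = 'X'),
 default_invoice_cnt INTEGER DEFAULT 3 NOT NULL
   CHECK (default_invoice_cnt > 0),
 print_buffer_cnt INTEGER DEFAULT 10 NOT NULL
   CHECK (print_buffer_cnt > 0),
 CONSTRAINT enough_print_buffers
   CHECK (print_buffer_cnt >= 3 * default_invoice_cnt));
```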

This gets the right data types without CAST() expressions. Each column can have one or more CHECK () constraints and its own DEFAULT. You can also have constraints among the columns, as shown in the skeleton code. And finally, you could include references to other tables. I cannot come up with a good example of that in a configuration table, but someone probably has an example.

The “one_row_lock” column is a way to assure that the table has only one row in it. I’ll get back to that in a little while.

The really cool trick is to use the little-known INSERT INTO.. DEFAULT VALUES statement to reset the configuration to the defaults.

The DELETE FROM statement will clean out the table; all constraints are TRUE on an empty table. The INSERT INTO.. DEFAULT VALUES statement will do what it says – it inserts a row with all the default values of each column. We will assume that the DEFAULT values do not conflict with any CHECK () or referential constraints. Yes, I know that sounds stupid, but check it out anyway.
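Put together, and assuming the one-row configuration table described above, the reset is just two statements, sketched here as a procedure:

```sql
CREATE PROCEDURE ResetConfiguration
AS
BEGIN
 DELETE FROM Configuration;   -- empty table: every constraint is TRUE
 INSERT INTO Configuration DEFAULT VALUES;  -- one row of nothing but DEFAULTs
END;
```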

There is another advantage to this single table. You can replace the "one_row_lock" column with a configuration number and keep multiple configurations in the same table.

As it turned out, the original application can be used in a multi-company environment. They have clients who use the application to manage three or four companies of their group. The only disadvantage is that if you execute ResetConfiguration(), you will wipe out any configuration except the zero-th, which I would presume is the most general, minimal or most basic version of the application.

There is another trick using row constructors, also known as the VALUES() clause. In Standard SQL, it is more general than just one row in an INSERT INTO statement, and Microsoft is catching up. The <expression list> can now have more than constants or simple variables in SQL Server, which is all that most SQL Server programmers ever write. Function calls, computations, scalar subquery expressions or anything else that returns a proper scalar value (or an allowed NULL) is fine. It has more power than most programmers think.

In Standard SQL, you have been able to use the VALUES() clause anywhere that a derived table could be used. But like a derived table, you need to give it a dummy name. This gives us another way to prevent any updates: use a VIEW defined with a table constructor.
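For example, a frozen, read-only version of a configuration could be published this way; the VIEW and column names are illustrative:

```sql
CREATE VIEW Frozen_Configuration (default_invoice_cnt, print_buffer_cnt)
AS
SELECT default_invoice_cnt, print_buffer_cnt
  FROM (VALUES (CAST(3 AS INTEGER), CAST(10 AS INTEGER)))
       AS X (default_invoice_cnt, print_buffer_cnt);
```

Since there is no base table underneath it, there is nothing for an INSERT, UPDATE or DELETE to touch.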

Please notice that you have to use CAST() operators to assure that the data types are correct. This is not a problem with INTEGER values, but the example could have had FLOAT, REAL, DOUBLE PRECISION or DECIMAL(p,s) as options. It is also just good documentation.

The full syntax for the VALUES clause is:

VALUES (<expression list>)[, (<expression list>)]

The new (well, to Microsoft) version of VALUES allows a list of <expression list>s. Each of these row constructors becomes a row in a nameless derived table. You then use an AS <table name> [(<column name list>)] alias to reference it.

This idea extends to constant tables. A classic example would be look-up tables for functions in statistics and finance, such as the Student-t distribution for small samples. The value of (r) is the size of the sample minus one and the percentages are the confidence intervals. If you don't remember your college statistics course, then just think of your favorite look-up table.

Loosely speaking, the Student's t-distribution is the best guess at the population distribution that we can make, at a given level of confidence, when we do not know the standard deviation. William Gosset created this statistic in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose "Student" and the name stuck.

A short version of the table becomes this VIEW:
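A sketch of that VIEW, using values taken from a standard two-tailed t-table; verify them against your own statistics text before you trust them:

```sql
CREATE VIEW Student_T (r, conf_90, conf_95, conf_99)
AS
SELECT r, conf_90, conf_95, conf_99
  FROM (VALUES
         (CAST(1 AS INTEGER), CAST(6.314 AS DECIMAL(6,3)),
          CAST(12.706 AS DECIMAL(6,3)), CAST(63.657 AS DECIMAL(6,3))),
         (2, 2.920, 4.303, 9.925),
         (3, 2.353, 3.182, 5.841),
         (4, 2.132, 2.776, 4.604),
         (5, 2.015, 2.571, 4.032)
       ) AS X (r, conf_90, conf_95, conf_99);
```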

Notice that the first row has the CAST() function on all the columns. This will guarantee that all the columns in the VIEW will have the appropriate data types. You do not need to repeat it for all the rows.

Another version of this trick uses a CTE (Common Table Expression) instead of a VIEW:
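The same look-up written as a CTE; it exists only for the duration of the one statement to which it is attached:

```sql
WITH Student_T (r, conf_90, conf_95, conf_99)
AS
(SELECT r, conf_90, conf_95, conf_99
   FROM (VALUES
          (CAST(1 AS INTEGER), CAST(6.314 AS DECIMAL(6,3)),
           CAST(12.706 AS DECIMAL(6,3)), CAST(63.657 AS DECIMAL(6,3))),
          (2, 2.920, 4.303, 9.925),
          (3, 2.353, 3.182, 5.841)
        ) AS X (r, conf_90, conf_95, conf_99))
SELECT r, conf_95
  FROM Student_T
 WHERE r = 3;
```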

Most look-up tables should be in VIEWs since it is very likely that more than one query is going to use them.