Avoiding the EAV of Destruction

A forum posting, from someone who wanted a better solution to the common problem of handling global settings in a database, leads Joe Celko into a fascinating discussion of the ramifications of the various solutions.

This interesting problem appeared on the SQL Server programming newsgroup in early February 2009. Let me just paste in the quote that started the thread.

Hello, in a database we deploy in our customer’s company, we need to use a lot of parameters. For example, how many invoices should we print by default. These parameters are used in a lot of queries, stored procedures, etc … Do you know a best design for this? Now, the parameters are stored in a table with a name, a type and several columns corresponding to each type..

Before talking about the possible solutions, I want to talk about the mindset that leads to each solution. It is not enough to have an answer; you need to know what the process was that led to a solution. Math majors appreciate that it is not enough to just come up with a final answer. You have to show your work and give a proof that this is a good answer.

We have all used some kind of configuration file in a software package. A lot of us have probably written one. The C language has a preprocessor that can look at flags in the command line and do conditional compilations. Since C is a low-level systems language, most of the conditional compilations were aimed at the hardware – what kind of disk drives and how many of them were in the system. The preprocessor code was often more complex than the final C program you compiled when all the macros were expanded. You were writing device drivers and hardware dependent code on the fly!

As we moved to higher level languages for applications, we used libraries to interface to hardware. The configuration parameters became more involved with the application features than the hardware. And we have support for constants in the application languages.

Most procedural languages have something like Pascal's CONST declaration syntax for named constants.

Or us older guys remember the PARAMETER statement in FORTRAN.

There was also a DATA statement in FORTRAN. It is important to distinguish between the DATA and PARAMETER statements. The DATA statement merely specifies an initial value for a variable or array which may be altered during the course of execution. The PARAMETER statement, which came later, specifies values for constants which cannot be changed without recompiling the program.

The idea is to gather all of the constants needed in one place, tell the optimizer that they will not change so that it could do its magic when it compiled the program. But let me share a classic quote:

The primary purpose of the DATA statement is to give names to constants; instead of referring to pi as 3.141592653589793 at every appearance, the variable pi can be given that value with a DATA statement and used instead of the longer form of the constant. This also simplifies modifying the program, should the value of pi change.

— FORTRAN manual for Xerox Computers —

You cannot make up stuff like that.

What the SQL Server programming newsgroup post described is called an EAV ("Entity-Attribute-Value") table, and it is a very common schema design error for programmers who started with an OO or a loosely typed programming language.

The table looks something like this:
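A minimal sketch of such a table; the column and parameter names here are my guesses rather than the poster's actual schema, but the shape is typical:

```sql
CREATE TABLE Parameters
(parameter_name VARCHAR(50) NOT NULL PRIMARY KEY,
 parameter_type VARCHAR(10) NOT NULL,
 parameter_value VARCHAR(255) NOT NULL);

INSERT INTO Parameters (parameter_name, parameter_type, parameter_value)
VALUES ('default_invoice_cnt', 'integer', '3'),
       ('print_buffer_cnt', 'integer', '10'),
       ('last_backup_date', 'date', '2009-02-01');
```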

Did you notice that everything has to be a string? Now we need to do a lot of CAST() expressions to use the parameter_value when it is a numeric or a temporal parameter.

Try to write a single CHECK() constraint that works for all the parameters. It can be done! You need a CASE expression with WHEN clauses that include every parameter_name as part of the test. For example:
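Here is a sketch of such a constraint, using parameter names from the original problem. Every new parameter means another WHEN clause, and the CAST() calls can still blow up at run time on bad data:

```sql
ALTER TABLE Parameters
  ADD CONSTRAINT valid_parameter_values
CHECK (CASE
        WHEN parameter_name = 'default_invoice_cnt'
             AND CAST(parameter_value AS INTEGER) > 0
        THEN 'T'
        WHEN parameter_name = 'print_buffer_cnt'
             AND CAST(parameter_value AS INTEGER) BETWEEN 1 AND 1000
        THEN 'T'
        ELSE 'F' END = 'T');
```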

Next, try to write a CHECK() constraint among two or more parameters. Keep it simple; start with something like "print_buffer_cnt >= 3 * default_invoice_cnt" and then add more complex configuration rules. That CASE expression is a horrible mess, isn't it? Oh, and remember that the WHEN clauses are tested in the order they were written. If you aren't very careful, you might have them in the wrong order and some of them will never execute.

Try to write a single DEFAULT clause for all the values crammed into one parameter_value column. Impossible! DEFAULT is a scalar value and not a CASE expression. But even if a CASE expression were allowed, it would look as bad as the monolithic CASE expression used in the magical, universal, generic CHECK() constraint.

While not usually required for a parameter list, try to set up DRI among the entities. Much like not having a magical, universal DEFAULT, you cannot have a magical, universal DRI action that will CASCADE, SET NULL, RESTRICT and/or SET DEFAULT as needed. That work has to be done with a trigger.

If you thought the WHEN clauses in the single CASE expression were unmaintainable, wait until you see the “TRIGGER from Hell” code. Now maintain it. It will not optimize either! Triggers are procedural and not declarative.

A really good horror story about EAV disasters is here on Simple-Talk; it is called "Bad Carma".

Let’s get back to the original poster, whom you probably forgot about by now. Continuing on, he says:

I do not like this so much and I was thinking about creating one very simple scalar function for each parameter. Easy to modify at deployment time or with our interface application, and, more important, better for performances I guess. Something like:
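Sketched as a T-SQL scalar function, his idea would look something like this; the body is my guess at the simplest case:

```sql
CREATE FUNCTION dbo.invoice_print_cnt()
RETURNS INTEGER
AS
BEGIN
 -- hard-wired at deployment time; regenerate the function to change it
 RETURN 3;
END;
```

Queries then reference dbo.invoice_print_cnt() wherever the parameter is needed.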

This has some advantages over the EAV model. Only the required functions are invoked, rather than loading the whole EAV table. You can be sure of the proper data types, and you can add some logic with a RETURN (CASE WHEN ..) expression.

But the downside is that functions do not pass information to the optimizer. If we have a constraint like "print_buffer_cnt >= 3 * default_invoice_cnt", then we need a query or other logic inside the function to catch it. But that means a call inside the invoice_print_cnt() function to print_buffer_cnt(), or the other way around, or maybe a co-routine call between both functions. Wow! That could get tricky and perhaps lead to endless looping.

The practical solution is to write a program to do all of the inter-parameter constraint checking and then generate simple “return a value” functions like our invoice_print_cnt() example. But if you change a parameter, then you need a re-compile of the running system. Hopefully, installation options will rarely, if ever, change.

A proprietary method would be to write SQLCMD scripts with lots of IF-THEN-ELSE logic using $ install-time variables in the script. But the result cannot be modified once it is installed, except by running another script that will ALTER the procedures.
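A sketch of that approach; `:setvar` defines an install-time scripting variable and `$(..)` is expanded by the SQLCMD utility before the batch ever reaches the server:

```sql
:setvar default_invoice_cnt 3
IF $(default_invoice_cnt) > 0
   PRINT 'Installing with default_invoice_cnt = $(default_invoice_cnt)';
-- the IF-THEN-ELSE logic that generates the parameter functions goes here
```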

My suggestion is that you could create a table with all the system parameters each in its own column. The skeleton for the table would look like this:
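Something along these lines; the parameter columns and their rules are illustrative, taken from the original problem:

```sql
CREATE TABLE Configuration
(one_row_lock CHAR(1) DEFAULT 'X' NOT NULL PRIMARY KEY
   CHECK (one_row_lock = 'X'),
 default_invoice_cnt INTEGER DEFAULT 3 NOT NULL
   CHECK (default_invoice_cnt > 0),
 print_buffer_cnt INTEGER DEFAULT 10 NOT NULL
   CHECK (print_buffer_cnt > 0),
 CONSTRAINT enough_print_buffers
   CHECK (print_buffer_cnt >= 3 * default_invoice_cnt));
```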

This gets the right data types without CAST() expressions. Each column can have one or more CHECK () constraints and its own DEFAULT. You can also have constraints among the columns, as shown in the skeleton code. And finally, you could include references to other tables. I cannot come up with a good example of that in a configuration table, but someone probably has an example.

The “one_row_lock” column is a way to assure that the table has only one row in it. I’ll get back to that in a little while.

The really cool trick is to use the little-known INSERT INTO.. DEFAULT VALUES statement to reset the configuration to the defaults.

The DELETE FROM statement will clean out the table; all constraints are TRUE on an empty table. The INSERT INTO.. DEFAULT VALUES statement will do what it says – it inserts a row with all the default values of each column. We will assume that the DEFAULT values do not conflict with any CHECK () or referential constraints. Yes, I know that sounds stupid, but check it out anyway.
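Put together, and assuming the one-row configuration table described above, the reset is just two statements, sketched here as a procedure:

```sql
CREATE PROCEDURE ResetConfiguration
AS
BEGIN
 DELETE FROM Configuration;   -- empty table: every constraint is TRUE
 INSERT INTO Configuration DEFAULT VALUES;  -- one row of nothing but DEFAULTs
END;
```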

There is another advantage to this single table. You can replace the "one_row_lock" column with a configuration number and keep multiple configurations in the same table.

As it turned out, the original application can be used in a multi-company environment. They have clients who use the application to manage three or four companies of their group. The only disadvantage is that if you execute ResetConfiguration(), you will wipe out any configuration except the zero-th, which I would presume is the most general, minimal or most basic version of the application.

There is another trick using row constructors, also known as the VALUES() clause. In Standard SQL, it is more general than just one row in an INSERT INTO statement, and Microsoft is catching up. The <expression list> can now have more than constants or simple variables in SQL Server, which is all that most SQL Server programmers ever write. Function calls, computations, scalar subquery expressions or anything else that returns a proper scalar value (or an allowed NULL) is fine. It has more power than most programmers think.

In Standard SQL, you have been able to use the VALUES() clause anywhere that a derived table could be used. But like a derived table, you need to give it a dummy name. This gives us another way to prevent any updates: use a VIEW defined with a table constructor.
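For example, a frozen, read-only version of a configuration could be published this way; the VIEW and column names are illustrative:

```sql
CREATE VIEW Frozen_Configuration (default_invoice_cnt, print_buffer_cnt)
AS
SELECT default_invoice_cnt, print_buffer_cnt
  FROM (VALUES (CAST(3 AS INTEGER), CAST(10 AS INTEGER)))
       AS X (default_invoice_cnt, print_buffer_cnt);
```

Since there is no base table underneath it, there is nothing for an INSERT, UPDATE or DELETE to touch.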

Please notice that you have to use CAST() operators to assure that the data types are correct. This is not a problem with INTEGER values, but the example could have had FLOAT, REAL, DOUBLE PRECISION or DECIMAL(p,s) as options. It is also just good documentation.

The full syntax for the VALUES clause is:

VALUES (<expression list>)[, (<expression list>)]

The new (well, to Microsoft) version of VALUES allows a list of <expression list>s. Each of these row constructors becomes a row in a nameless derived table. You then use an AS <table name> [(<column name list>)] alias to reference it.

This idea extends to constant tables. A classic example would be look-up tables for functions in statistics and finance, such as the Student-t distribution for small samples. The value of (r) is the size of the sample minus one and the percentages are the confidence intervals. If you don't remember your college statistics course, then just think of your favorite look-up table.

Loosely speaking, the Student's t-distribution is the best guess at the population distribution that we can make, at a given level of confidence, when we do not know the standard deviation. William Gosset created this statistic in 1908. His employer, Guinness Breweries, required him to publish under a pseudonym, so he chose "Student" and the name stuck.

A short version of the table becomes this VIEW:
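A sketch of that VIEW, using values taken from a standard two-tailed t-table; verify them against your own statistics text before you trust them:

```sql
CREATE VIEW Student_T (r, conf_90, conf_95, conf_99)
AS
SELECT r, conf_90, conf_95, conf_99
  FROM (VALUES
         (CAST(1 AS INTEGER), CAST(6.314 AS DECIMAL(6,3)),
          CAST(12.706 AS DECIMAL(6,3)), CAST(63.657 AS DECIMAL(6,3))),
         (2, 2.920, 4.303, 9.925),
         (3, 2.353, 3.182, 5.841),
         (4, 2.132, 2.776, 4.604),
         (5, 2.015, 2.571, 4.032)
       ) AS X (r, conf_90, conf_95, conf_99);
```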

Notice that the first row has the CAST() function on all the columns. This will guarantee that all the columns in the VIEW will have the appropriate data types. You do not need to repeat it for all the rows.

Another version of this trick uses a CTE (Common Table Expression) instead of a VIEW:
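The same look-up written as a CTE; it exists only for the duration of the one statement to which it is attached:

```sql
WITH Student_T (r, conf_90, conf_95, conf_99)
AS
(SELECT r, conf_90, conf_95, conf_99
   FROM (VALUES
          (CAST(1 AS INTEGER), CAST(6.314 AS DECIMAL(6,3)),
           CAST(12.706 AS DECIMAL(6,3)), CAST(63.657 AS DECIMAL(6,3))),
          (2, 2.920, 4.303, 9.925),
          (3, 2.353, 3.182, 5.841)
        ) AS X (r, conf_90, conf_95, conf_99))
SELECT r, conf_95
  FROM Student_T
 WHERE r = 3;
```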

Most look-up tables should be in VIEWs since it is very likely that more than one query is going to use them.