SQL and the Snare of Three-Valued Logic

The whole subject of the Three-Valued (also known as ternary, trivalent or 3VL) Logic of SQL tends to trip people up. This is hardly surprising in view of the fact that it involves an esoteric Polish mathematician and because it behaves differently in the DDL (Data Declaration Language) and the DML (Data Manipulation Language). In response to requests, Joe Celko comes to the rescue and makes it all seem simple.

At the end of my article on Relational Division, one of the comments, by a reader named Dean, asked if the sentence “There ain’t no planes in this hangar that I can’t fly!” is a double negative or a triple? And he asked for an article on Boolean logic next.

Here it is! But the problem is that SQL has a three valued logic (3VL) and not the two values of Mr. George Boole. In fact, SQL logic is complicated.

NULLs

Like many problems, this starts with NULLs.  Dr. Codd put the concept into the first version of the Relational Model.  It is not a value; it is a “place holder” for a value.  The history of the NULL is interesting.  When Dr. Codd and Chris Date were business partners, Date was opposed to NULLs and this was a great debate in those days. 

Later Dr. Codd did a second version of the Relational Model which has two kinds of NULLs.  One kind of NULL marks values which are missing because the value is unknown, and the other kind marks values that are missing because the attribute is missing.  For example, my hat size exists (I have a head) but might not be known so it is the first kind of NULL; my hair color does not exist since I am bald as a cue ball, so it is the second kind of NULL.  FirstSQL is the only product which implemented Dr. Codd’s second model. 

SQL followed Dr. Codd’s first model and has a single NULL. But much like FORTRAN follows algebra; SQL had to make adjustments from the mathematical model.  In the Relational Model, a NULL has no data type.  In SQL, the compiler has to know about storage requirements so you can write “CAST (NULL AS <data type>)” to pass that information. 

NULLs have certain basic properties:

  1. All SQL data types are NULL-able.  This is one reason why IDENTITY is a table property and not a data type.  The other reason is that a table can have only one IDENTITY column in it.  The count of PHYSICAL insertion attempts (NOT  successes) is not an attribute; it is audit meta-data and has no place in RDBMS. 

    This is why the first implementations of BIT which were assembly language bits were not a data type.  When BIT became a numeric data type, then things were Kosher. 
     

  2. NULLs propagate.  If you use a NULL in an expression, then result is a NULL.  In numeric expressions, we had questions about priorities, but propagation is vital.  In particular:
    “NULL / 0” = NULL or “division by zero” error?

    “0 / NULL” = NULL or 0, since this can only resolve to zero for any value.  No, I am not going to tell you; test it yourself in Query Analyzer or SSMS!  But can you figure it out from principles? 

TRUE, FALSE and UNKNOWN

A comparison between known values gives you a result of TRUE or FALSE.  This is Boolean logic.  If you get a chance look at some of the Syd Harris cartoons about George Boole at S. Harris Computer Cartoons

But that is classic two-valued logic (2VL); all the values are known.  When you do a comparison with a NULL, you cannot get a Boolean (i.e. TRUE or FALSE) result. 

This is where we invented the logical value UNKNOWN.  Well, re-discovered it.  There were already a lot of multi-valued logics in the mathematical literature.  Some of them are based on discrete values and some on continuous values (i.e. fuzzy logic).  SQL looks like the system developed by Polish logician Jan Åukasiewicz (the L-bar is a “W” sound in English; you can get the rest of it at Wikipedia: Jan Åukasiewicz).  Programmers know him as the guy who invented Polish Notation and inspired Reverse Polish Notation for HP calculators and stack architecture (aka zero address machines) computers like the old Burroughs machines.  SQL has three logical operations, which are found in all programming languages.  These are extended to three values:  

OR

TRUE

FALSE

UNKNOWN

TRUE

TRUE

TRUE

TRUE

FALSE

TRUE

FALSE

UNKNOWN

UNKNOWN

TRUE

UNKNOWN

UNKNOWN

AND

TRUE

FALSE

UNKNOWN

TRUE

TRUE

FALSE

UNKNOWN

FALSE

FALSE

FALSE

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

NOT

 

TRUE

FALSE

FALSE

TRUE

UNKNOWN

UNKNOWN

In the Åukasiewicz multi-valued logic systems, the AND, OR and NOT are almost the same as in SQL’s three valued logic.  The general case is based on the following Polish notation formulas in which 1 is TRUE, 0 is FALSE and fractions are the other values.

  • Cpq = 1          for p <= q
  • Cpq = 1 - p + q  for (p > q)
  • Np = 1 - p

N is the negation operator, C is the implication operator.

From this pair, we define all other operators

  • Opq = CCpqq     Or
  • Apq = NONpNq    And
  • Epq = ACpqCqp   Equivalence
  • Mp = CNpp       maybe (not FALSE)

Then Tarski and Lukasiewicz had a lot of rules of inference to prove theorems.  David McGoveran pointed out that SQL is not really a logic system because it lacks rules for deductions and proofs.  The idea is that the UNKNOWN might be resolved to {TRUE, FALSE}, so can we sometimes determine the result regardless of how one value resolves.  If we cannot determine a result, then return UNKNOWN. 

Let me sum up.  SQL is not a formal logical system.  We have no inference rules and what we informally call “predicates” are actually “search conditions” in the ANSI/ISO Standards.  What we have is a collection of look-up tables to compute a value of {TRUE, FALSE, UNKNOWN}. 

This is a subtle difference for anyone who is not a math major, but an inference system is like those axioms, postulates and theorems you had in High School geometry.  You need an implication operation.  In regular 2VL, this is shown as a double tailed right arrow ” =>” and it is defined by the look-up table in 2VL, where the column is the first operand:

IMPLIES

TRUE

FALSE

TRUE

TRUE

FALSE

FALSE

TRUE

TRUE

Or informally, a TRUE premise cannot imply a FALSE conclusion.  A FALSE premise can imply anything.  This 2VL model can also be written as:

  • (A IMPLIES B) = NOT (A AND NOT B)

But if you apply De Morgan’s Law, it can also be written as

  • (A IMPLIES B) = (NOT(A) OR B)

Now expand these 2VL rules into 3VL rules.

IMPLIES-1

TRUE

FALSE

UNKNOWN

TRUE

TRUE

FALSE

UNKNOWN

FALSE

TRUE

TRUE

UNKNOWN

UNKNOWN

UNKNOWN

FALSE

UNKNOWN

IMPLIES-2

TRUE

FALSE

UNKNOWN

TRUE

TRUE

TRUE

TRUE

FALSE

TRUE

TRUE

TRUE

UNKNOWN

UNKNOWN

UNKNOWN

UNKNOWN

Oops!  They are not the same.  The premise “X IMPLIES {FALSE, UNKNOWN}” could argue that a FALSE premise X can imply anything, TRUE, FALSE or UNKNOWN.  Or that propagation of unknown values applies, so a FALSE premise can lead to a TRUE conclusion. Make a decision and design a logical system for yourself. 

We avoided it in the SQL Standards. We did not do implication and the other rules.  This is a problem for those of us doing an optimizer for SQL. 

BIT Flags versus Predicates

One of the major problems for those people learning SQL is that they have not un-learned a mindset based originally on punch cards and magnetic tapes instead of RDBMS.  Even if they never saw a punch card or a magnetic tape, they think of data processing as a sequence of procedural steps that depend on sequentially ordered data.  At each step the data is moved to another file, scratch tape or temporary table.

Did you ever think about the people who use a singular table name instead of a collective or plural name?  In a file system, you process one record at time.  Calling the current record “Employee” makes sense.  But in RDBMS, we work with the entire table (set) all at once.  Therefore we call that table “Personnel” instead. 

We used bit flags in the Dark Ages because we lacked storage and computing power.  We would discover a fact that was TRUE at one point in time and then set a flag to mark that state within the system.  A classic example is a deletion flag at the start of a tape file record.  When the system read it, it would skip forward on the tape to the next record.  Eventually, the deleted records would be removed when the data was copied over to a new tape. 

An embarrassing story of my own was writing a small restaurant scheduling program in BASIC on an early Personal Computer.  I used a flag to show if the staff member was at least 18 years old which was the legal age to consume and serve alcoholic beverages at the time.  The legal age then changed to 21 and my program cheerfully scheduled an entire shift of now under-aged servers on the night that the state Alcoholic Beverage Control agency decides to check on the new law.  My client was not happy. 

If I had SQL back then, I would have used the employee birthdates to set up the shifts.  A simple two keystrokes in a VIEW declaration would have done it. 

SQL is a “predicate language” by its nature in the same way that FORTRAN is an “algebra language” by its nature.  Because the data is shared, and in constant motion, you cannot ever trust that the flags will actually show the current state of the data.  You have to test it when you do something. Silly bit-flags work in file systems because nobody else can physically read the tape as it goes thru its sequential process. 

DDL versus DML

As if all of this was not weird enough, SQL’s 3VL behaves differently in the DDL (Data Declaration Language) and the DML (Data Manipulation Language). 

When you declare a column in a table in the DDL, you have the options of making in NOT NULL, giving it a DEFAULT value, using it in a CHECK() constraint and/or adding DRI actions.  Those options are one of the many reasons that a column in RDBMS is different from a field in a file system. 

When I use a CHECK() constraint in the DDL and some or all of the columns are NULL-able, the search condition can return UNKNOWN.  This is fine because DDL accepts {TRUE, UNKNOWN} as the same and will permit the data to persist. 

We did that because adding the extra search conditions to test for NULLs would be a screaming nightmare.  Try it. 

In DML, we are not so forgiving.  DML sees {FALSE, UNKNOWN} as the same and will reject the data, keeping only that which tests TRUE. 

This can be weird.  You put something into a table and you need to work extra hard to get it back out because of NULLs and 3VL.  That is why we SQL pros keep telling you to not make a column NULL-able unless it makes sense in the data model. 

Shorthand

SQL is a higher level language so it has shorthands in it.  But they can all be expanded out into combinations of {AND, OR, NOT}.  Beginners in SQL often do not know how to use them and so tend to stick with the familiar {AND, OR, NOT} constructs they know from earlier languages like FORTRAN or BASIC. This is a bad way to write code.  It hides the higher level logic and in better SQL engines, the shorthands are optimized.  

BETWEEN

This search condition is equivalent to two comparisons:

(x BETWEEN a AND b) = ((a < = x) AND (x < = b))

The early optimizers simply expanded it out this way.  More sophisticated optimizers look at it as a 3-ary logical operator and optimize for the “between-ness” that seems to be expected

IN()

The IN()n search condition has two forms.    

<expression> [NOT] IN (<expression list>)
<expression> [NOT] IN (<single column select stmt>)

This is a bit tricky.  With either version we get a single column of expressions and compare the left hand expression to each of those values with an OR.  In effect,

x IN (1, 2, 3)

…is shorthand for …

((x = 1) OR (x = 2) OR (x = 3))

Likewise,

x NOT IN (1, 2, 3)

… is shorthand for …

NOT ((x = 1) OR (x = 2) OR (x = 3))

Which can be re-written with DeMorgan’s law to:

((x <> 1) AND (x <> 2) AND (x <> 3))

Now try to follow the same rule with

x IN (1, 2, NULL)

…is shorthand for…

((x = 1) OR (x = 2) OR (x = NULL))
((x = 1) OR (x = 2) OR UNKNOWN)  -- propagate NULLs

Let’s assume that one of the ORs is TRUE. So the other is FALSE

(TRUE OR FALSE OR UNKNOWN) 
(TRUE OR UNKNOWN) 
(TRUE) 

The only way this can return FALSE would be for x to be something other than {1, 2}.  The only way for it to return UNKNOWN would be for x to be NULL. 

Repeat the exercise with

x NOT IN (1, 2, NULL)

…is shorthand for…

NOT ((x = 1) OR (x = 2) OR (x = NULL))
NOT ((x = 1) OR (x = 2) OR UNKNOWN) -- propagate NULLs
((x <> 1) AND (x <> 2) AND NOT UNKNOWN) -- DeMorgan
(UNKNOWN) -- definition of AND

So if this was used in DDL, rows would be allowed in the table, but will always be rejected by the DML.

SOME|ANY()

There is an underused generalization of the IN() of the form:

<expression> <theta op> [SOME | ANY] (<single column select stmt>)

This is shorthand for a chain of ORs:

(<expression> <theta op> <expression_1>
  OR <expression> <theta op> <expression_2>
  OR...
<expression> <theta op> <expression_n>)

There is no difference between SOME and ANY, but sometimes the search condition reads a little better to a human. 

The IN() search condition is the case where the <theta op> is equality.  All of the 3VL problems you had with IN() are also here.

ALL

There is really, really underused (almost unknown) search condition that generalized a chain of ANDs:

<expression> <theta op> ALL (<single column select stmt>)

This is shorthand for a chain of ANDs:

(<expression> <theta op> <expression_1>
AND <expression> <theta op> <expression_2>
AND...
<expression> <theta op> <expression_n>)

The ALL search condition does not play well with the extrema functions (e.g. MAX, MIN). 

It is counter-intuitive at first that these two search condition are not the same in SQL:

x >= (SELECT MAX(y) FROM Foobar)
x >= ALL (SELECT y FROM Foobar)

But you have to remember the rules for the extrema functions — they drop out all the NULLs before returning the greater, greatest, or least values.  The (<single column select stmt>) does not drop NULLs, so you can get them in the results.  This can leave a NULL to give us an UNKNOWN. 

EXISTS()

The EXISTS() search condition is on of the very few Boolean operator we have in SQL.  It returns {TRUE, FALSE} and never UNKNOWN. This is because it works at the table level and not at the column level.  Compare this to COUNT(*) which returns the cardinality of the whole table as opposed to COUNT(<expression>).  They look alike but are totally different. 

Summary

I tell people that they need to learn to think in sets to get good with SQL.  Give up your sequential, procedural record-at-a-time mindset and trade up to declarative set processing!  See the light!  Leave the darkness! 

I probably ought to spend more time on 3VL since it is also something they never saw before.  And it is not easy.