One of the hardest concepts in learning SQL is the meaning of a NULL. Traditionally, programming languages had no concept of missing or UNKNOWN data. The closest example that most programmers ran into was the ‘not applicable’ flags in spreadsheets, or the classic TBD flags for undetermined instructors, locations, or other things in print outs.
Dr. Codd defined a NULL as the lack of a value, so talking about NULL values is wrong. In SQL, however, we have to worry about physical storage. This means we have to know the data type of the column which is holding the NULL so the compiler can do its job. From that requirement, it logically follows that we can write CAST(NULL AS < data type>)
and not just depend on automatic type conversions.
SQL is notorious for its three-value logic {TRUE, FALSE, UNKNOWN}
which results from trying to compare a NULL to something, including another NULL. Since it is not a value, NULL <> NULL
is UNKNOWN, but likewise so is NULL = NULL! This why we have the predicate <expression> IS [NOT] NULL
to check for a NULL. This predicate and the EXISTS()
function are some of the few predicates in SQL that can only return {TRUE, FALSE}.
1 2 3 4 5 |
-- Returns 'NULL IS NULL’ IF NULL IS NULL PRINT 'NULL IS NULL' ELSE PRINT 'NULL IS NOT NULL'; |
NULLs in DDL
SQL defines a PRIMARY KEY(<column list>)
as being implicitly declared NOT NULL. It is probably a good idea to go ahead and put the NOT NULL in your table declarations anyway. If the PRIMARY KEY constraint changes, then you are still safe. According to Dr. Codd and relational theory, to be a real table it must have a key. Originally, Dr. Codd said a PRIMARY KEY had to be designated, but later realized that a key is a key, so there’s no need to designate something special about one of them. This was another leftover from trying to implement relational systems on top of old file systems. In file systems, records (which are nothing like rows) come in a linear search sequence in physical storage, so tapes had to be sorted. The original sort key became the PRIMARY KEY in the new SQL products. Random access on unsorted magnetic tapes is technically possible, but it really doesn’t work.
But the relational model accommodates multiple keys in the same table. The syntax we picked for non- primary keys was a little strange. We added the UNIQUE(<column list>)
constraint which guarantees that all the rows in the table will be different. It also allows more than one column, but the columns involved can have NULLs. Remember that in the relational model, a key can’t have NULLs, But you’re only allowed to have one NULL-ed row, as if it were a value.
All of these strange rules come from the GROUP BY
clause. Without lapsing into a college algebra lesson, we have two equivalence relations, as they are called in set theory. The first is just regular old equals (=) with the extra rules about NULLs. The second relation is the GROUP BY
, in which all the NULLs are put into one equivalence class.
1 2 3 4 5 6 7 8 9 10 |
CREATE TABLE Foobar (foo_id CHAR(5) NOT NULL PRIMARY KEY, a1 INTEGER, a2 INTEGER, a3 INTEGER, CHECK ((a1 +a2 +a3) < 10) ); INSERT INTO Foobar VALUES ('test1', 1, 1, 1); --- works! INSERT INTO Foobar VALUES ('test2', 10, 10, 10); --- fails! INSERT INTO Foobar VALUES ('test3', NULL, 1, 1); --- works! |
However, in the DDL, the UNKNOWN result of the search condition in a CHECK(<search condition>)
constraint is given the ‘benefit of the doubt’ and treated the same as TRUE.
NULLs in DML
When you use it in the DML statements, it treats the UNKNOWN result the same as a FALSE. Query updates and inserts have a stronger criterion.
1 2 3 |
SELECT foo_id FROM Foobar WHERE (a1 +a2 +a3) < 10; |
returns only test1 as a result. The test2 row failed to insert. The test3 row became
“WHERE (NULL + 1 +1) < 10”
“WHERE (NULL + 2) < 10”
“WHERE NULL < 10”
“WHERE UNKNOWN” or “WHERE FALSE”
Replacing NULLs with actual values is very often handy. The SQL Server/Sybase family originally had the function ISNULL(<expression>, <non-NULL value>)
. If the <expression>
was NULL then it returns the <value>
. The data type of the result is taken from the first parameter. That particular choice can make for some funny results when parameters are not the same data types.
Microsoft now has the ANSI/ISO Standard COALESCE(<expression list>)
. The first thing to notice is that COALESCE
takes a list of expressions, which it then parses left to right to determine the highest data type in the list, which becomes the data type of the result. The list is then parsed again from left to right to find the first non-NULL value, which is returned and cast to the result data type. One common mistake beginners make with this is to put a higher data type in the list when it’s not what they really meant.
I’m often asked why we chose the word “coalesce” instead of something else. We were trying to come up with the word that would be descriptive but not so common that it might be misunderstood or used as a column name. Phil Shaw of IBM pulled out a pocket thesaurus and started going down synonyms until he came up with this.
Grouping NULLs
A common way to use the grouping relation to return TRUE if two expressions are both NULL or their values match:
1 2 3 4 5 6 |
CASE WHEN COALESCE(foobar, 'weird value that does not occur in the database') = COALESCE(barfoo, 'weird value that does not occur in the database') THEN 0 ELSE 1 END = 0 |
What if there is no ‘weird value’ you can use for the COALESCE check? You just need to add more search conditions:
1 2 3 4 |
CASE WHEN (foo = bar OR (foo IS NULL AND bar IS NULL)) THEN 0 ELSE 1 END = 0 |
While SQL Server does not yet have this construct, the SQL Standards added another comparison operator for this problem. This feature was introduced in two steps: SQL:1999 added T151, DISTINCT
predicate. The optional negation with NOT
was added by SQL:2003 as feature T152, DISTINCT
predicate with negation.
1 |
<expression> IS [NOT] DISTINCT FROM <expression> |
Note that you have to use the negated form to get the results you want. The un-negated form is not really equality. This is easy to see with a truth table:
A |
B |
A = B |
A IS NOT DISTINCT FROM B |
0 |
0 |
TRUE |
TRUE |
0 |
1 |
FALSE |
FALSE |
0 |
NULL |
UNKNOWN |
FALSE |
NULL |
NULL |
UNKNOWN |
TRUE |
Set Operators and NULLs
SQL has set operators (UNION
, INTERSECT
, EXCEPT
) which work on table expressions. Both tables have to be what we call ‘union compatible,’ which means the tables have the same structure (the corresponding columns in each table are in the same order and have compatible datatypes) , and the result will have that structure. Most people don’t know that technically the result table does not have a name nor do the columns unless you actually assigned them with a <set expression> AS <table name> (column name list>)
construct.
Set operators discard the duplicate rows and use the grouping rather than the equality relationship to discard multiple NULLs as well as duplicate values. The EXCEPT
and INTERSECT
operators also work in this way.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 |
CREATE TABLE T1 (x INTEGER); INSERT INTO T1 VALUES (NULL); CREATE TABLE T2 (y INTEGER); INSERT INTO T2 VALUES (NULL); -- Returns one NULL result SELECT * FROM T1 UNION SELECT * FROM T2; -- Returns one NULL result SELECT * FROM T1 INTERSECT SELECT * FROM T2; -- Returns an empty table SELECT * FROM T1 EXCEPT SELECT * FROM T2; |
OUTER JOIN and NULLs
OUTER
JOINs
come in three flavors (LEFT
, RIGHT
, FULL
) and were designed to solve an actual problem. Before the SQL-99 Standard, there was no standardized syntax for them nor a Standard definition of how they would work. Sybase in SQL Server did it one way and Oracle did it another way, then there was a product from a company called Gupta Technologies LLC that let you pick which one you wanted to use. I’m going to assume that everyone knows how an outer join works. The ‘preserved table’ is the one on the left (or right or both) side of the join operator, and the unpreserved table, if any, is the one on the opposite side. Since all datatypes in SQL must be NULLable, the values that did not match the join condition in the preserved table can be padded out with NULLs. It doesn’t matter if the original columns in the preserved table were declared NOT
NULL
because the result of the join is technically a whole new table.
OLAP and NULLs
When OLAP (online analytical processing) databases first came in, ANSI responded by defining some basic hierarchical aggregations in ANSI/ISO Standard SQL-99. They are defined as extensions to the GROUP
BY
clause. The original three were GROUPING
SET
, ROLLUP
and CUBE
. The last two are defined in the standard using the GROUPING
SET
construct. Rollup and cube are often called ‘super groups’ because they could be defined using the regular GROUP
BY
and UNION
operators. As expected, the NULLs form their own group. However, we now have a special function, GROUPING
(<column reference>)
, which returns a one if the column was created by the operation or zero otherwise. Then SQL-2003 added a multicolumn version, GROUPING
_ID
, that constructs a binary number from the zeros and ones in the columns in the list, using an implementation defined exact numeric data type; this is handy as you think.
These constructs allow you to do what we used to call ‘control break reports’ back in the pre-SQL days. You put a sequential file in sorted order and pass it through a program that would keep running totals in accumulators (an old term that actually used to refer to a physical feature in unit record equipment). When the controls (the columns at various levels in the hierarchy) changed, the accumulators were printed out and reset. Most of the time, frankly, this was doing running totals.
This is probably easier to see with an example. Imagine we have a file that gives us a region number and the city name along with the total sales for that city. We want to get a report that shows us the totals by region, the totals by city within the regions, and finally, a grand total for all sales in the company.
1 2 3 4 |
SELECT region_nbr, city_name, SUM(sale_amt) AS sales_tot FROM Regional_Sales GROUP BY ROLLUP (region_nbr, city_name) ORDER BY region_nbr, city_name; |
That’s assuming that there is some sample data, the output might look like this (the right-hand most column is a comment not part of the output)
… |
… |
… |
comment |
‘006’ |
‘Austin’ |
500.13 |
city within region total |
‘006’ |
‘Dallas’ |
2010060.5 |
|
‘006’ |
‘San Antonio’ |
475.01 |
|
… |
… |
… |
… |
‘006’ |
NULL |
1190902.75 |
region total |
… |
… |
… |
|
NULL |
NULL |
3426563.75 |
grand total |
The general rule in SQL is not to do data formatting in the database tier. You pass the results of the database layer to a presentation layer and that layer adds the colors, labels, does any weird filters and calculations, etc. The purpose of the database tier is just raw data. However, having said that you can write something like this:
1 2 3 4 5 6 7 8 |
SELECT CASE WHEN GROUPING(region_nbr) =1 THEN ‘Region Total’ ELSE region_nbr END, CASE WHEN GROUPING(city_name) = 1 THEN ‘City Total’ ELSE ‘City Name’ END, SUM(sale_amt) AS sales_tot FROM Regional_Sales GROUP BY ROLLUP (region_nbr, city_name) ORDER BY region_nbr, city_name; |
Again, you are doing something that is not recommended.
Avoiding NULLs
We debated this in the early days of ANSI X3H2. Some of the early products use the regular equality for their groupings, so each NULL became its own group. This did not work out so well. The example we had in the committee was a table of traffic tickets issued in California. Quite logically, the database designer used NULL for missing auto tags; the words none, nothing, missing, etc., (in multiple languages!) had been used on prestige tags, and the way the law was written, they were perfectly legal. There was no special checkbox on the traffic tickets for a missing tag; it had to go in the space for the tag number. The huge number of missing tags made reporting impossible when they each became one row in the summary reports.
As a default, you need to assume that all your columns will be NOT NULL, then go back and decide exactly what a NULL would mean for each particular column. If your column is on a nominal or categorical scale (see this article), you can create encodings for the missing values. For example, the ISO 5218 sex codes are (‘0’ = unknown, ‘1’ = male, ‘2’ = female, ‘9’ = not applicable or lawful person). A lawful person includes things like corporations, governments, and so forth. If a column is a temporal data type, then NULL is often used as a symbol for eternity when marking the open end of (start_timestamp, end_timestamp) intervals that have not closed yet.
Using zero for numeric data elements may or may not work as a missing value token. Blanks or empty strings may or may not work as a missing value for text. Neither of these options have any built-in special characteristics that NULLs have in SQL.
Conclusion
The simple fact is that in SQL, you really can’t escape NULLs. But think of them the way you would think about any other data design decision. Is it necessary? Is it sufficient to express the nature of the model? Is it easily understood by someone who is going to have to maintain this after you’re gone?
Load comments