{"id":82836,"date":"2019-01-10T16:22:05","date_gmt":"2019-01-10T16:22:05","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=82836"},"modified":"2021-06-03T16:48:19","modified_gmt":"2021-06-03T16:48:19","slug":"empty-thoughts-working-with-null","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/empty-thoughts-working-with-null\/","title":{"rendered":"Empty Thoughts: Working with NULL"},"content":{"rendered":"<p>One of the hardest concepts in learning SQL is the meaning of a NULL. Traditionally, programming languages had no concept of missing or <em>UNKNOWN<\/em> data. The closest example that most programmers ran into was the \u2018not applicable\u2019 flags in spreadsheets, or the classic <em>TBD<\/em> flags for undetermined instructors, locations, or other things in print outs.<\/p>\n<p>Dr. Codd defined a NULL as the lack of a value, so talking about <em>NULL values<\/em> is wrong. In SQL, however, we have to worry about <em>physical<\/em> storage. This means we have to know the data type of the column which is holding the NULL so the compiler can do its job. From that requirement, it logically follows that we can write <code>CAST(NULL AS &lt; data type&gt;)<\/code> and not just depend on automatic type conversions.<\/p>\n<p>SQL is notorious for its three-value logic <code>{TRUE, FALSE, UNKNOWN}<\/code> which results from trying to compare a NULL to something, including another NULL. Since it is not a value, <code>NULL &lt;&gt; NULL<\/code> is <em>UNKNOWN<\/em>, but likewise so is <em>NULL = NULL<\/em>! This why we have the predicate <code>&lt;expression&gt; IS [NOT] NULL<\/code> to check for a NULL. This predicate and the <code>EXISTS()<\/code>function are some of the few predicates in SQL that can only return {TRUE, FALSE}.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">-- Returns 'NULL IS NULL\u2019\r\nIF NULL IS NULL\r\n PRINT 'NULL IS NULL'\r\nELSE\r\n PRINT 'NULL IS NOT NULL';<\/pre>\n<h2>NULLs in DDL<\/h2>\n<p>SQL defines a <code>PRIMARY KEY(&lt;column list&gt;)<\/code> as being implicitly declared <em>NOT NULL<\/em>. It is probably a good idea to go ahead and put the <em>NOT NULL<\/em> in your table declarations anyway. If the PRIMARY KEY constraint changes, then you are still safe. According to Dr. Codd and relational theory, to be a real table it must have a key. Originally, Dr. Codd said a PRIMARY KEY had to be designated, but later realized that a key is a key, so there\u2019s no need to designate something special about one of them. This was another leftover from trying to implement relational systems on top of old file systems. In file systems, records (which are nothing like rows) come in a linear search sequence in physical storage, so tapes had to be sorted. The original sort key became the PRIMARY KEY in the new SQL products. Random access on unsorted magnetic tapes is technically possible, but it really doesn\u2019t work.<\/p>\n<p>But the relational model accommodates multiple keys in the same table. The syntax we picked for non- primary keys was a little strange. We added the <code>UNIQUE(&lt;column list&gt;)<\/code> constraint which guarantees that all the rows in the table will be different. It also allows more than one column, but the columns involved can have NULLs. Remember that in the relational model, a key can\u2019t have NULLs, But you\u2019re only allowed to have one NULL-ed row, as if it were a value.<\/p>\n<p>All of these strange rules come from the <code>GROUP BY<\/code> clause. Without lapsing into a college algebra lesson, we have two equivalence relations, as they are called in set theory. The first is just regular old equals (=) with the extra rules about NULLs. The second relation is the <code>GROUP BY<\/code>, in which all the NULLs are put into one equivalence class.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE TABLE Foobar\r\n(foo_id CHAR(5) NOT NULL PRIMARY KEY, \r\n a1 INTEGER, \r\n a2 INTEGER, \r\n a3 INTEGER, \r\nCHECK ((a1 +a2 +a3) &lt; 10)\r\n);\r\nINSERT INTO Foobar VALUES ('test1', 1, 1, 1); --- works!\r\nINSERT INTO Foobar VALUES ('test2', 10, 10, 10); --- fails!\r\nINSERT INTO Foobar VALUES ('test3', NULL, 1, 1); --- works!<\/pre>\n<p>However, in the DDL, the <em>UNKNOWN<\/em> result of the search condition in a <code>CHECK(&lt;search condition&gt;)<\/code> constraint is given the \u2018benefit of the doubt\u2019 and treated the same as <em>TRUE<\/em>.<\/p>\n<h2>NULLs in DML<\/h2>\n<p>When you use it in the DML statements, it treats the <em>UNKNOWN<\/em> result the same as a <em>FALSE<\/em>. Query updates and inserts have a stronger criterion.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT foo_id\r\n FROM Foobar\r\nWHERE (a1 +a2 +a3) &lt; 10;<\/pre>\n<p>returns only <em>test1<\/em> as a result. The <em>test2<\/em> row failed to insert. The <em>test3<\/em> row became<\/p>\n<p>\u201cWHERE (NULL + 1 +1) &lt; 10\u201d<\/p>\n<p>\u201cWHERE (NULL + 2) &lt; 10\u201d<\/p>\n<p>\u201cWHERE NULL &lt; 10\u201d<\/p>\n<p>\u201cWHERE UNKNOWN\u201d or \u201cWHERE FALSE\u201d<\/p>\n<p>Replacing NULLs with actual values is very often handy. The SQL Server\/Sybase family originally had the function <code>ISNULL(&lt;expression&gt;, &lt;non-NULL value&gt;)<\/code>. If the <code>&lt;expression&gt;<\/code> was NULL then it returns the <code>&lt;value&gt;<\/code>. The data type of the result is taken from the first parameter. That particular choice can make for some funny results when parameters are not the same data types.<\/p>\n<p>Microsoft now has the ANSI\/ISO Standard <code>COALESCE(&lt;expression list&gt;)<\/code>. The first thing to notice is that <code>COALESCE<\/code> takes a <em>list<\/em> of expressions, which it then parses left to right to determine the highest data type in the list, which becomes the data type of the result. The list is then parsed again from left to right to find the first non-NULL value, which is returned and cast to the result data type. One common mistake beginners make with this is to put a higher data type in the list when it\u2019s not what they really meant.<\/p>\n<p>I\u2019m often asked why we chose the word \u201ccoalesce\u201d instead of something else. We were trying to come up with the word that would be descriptive but not so common that it might be misunderstood or used as a column name. Phil Shaw of IBM pulled out a pocket thesaurus and started going down synonyms until he came up with this.<\/p>\n<h2>Grouping NULLs<\/h2>\n<p>A common way to use the grouping relation to return <em>TRUE<\/em> if two expressions are both NULL or their values match:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CASE WHEN \r\n COALESCE(foobar, 'weird value that does not occur in the database') \r\n = COALESCE(barfoo, 'weird value that does not occur in the database')\r\n THEN 0\r\n ELSE 1\r\n END = 0<\/pre>\n<p>What if there is no \u2018weird value\u2019 you can use for the COALESCE check? You just need to add more search conditions:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CASE WHEN (foo = bar OR (foo IS NULL AND bar IS NULL))\r\n THEN 0\r\n ELSE 1\r\n END = 0<\/pre>\n<p>While SQL Server does not yet have this construct, the SQL Standards added another comparison operator for this problem. This feature was introduced in two steps: SQL:1999 added T151, <code>DISTINCT<\/code> predicate. The optional negation with <code>NOT<\/code> was added by SQL:2003 as feature T152, <code>DISTINCT<\/code> predicate with negation.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">&lt;expression&gt; IS [NOT] DISTINCT FROM &lt;expression&gt;<\/pre>\n<p>Note that you have to use the negated form to get the results you want. The un-negated form is not really equality. This is easy to see with a truth table:<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p>A<\/p>\n<\/td>\n<td>\n<p>B<\/p>\n<\/td>\n<td>\n<p>A = B<\/p>\n<\/td>\n<td>\n<p>A IS NOT DISTINCT FROM B<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>0<\/p>\n<\/td>\n<td>\n<p>0<\/p>\n<\/td>\n<td>\n<p>TRUE<\/p>\n<\/td>\n<td>\n<p>TRUE<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>0<\/p>\n<\/td>\n<td>\n<p>1<\/p>\n<\/td>\n<td>\n<p>FALSE<\/p>\n<\/td>\n<td>\n<p>FALSE<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>0<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>UNKNOWN<\/p>\n<\/td>\n<td>\n<p>FALSE<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>UNKNOWN<\/p>\n<\/td>\n<td>\n<p>TRUE<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h2>Set Operators and NULLs<\/h2>\n<p>SQL has set operators (<code>UNION<\/code>, <code>INTERSECT<\/code>, <code>EXCEPT<\/code>) which work on table expressions. Both tables have to be what we call \u2018union compatible,\u2019 which means the tables have the same structure (the corresponding columns in each table are in the same order and have compatible datatypes) , and the result will have that structure. Most people don\u2019t know that technically the result table does not have a name nor do the columns unless you actually assigned them with a <code>&lt;set expression&gt; AS &lt;table name&gt; (column name list&gt;)<\/code> construct.<\/p>\n<p>Set operators discard the duplicate rows and use the grouping rather than the equality relationship to discard multiple NULLs as well as duplicate values. The <code>EXCEPT<\/code> and <code>INTERSECT<\/code> operators also work in this way.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">CREATE TABLE T1 (x INTEGER);\r\n INSERT INTO T1 VALUES (NULL);\r\nCREATE TABLE T2 (y INTEGER);\r\n INSERT INTO T2 VALUES (NULL);\r\n-- Returns one NULL result \r\n SELECT * FROM T1 \r\nUNION \r\nSELECT * FROM T2;\r\n-- Returns one NULL result \r\nSELECT * FROM T1 \r\nINTERSECT \r\nSELECT * FROM T2;\r\n-- Returns an empty table\r\nSELECT * FROM T1 \r\nEXCEPT \r\nSELECT * FROM T2;<\/pre>\n<h2>OUTER JOIN and NULLs<\/h2>\n<p><code>OUTER<\/code> <code>JOINs<\/code> come in three flavors (<code>LEFT<\/code>, <code>RIGHT<\/code>, <code>FULL<\/code>) and were designed to solve an actual problem. Before the SQL-99 Standard, there was no standardized syntax for them nor a Standard definition of how they would work. Sybase in SQL Server did it one way and Oracle did it another way, then there was a product from a company called Gupta Technologies LLC that let you pick which one you wanted to use. I\u2019m going to assume that everyone knows how an outer join works. The \u2018preserved table\u2019 is the one on the left (or right or both) side of the join operator, and the unpreserved table, if any, is the one on the opposite side. Since all datatypes in SQL must be NULLable, the values that did not match the join condition in the preserved table can be padded out with NULLs. It doesn\u2019t matter if the original columns in the preserved table were declared <code>NOT<\/code> <code>NULL<\/code> because the result of the join is technically a whole new table.<\/p>\n<h2>OLAP and NULLs<\/h2>\n<p>When OLAP (online analytical processing) databases first came in, ANSI responded by defining some basic hierarchical aggregations in ANSI\/ISO Standard SQL-99. They are defined as extensions to the <code>GROUP<\/code> <code>BY<\/code> clause. The original three were <code>GROUPING<\/code> <code>SET<\/code>, <code>ROLLUP<\/code> and <code>CUBE<\/code>. The last two are defined in the standard using the <code>GROUPING<\/code> <code>SET<\/code> construct. Rollup and cube are often called \u2018super groups\u2019 because they could be defined using the regular <code>GROUP<\/code> <code>BY<\/code> and <code>UNION<\/code> operators. As expected, the NULLs form their own group. However, we now have a special function, <code>GROUPING<\/code> <code>(&lt;column reference&gt;)<\/code>, which returns a one if the column was created by the operation or zero otherwise. Then SQL-2003 added a multicolumn version, <code>GROUPING<\/code>_<code>ID<\/code>, that constructs a binary number from the zeros and ones in the columns in the list, using an implementation defined exact numeric data type; this is handy as you think.<\/p>\n<p>These constructs allow you to do what we used to call \u2018control break reports\u2019 back in the pre-SQL days. You put a sequential file in sorted order and pass it through a program that would keep running totals in accumulators (an old term that actually used to refer to a physical feature in unit record equipment). When the controls (the columns at various levels in the hierarchy) changed, the accumulators were printed out and reset. Most of the time, frankly, this was doing running totals.<\/p>\n<p>This is probably easier to see with an example. Imagine we have a file that gives us a region number and the city name along with the total sales for that city. We want to get a report that shows us the totals by region, the totals by city within the regions, and finally, a grand total for all sales in the company.<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT region_nbr, city_name, SUM(sale_amt) AS sales_tot\r\n FROM Regional_Sales\r\nGROUP BY ROLLUP (region_nbr, city_name)\r\nORDER BY region_nbr, city_name; <\/pre>\n<p>That\u2019s assuming that there is some sample data, the output might look like this (the right-hand most column is a comment not part of the output)<\/p>\n<p>&nbsp;<\/p>\n<table>\n<tbody>\n<tr>\n<td>\n<p>&#8230;<\/p>\n<\/td>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>comment<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>&#8216;006&#8217;<\/p>\n<\/td>\n<td>\n<p>\u2018Austin\u2019<\/p>\n<\/td>\n<td>\n<p>500.13<\/p>\n<\/td>\n<td>\n<p>\uf0ac\uf020city within region total<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>&#8216;006&#8217;<\/p>\n<\/td>\n<td>\n<p>\u2018Dallas\u2019<\/p>\n<\/td>\n<td>\n<p>2010060.5<\/p>\n<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td>\n<p>&#8216;006&#8217;<\/p>\n<\/td>\n<td>\n<p>\u2018San Antonio\u2019<\/p>\n<\/td>\n<td>\n<p>475.01<\/p>\n<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>&#8230;<\/p>\n<\/td>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>&#8216;006&#8217;<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>1190902.75<\/p>\n<\/td>\n<td>\n<p>\uf0ac\uf020region total<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>\u2026<\/p>\n<\/td>\n<td>\n<p>&#8230;<\/p>\n<\/td>\n<td>&nbsp;<\/td>\n<\/tr>\n<tr>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>NULL<\/p>\n<\/td>\n<td>\n<p>3426563.75<\/p>\n<\/td>\n<td>\n<p>\uf0ac\uf020grand total<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The general rule in SQL is not to do data formatting in the database tier. You pass the results of the database layer to a presentation layer and that layer adds the colors, labels, does any weird filters and calculations, etc. The purpose of the database tier is just raw data. However, having said that you can write something like this:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT CASE WHEN GROUPING(region_nbr) =1\r\n                           THEN \u2018Region Total\u2019 ELSE region_nbr END,\r\n                CASE WHEN GROUPING(city_name) = 1\r\n                            THEN \u2018City Total\u2019 ELSE \u2018City Name\u2019 END, \r\n               SUM(sale_amt) AS sales_tot\r\n FROM Regional_Sales\r\nGROUP BY ROLLUP (region_nbr, city_name)\r\nORDER BY region_nbr, city_name; <\/pre>\n<p>Again, you are doing something that is not recommended.<\/p>\n<h2>Avoiding NULLs<\/h2>\n<p>We debated this in the early days of ANSI X3H2. Some of the early products use the regular equality for their groupings, so each NULL became its own group. This did not work out so well. The example we had in the committee was a table of traffic tickets issued in California. Quite logically, the database designer used NULL for missing auto tags; the words <em>none<\/em>, <em>nothing<\/em>, <em>missing<\/em>, etc., (in multiple languages!) had been used on prestige tags, and the way the law was written, they were perfectly legal. There was no special checkbox on the traffic tickets for a missing tag; it had to go in the space for the tag number. The huge number of missing tags made reporting impossible when they each became one row in the summary reports.<\/p>\n<p>As a default, you need to assume that all your columns will be NOT NULL, then go back and decide exactly what a NULL would mean for each particular column. If your column is on a nominal or categorical scale (see this <a href=\"https:\/\/www.sqlservercentral.com\/steps\/stairway-to-data-level-5-types-of-scales-part-i\">article<\/a>), you can create encodings for the missing values. For example, the ISO 5218 sex codes are (\u20180\u2019 = unknown, \u20181\u2019 = male, \u20182\u2019 = female, \u20189\u2019 = not applicable or lawful person). A <em>lawful person<\/em> includes things like corporations, governments, and so forth. If a column is a temporal data type, then NULL is often used as a symbol for <em>eternity<\/em> when marking the open end of (start_timestamp, end_timestamp) intervals that have not closed yet.<\/p>\n<p>Using zero for numeric data elements may or may not work as a missing value token. Blanks or empty strings may or may not work as a missing value for text. Neither of these options have any built-in special characteristics that NULLs have in SQL.<\/p>\n<h2>Conclusion<\/h2>\n<p>The simple fact is that in SQL, you really can\u2019t escape NULLs. But think of them the way you would think about any other data design decision. Is it necessary? Is it sufficient to express the nature of the model? Is it easily understood by someone who is going to have to maintain this after you\u2019re gone?<\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Whether or not to have NULLable columns in a table can be a religious debate, and how missing data is represented should be carefully considered during database design. In this article, Joe Celko considers the ways that SQL Server handles NULLs in several situations.  &hellip;<\/p>\n","protected":false},"author":96214,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143531],"tags":[5134],"coauthors":[6781],"class_list":["post-82836","post","type-post","status-publish","format-standard","hentry","category-featured","category-t-sql-programming-sql-server","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/82836","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/96214"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=82836"}],"version-history":[{"count":8,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/82836\/revisions"}],"predecessor-version":[{"id":84736,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/82836\/revisions\/84736"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=82836"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=82836"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=82836"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=82836"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}