Product articles SQL Prompt SQL Code Analysis
The ‘= NULL’ Mistake and…

17 June 2019

2 Comments

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

17 June 2019

2 Comments

The ‘= NULL’ Mistake and other SQL NULL Heresies

The SQL Prompt Best Practice rule checks whether a comparison or expression includes a NULL literal ('NULL'), which in SQL Server, rather than result in an error, will simply always produce a NULL result. Phil Factor explains how to avoid this, and other SQL NULL-related calamities.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

SQL Prompt has a code analysis rule (BP011) that checks whether a comparison or expression includes a NULL literal ('NULL'). These will always produce a NULL result. To determine whether a datatype is, or isn’t, NULL, use IS NULL or IS NOT NULL.

Beyond that, working with a nullable datatype in an expression requires use of the COALESCE(), ISNULL() or NULLIF() functions, as appropriate, to deal with NULL values safely.

What is NULL?

SQL allows any datatype to have a NULL value. This isn’t the same as a blank string or a zero integer. It boils down to the meaning ‘Unknown’. If you ask SQL theoreticians, they’ll say that it represents missing information and inapplicable information.

Having a NULL string, for example, isn’t the same as knowing that you have a blank string. It means you don’t know what is in the string. If you are doing a database of your ancestors, and you put the value of NULL in for your great grandmother, it means that you don’t yet know her name, but certainly doesn’t imply that you haven’t got a great grandmother. NULL is a state, not a value. It is useful to have a state for data, and the father of the relational model, E.F.Codd, wanted two of them: ‘A-Values’ and ‘I-Values’, representing ‘Missing But Applicable’ and ‘Missing But Inapplicable’. Sadly, there is but one.

NULLs in a comparison

If you’re not used to the concept of ‘null’, and you treat it like a number such as pi, then strange perplexing things will happen with the result of SQL Expressions. Many developers get into trouble when they want a WHERE clause that filters on a column with a NULL value in it.

Imagine you have a database of Irish saints, each with their saints’ days, contained in a date column. If a saint doesn’t have a saint’s day, then this column is NULL , for that row.

The full table build with 190 saints (there are plenty more) is included with this article (CreateIrishSaintsDays) so you can play along.

You want to find all the saints that don’t have a saint’s day:

This seems to go well. No rows are returned, which surely means there are no Irish saints without a saint’s day, which is a relief. Wrong! If you are using SQL Prompt, you will see a squiggly green line under that ISD.saintsDay = NULL expression.

On investigation you will see this…

The expression ISD.saintsDay = NULL means “where the saints day is equal to something unknown“. This sort of expression must return NULL, regardless of whether you compare it to a value or to another NULL. Putting a NULL in the value for the saints day could mean we don’t know the saint’s day, or it could mean that the saint never had his or her day, so how can one ever say that something unknown is the same as something else unknown?

To be accurate, the SQL Standard says that this use of WHERE <expression> = NULL is invalid syntax and shall lead to an error message or an exception. Unfortunately, in SQL Server it doesn’t: it just returns ‘incorrect’ results.

What you meant to ask is whether the value in the datatype is unknown, not whether it is equal to something that is unknown. In other words, the SQL standard insists that you cannot use = NULL, or <> NULL, and instead should use IS NULL or IS NOT NULL.

We change it…

NULLs in expressions

It isn’t just a comparison with NULL that will always return NULL. When you do arithmetic on numeric datatypes, and one of the operands or arguments is NULL, it is likely that the result will be NULL. Imagine we were to try running this query:

Yes, it returns NULL and you get the warning from SQL Prompt with the squiggly green line for the BP011 warning. However, it is an odd mistake to make.

What is more likely is that a developer creates a variable and forgets to initialize it.

The result will be null. Obviously, if you have seven apples and you multiply that by unknown quantity of apples, then the result is an unknown quantity.

We hurriedly initialize the variable with a value.

This returns 3.14285714285714 (which isn’t quite pi, I know, but doing the Gregory-Leibniz Nilakantha approximation might be a distraction).

So, why didn’t SQL Prompt give a squiggly line when we used a null variable in an expression? It will only protest if you use a null literal (NULL) in either an expression or a condition. It is doing a static test so cannot detect if you have a NULL value in a variable. Even if it could do so, it wouldn’t be a good thing to flag, because then we’d get warning on code that uses nullable variables in the useful and valuable ways that you intended. NULLs can save a lot of tedious work in building strings from components, some of which are optional (concatenation); it also makes building lists easier and in creating reports where the data isn’t continuous.

NULLs in string concatenations

How could we use NULL to our advantage? Well, to show you an obvious example, we’ll use the person.person table in AdventureWorks that has the various parts of what we understand to be a person’s name.

Some of these constituents must be there (NOT NULL) , such as FirstName and LastName, but others can be NULL, meaning you just don’t know. You may not know a person’s preferred title of address, middle names or suffix. If you know that someone doesn’t have a middle name, for example, you can legitimately put a blank string. Alternatively, for a part of the name that isn’t necessary for searching and sorting, or identification then entering ‘unknown’ is a reasonable option. Had the creators of the data known definitively that no-one had suffixes, for example, then there would have been no need to use a nullable column to store it.

We are told by the moody boss that he needs a list of all names in the Adventureworks database, not just first and last names but all parts of the name that we have. We’ll do a subset of the ten most recent customers.

That didn’t go so well did it? Again, we didn’t get BP011 warning because SQL Prompt, as discussed, merely detects NULL literals in expressions or conditions.

There were NULLs in every row (the suffix was, in every case, unknown) and they ‘propagated’ to create a resulting NULL. It makes sense. If you concatenate strings and you don’t know what is in one of the strings, the result of the concatenation must indeed be NULL. From the perspective of SQL Server, you asked a silly question and got a silly, but correct, answer.

What you meant to say was ‘if you don’t know it, leave that bit out’, and for this task, the COALESCE function is your friend. You just give it a list of parameters. It evaluates each in turn until it finds one that is not NULL, and it returns this as a value. If they are all NULL, it returns a NULL. In this case, we can use it to return a blank string if the value in the column is NULL. Its potential power in dealing with NULLs is great.

After your excitement dies down, you notice that the list doesn’t left-align; there are extra spaces here and there that will trip you up subsequently. You could have tidied this up with a mix of LTRIM, RTRIM and REPLACE, but then other database developers would smirk at it when doing code reviews. So, you do this:

Suddenly, you are over the pain barrier and letting NULLs work for you. If there is a value in the Title column, you add a space to it and return it. If the Title column has no value, then NULL, when added to a string with a space in it, still results in a NULL, and so the COALESCE function examines the next parameter in the list and returns the blank string. Some columns in the list are nullable and some aren’t, so we can put spaces, or any other delimiter we want, between the column values, according to whether they are NULL. We’ve done all this with only slight changes.

However, the moody boss is not happy. It turns out he wanted a comma-delimited list. OK. We now know all about dealing with delimiters because a space character is a delimiter of sorts.

We create a variable, which by its very nature is nullable. We want a comma delimiter between each string and if we just join strings together that we’ve added to a ‘,’ delimiter, we get a comma at the head of the list. To remove it, we use the COALESCE function to make sure that the first name isn’t preceded with a comma. If the variable has never been set it is NULL, so it is easy to have special conditions for the first string to be added to a variable. If you add a comma to a NULL you get a NULL, so before any values are inserted into the variable, you won’t get the initial delimiter.

Which gives the result that we want:

However, the moody boss now comes up with an odd request. He only wants people in the list who have a single initial, for middle name, or nothing at all. Those with full middle names will be dealt with in a separate list.

Simple. We just exclude those whose middle-names are more than one character long:

Unfortunately, this does not work. It gives us all those with single initials (so we no longer see any who have a single initial followed by a dot). However, we’ve lost all the people with no middle name:

If the middle name is NULL, then the answer to “is NULL less that 2” is unknown, and the row is excluded. COALESCE() would again have come to the rescue.

This now gives the correct answer.

COALESCE() is by far the most useful function for dealing with nullable columns or variables safely. However, there are two other functions that I’ll mention just briefly:

  • ISNULL() – does less than COALESCE() and could almost have been devised to confuse the unwary because it looks like IS NULL, which means something else. It is also not standard SQL.
  • NULLIF() – returns NULL if two expressions passed as parameters are equal. This is SQL Standard, but is just syntactic candy on top of the CASE expression, which is easier to understand.

For more details see How to Get NULLs Horribly Wrong in SQL Server.

NULLs in aggregate values

Now, if you are adding up a list of numbers, you will want the NULLs to propagate. If you add a list of numbers, and you don’t know one of the numbers, then the result should be NULL, meaning that there is no way of knowing what the sum of that list is. Let’s put this to the test.

Surprisingly, from what we’ve learned so far, this SQL aggregation expression returns 11 for the sum of the first column rather than NULL…

…but it will warn you that NULL values were eliminated.

Why was the NULL ignored rather than propagated? The SQL Standard, I believe, dropped the ball. Officially, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL”.

The SQL Standard decided that all aggregate functions except COUNT(*) and grouping functions should ignore NULLs. The COUNT(*) function returns either a number or zero. All the other aggregate functions will ignore rows that contain NULL and return NULL only if the data set contains no rows or contains only rows with NULLs as arguments to the aggregate function.

Ignoring NULLs is not the same as replacing NULL with zero. We can demonstrate this with our existing example by getting the average calculated in different ways.

For the TheValue column, which had a NULL value, the calculated average wasn’t the same as the average produced by the aggregate function. Why was that? The COUNT(*) returned the number of rows whether or not they contained null values, whereas the AVG function ignored the row if the column contained a NULL.

What I should have done was to specify the column in the COUNT() aggregate. Then it can ignore rows whenever that column has a NULL in it:

OK. We’re feeling less confused now. When you ask for the sum or average, you are asking for all non-NULL values in the column to be aggregated. If you are doing calculations which rely on the value of the count of the values, you must always specify the column in the COUNT() aggregate so as to count only the non-null rows for that column.

NULLs with grouping, sorting and deduplicating

The SQL Standard says that NULL is a state of a datatype rather than a value. So, how then do you sort data? You might expect from the behavior of NULL values in expressions that sorting by a nullable column will result in those rows that have a NULL value in a sort column being scattered randomly around the sort order. Not so. The NULLs are sorted as if they were the same value, all together. Where in the sort order should they be? The SQL Standard allows you to specify whether they float to the top or sink to the bottom of the sort order with the keywords NULLS FIRST and NULLS LAST. SQL Server is always NULLS FIRST.

We can try this with the saints of Ireland and their saint’s days.

…and…

You will find that in both cases, the result starts with NULLs.

Like sorting, grouping is done as if all NULL values are considered equal, in contrast to the way that they are treated in an expression. They are collected into a single group.

62 saints have an unknown or non-existent saint’s day.

What about UNION when a column has NULLs in it? The UNION operation will remove all duplicate rows. However, what if two rows both have NULLs for one or columns but are otherwise-identical? If the database engine were unable to compare NULLs, then it would be unable to say ‘deduplicate’ during UNION operations. We can easily test this.

The result is still 190, which is the number in the original table. The UNION operation compared rows with NULLs but were otherwise the same, decided they were identical and removed the duplicate row!

If we had done a UNION ALL instead of a UNION, we’d have found that there were 272 rows because UNION ALL doesn’t remove duplicates from the result.

SQL also gives you the answer you expect ,if you use DISTINCT:

Again, only 190 saints are found.

To cut a long story short, whether you use PARTITION BY, DISTINCT, GROUP BY, UNION, INTERSECT, ORDER BY and EXCEPT, SQL plays nice, figures out the answer you expect, and treats NULL values as if they were a single value, in contrast to the way that it treats then in expressions or comparisons.

You will, however, be thinking to yourself, like professor Ron van der Meyden, ‘The inconsistencies in the SQL standard mean that it is not possible to ascribe any intuitive logical semantics to the treatment of nulls in SQL‘.

Summary

NULL values are an essential part of the relational database, because data can be both intentionally and unintentionally missing, and it can be unknown.

NULL can be assigned, but using ‘= NULL‘, ‘<> NULL‘, or any other comparison operator, in an expression with NULL as a value, is illegal in SQL and should trigger an error. It can never be correct. The expression will always return NULL.

Comparisons or arithmetic involving either a variable datatype or a nullable column are unlikely to be safe unless used with COALESCE(), NULLIF() or ISNULL(). However these functions can be very handy, if used guardedly.

Grouping, sorting and ‘uniquifying’ will behave intuitively by ignoring rows where the columns being specified have NULLs in them, but when doing sums and averages, for example, care needs to be taken to specify the column in the COUNT() aggregate, rather than using the '*' parameter.

NULLs can trip up a programmer, yet provide a very versatile way of handling data, providing a range of software devices for dealing successfully with commercial database processes and reports. Don’t avoid NULLs; instead, get familiar with them and their use.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter
  • DotaThinker

    I was expecting this article to mention the RELEVANT SET options (CONCAT_NULL_YIELDS_NULL, ANSI_NULLS).
    Also wouldn’t CONCAT function be better for merging the name parts instead of COALESCING each part?

    • Phil Factor

      Thank you for your helpful and interesting comments. I’m afraid that I can’t see how CONCAT() would perform the same trick, but if you are on SQL Server 2017 or higher, CONCAT_WS() can can cleanly concatenate strings that might have NULL values just as long as the separation string is the same in all cases. (It would work with my example). The reason I did it the old way was because I wanted just to illustrate how NULLs could be useful in formatting columns or variables into strings correctly. I’ll add a note to remind readers to consider CONCAT_WS for doing this example.
      I chose not to mention CONCAT_NULL_YIELDS_NULL and ANSI_NULLS because they are both deprecated, but I can see that it might be an idea to warn people against the confusing consequences of someone using them to turn ISO SQL Compliance off..

You may also like

  • Article

    SQL Prompt Tip: how to control when the suggestion box pops up

    By default, SQL Prompt shows code auto-completion suggestions automatically, and continuously. Phil Factor shows how to control this behavior, for the times when you need it to be a little less intrusive, such as when working through more intricate coding problems that require careful thought.

  • Article

    How to implement a T-SQL code formatting standard using SQL Prompt

    To what extent should a team attempt to impose a uniform SQL coding format? It can sometimes seem like a futile task. All programmers have a preferred formatting style, to which they profess an unwavering attachment. Check the code of ten programmers, though, and you will likely discover at least 15 different formats, since few

  • Article

    Using a Variable-length Datatype Without Explicit Length: The Whys and Wherefores (BP007/8)

    If you declare a variable-length string , or coerce a string, without specifying its length, you can fall foul of ‘silent’ string truncation. Some developers resort to using the (MAX) specification, which is a mistake too. Phil Factor explains the dangers and then offers a workaround for the problem, when you're importing text and simply don't know the correct length of each string.

  • Community event

    Raleigh Code Camp

    Redgate will be remotely sponsoring the Raleigh Code Camp 2019. Code Camp is a place for local software developers to come and learn from their peers. This community-driven event has become an international trend where peer groups of all platforms, programming languages and disciplines band together to bring content to the community.