1 March 2018

8 Comments

1 March 2018

8 Comments

Consider using [NOT] EXISTS instead of [NOT] IN with a subquery (PE019)

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

It used to be that the EXISTS logical operator was faster than IN, when comparing data sets using a subquery. For example, in cases where the query had to perform a certain task, but only if the subquery returned any rows, then when evaluating WHERE [NOT] EXISTS (subquery), the database engine could quit searching as soon as it had found just one row, whereas WHERE [NOT] IN (subquery) would always collect all the results from the sub-query, before further processing.

However, the query optimizer now treats EXISTS and IN the same way, whenever it can, so you’re unlikely to see any significant performance differences. Nevertheless, you need to be cautious when using the NOT IN operator if the subquery’s source data contains NULL values. If so, you should consider using a NOT EXISTS operator instead of NOT IN, or recast the statement as a left outer join.

A recommendation to prefer use of [NOT] EXISTS over [NOT] IN is included as a code analysis rule in SQL Prompt (PE019).

Which performs better: EXISTS or IN….?

There are many ways of working out the differences between two data sets, but two of the most common are to use either the EXISTS or the IN logical operator. Imagine that we have two simple tables, one with all the common words in the English language (CommonWords), and the other with a list of all the words in Bram Stoker’s ‘Dracula’ (WordsInDracula). The TestExistsAndIn download includes the script to create these two tables and populate each one from its associated text file. It is useful, generally, to have tables like these in your sandbox server, for running tests while doing development work, though the book you use is your choice!

How many words occur in Dracula that aren’t common words? Assuming there are no NULL values in the CommonWords.Word column (more on this later), then the following queries will return the same result (1555 words), and have the same execution plan, which uses a Merge Join (Right Anti Semi Join) between the two tables.

Listing 1

In short, the SQL Server optimizer treats either query in the same way, and they will perform the same too.

…or ANY, EXCEPT, INNER JOIN, OUTER JOIN or INTERSECT…?

What about all the other possible techniques, though, such as using ANY, EXCEPT, INNER JOIN, OUTER JOIN or INTERSECT? Listing 2 shows seven further alternatives that I could think of easily, though there will be others.

Listing 2

The Test Harness

All nine queries give the same results, but does any one approach perform better? Let’s put them all in a simple test harness to see how long each version takes! Again, the code download file includes the test harness code, and all nine queries.

As the results show, although the queries look rather different, it’s generally just ‘syntactic sugar’ to the optimizer. However elegant your SQL, the optimizer merely shrugs and comes up with an efficient plan to execute it. In fact, the first four all use the exact same ‘right anti semi merge join’ execution plan, and all take the same amount of time.

We’ll check for variation by running the test several times. The INTERSECT and INNER JOIN queries both used an inner merge join, and were close. The two FULL OUTER JOIN queries were a bit slower, but it was a close race.

The pitfall of NOT IN

There is a certain unreality in comparing sets with null values in them, but if it happens in the heat of everyday database reporting, things can go very wrong. If you have a NULL value in the result of the subquery, or expression, that is passed to the IN logical operator, it will give a reasonable response, and the same as the equivalent EXISTS. However, NOT IN behaves very differently.

Listing 3 demonstrates the problem. We insert three common and three uncommon words into a @someWord table variable, and we want to know the number of common words that aren’t in our table variable.

Listing 3.

The NOT IN query, before we inserted a NULL into @someword, and both the NOT EXISTS queries, all tell us correctly that 60385 words are not in our table variable, because three are, and there are 60388 common words in all. However, if the subquery can return a NULL, then NOT IN returns no rows at all.

NULL really means ‘unknown’ rather than nothing, which is why any expression that compare to a NULL value returns NULL, or unknown.

Logically, SQL Server evaluates the subquery, replaces it with the list of values it returns, and then evaluates the [NOT] IN condition. For the IN variant of our query, this does not cause a problem because it resolves to the following:

This returns 3 rows, for the matches on the ‘z…’ words. The sting comes with NOT IN, which resolves to the following:

The AND condition with a comparison to NULL evaluates as ‘unknown’ , and so the expression will always return zero rows. This isn’t a bug; it is by design. You can argue that a NULL shouldn’t be allowed in the any column where you want to use a NOT IN expression, but in our real working lives, these things can creep into table sources. It is worth being cautious. So, use the EXISTS variant, or one of the others, or always remember to include a WHERE clause in the IN condition to eliminate the NULLs.

Downloads

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

You may also like

  • Article

    Finding code smells using SQL Prompt: TOP without ORDER BY in a SELECT statement (BP006)

    Using TOP in a SELECT statement without a subsequent ORDER BY clause is legal in SQL Server, but meaningless because asking for the TOP 10 rows implies that the data is guaranteed to be in a certain order, and tables have no implicit logical order. You must specify the order. In a SELECT statement, you

  • Article

    Finding code smells using SQL Prompt: Asterisk in SELECT list (BP005)

    Using SELECT * FROM within IF EXISTS statements are fine, but in other contexts it causes several problems. For example, INSERT INTO…SELECT * FROM statement assumes the columns are in a particular order, and that order may not last. Also, selecting all the columns in a table when you don’t need them all can cause

  • Article

    3 Results Grid features SQL Prompt brings to SSMS

    You may not have heard about it yet, but three helpful features were recently added to SQL Prompt that are activated when you right-click on the Results Grid in SQL Server Management Studio (SSMS). 1. Copy as IN clause This feature is available when you have values from a single column selected. You can select

  • Article

    How to Apply Non-Standard SQL Formatting Using SQL Prompt

    I use SQL Prompt to format my code; I’ve defined my standard style that lays out the code just the way I like it, for most cases. I also have a couple of alternative styles for specific tasks, such as comparing two versions of the same code. Occasionally, however, the standard formatting just won’t work

  • University

    Take the SQL Prompt course

    This Redgate University course takes you from installation all the way up to getting the most out of the advanced operations in SQL Prompt. As well as autocompleting your code, you’ll learn how SQL Prompt can help you with code formatting, object renaming, code analysis, and other useful tip and tricks.

  • Forums

    SQL Prompt Forum

    Write, format, and refactor SQL effortlessly

  • Gary Rumble

    “The AND condition with a compassion to NULL evaluates as ‘unknown’”

    I’m interested in the concept of compassionate ands. Do we also have dispassionate ors? Perhaps those would be nots…

    • Tony Davis

      ha! One of our finer typos! It’s fixed (changed to “comparison”) , though I was briefly enamored with the idea of SQL Server doing more compassionate evaluations, based on what the programmer *probably* wanted…

      • Phil Factor

        Compassionate, in terms of taking pity of the struggling programmer

  • SAinCA

    Hi Phil. Thanks for the informative post. Can you state which version of the optimizer started to treat IN and EXISTS similarly (when possible). Having to still labor with 2008R2 instances, I’d like to make sure our code uses EXISTS as a preference until all old SQL versions are upgraded. Thanks!

    • Phil Factor

      We tested it in SQL Server 2008 and they were the same there. Unfortunately I’ve lost my 2000 ad 2005 instance

      • SAinCA

        Thanks. Duly noted.

  • Mike

    I generally like to avoid IN, especially Not In with any data modification queries except with hard wired values. I’ve had odd behaviour due to nulls in the past.

    But here’s a new one that scares me… how the hell can this happen…

    Create Table TestDeleteRows(Column1 int, Column2 nvarchar(10))
    Create Table TestDeleteRowsValue(Column1 int)

    insert into TestDeleteRows
    Values (1, ‘1st value’), (2, ‘2nd value’), (3, ‘3rd value’), (4, ‘4th value’), (5, ‘5th value’)
    insert into TestDeleteRowsValue
    Values(1),(3)

    Delete from TestDeleteRows
    Where Column1 in(Select Column1 from TestDeleteRowsValue)
    –2 rows as expect
    –run it again
    Delete from TestDeleteRows
    Where Column1 in(Select Column1 from TestDeleteRowsValue)
    –0 rows as expected

    EXEC sp_rename ‘TestDeleteRowsValue.Column1’, ‘Column11111’, ‘COLUMN’;

    –and now run it again
    Delete from TestDeleteRows
    Where Column1 in(Select Column1 from TestDeleteRowsValue)
    –3 rows deleted, although the sub query in the In clause won’t compile the outer query still executes and for some reason matches every row

    • Mike

      Hang on, just realised, it now compiles the sub query using the outer queries Column1 table name. Still dangerous in my eyes as it makes no sense to write a query in that manner.

      Mostly if using IN with a sub query you would be referencing a foreign key which 95+% of the time (for me) has the same name as the primary key, but if rushed (usually) those 5% of the times could catch you out.

      Although it seems this behaves similarly for exists…

      Delete from TestDeleteRows
      Where exists (Select Column1 from TestDeleteRowsValue)

      …it is never written this way (because again it makes no sense) but instead something like…

      Delete from TestDeleteRows
      Where exists (Select * from TestDeleteRowsValue tdrv where tdrv.Column3 = column1)

      and so not likely to occur.

  • Pingback: How to Write a Better T-SQL Code – Igor Micev()