How to Avoid Conditional JOINs in T-SQL


Just because certain SQL JOINs are possible doesn’t mean that they are a good option.  The ‘conditional JOIN’ is one of these.

So what is a conditional join? It is a join in which the key of one table is matched against one of several tables, or one of several columns, depending on some criterion. It sounds extraordinary, but it can happen, particularly in highly complex, sometimes auto-generated queries. It usually represents an attempt to implement a polymorphic association between tables, and it is a SQL Code Smell.

The term ‘conditional JOIN’ is somewhat ambiguous, in that people refer to different problems when they use it. Programmers generally mean one of two things: joining a table to one of several other tables, or joining a column in one table to one of several columns in a second table, with the JOIN column chosen by a condition.

Since T-SQL has no syntax that allows a table name to be placed inside a CASE expression, the first form of the conditional JOIN can only be resolved by joining all of the candidate tables (most likely as LEFT OUTER JOINs) and using CASE expressions to pull the specific data item(s) required from whichever secondary table is appropriate.

Recently I was writing a query that fell into the second case.  While I thought about several other ways to do it, from a coding standpoint a CASE expression in the ON clause of the JOIN gave the simplest syntax that made the query work.  But was the syntax that looked simplest also the most efficient?

Let’s face it, sometimes in programming you don’t have the luxury of standing back and redesigning the system to avoid the requirement, but it is always best if we can optimize the performance of the SQL that meets the requirement.

Sample Data

We’ll start with a simple example to illustrate a conditional join.  This code creates two tables, with one million rows of test data in the secondary table that we’ll be doing the conditional JOIN to.

You can see that the four data columns (N1, N2, N3 and N4) contain random numbers in the range 1 to 100.  We’ve selected the TOP 10 rows just to illustrate, and if you’re following along you’ll get different but similar results in the second results set.  The commented-out DROPs are provided to clean up your sandbox later if you want to run these examples on your server.  Note that we’ll be running them using SQL 2012, but SQL 2008 or SQL 2005 can also be used.
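The original setup listing isn’t reproduced here, but a script along these lines matches the description (the table names, and the manner of generating the million rows, are my assumptions; only the Num and N1–N4 column names come from the article):

```sql
-- Left table: four rows, each holding a value in the range 1 to 100
CREATE TABLE dbo.LeftTable (Num INT NOT NULL PRIMARY KEY);
INSERT INTO dbo.LeftTable (Num) VALUES (10), (35), (53), (90);

-- Secondary table: one million rows of random numbers from 1 to 100
CREATE TABLE dbo.BigTable (
    ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
    N1 INT NOT NULL, N2 INT NOT NULL, N3 INT NOT NULL, N4 INT NOT NULL
);

WITH Tally (n) AS (
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a CROSS JOIN sys.all_columns b
)
INSERT INTO dbo.BigTable (N1, N2, N3, N4)
SELECT 1 + ABS(CHECKSUM(NEWID())) % 100
      ,1 + ABS(CHECKSUM(NEWID())) % 100
      ,1 + ABS(CHECKSUM(NEWID())) % 100
      ,1 + ABS(CHECKSUM(NEWID())) % 100
FROM Tally;

-- A peek at the test data
SELECT TOP 10 * FROM dbo.BigTable;

--DROP TABLE dbo.LeftTable;
--DROP TABLE dbo.BigTable;
```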

Scenario 1: Conditional JOIN Based on Data in the Left Table

Now suppose we have a business requirement that states we want to perform a JOIN from the left table (4 rows) to the secondary table based on the range that the value in the left table falls into.  Such a JOIN may look like this.
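The original query isn’t shown here, but a conditional JOIN matching that description would look something like this sketch (the table names, and the mapping of each 25-value range to one of N1–N4, are assumptions consistent with the results described below):

```sql
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b
  ON l.Num = CASE WHEN l.Num BETWEEN  1 AND  25 THEN b.N1
                  WHEN l.Num BETWEEN 26 AND  50 THEN b.N2
                  WHEN l.Num BETWEEN 51 AND  75 THEN b.N3
                  WHEN l.Num BETWEEN 76 AND 100 THEN b.N4
             END;
```

With four left-table rows each matching roughly 1% of a million rows, this returns on the order of 40,000 rows.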

And it displays results that look like this (only the first five rows are shown, out of about 40,000 returned).

You can see that because the value of Num of the row in the left table is 53, it is matching on the third N column (N3) in the table JOINed to.

I was interested in the TIME and IO statistics, so I ran the query with them switched on and got these results.
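If you’re following along, the statistics can be switched on and off around the query under test like this:

```sql
SET STATISTICS TIME, IO ON;
-- ...run the query under test here...
SET STATISTICS TIME, IO OFF;
```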

While the query only took about five and a half seconds to run, it looked suspiciously expensive in terms of CPU.  The query plan for it was this.

[Figure: graphical query plan for the conditional JOIN query (2270-clip_image002.jpg)]

First off, we should not be surprised to see a Parallelism operator in the plans, because from our timing results we see that CPU time used is way in excess of the elapsed time for the query.

We also see that both tables are using a Clustered Index Scan operator, but when we look a little deeper into what is going on with the right table (by expanding the details of the Table Spool and Clustered Index Scan operators), we see that they’re trying to operate on four million (actual) rows of data, when there are only one million rows of data in the table!

Our first thought of course is that perhaps we can build an INDEX to speed this bad boy up a bit.  So let’s try that.
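The article’s actual CREATE INDEX statement isn’t shown; a plausible candidate, given the nonclustered index scan that appears in the next plan, is a covering index over the four N columns (the index name here is hypothetical):

```sql
CREATE NONCLUSTERED INDEX ix_BigTable_Ns
ON dbo.BigTable (N1, N2, N3, N4);
```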

When we run that same query again, we get these timing results.

And they are not very impressive, being of approximately the same order of magnitude as the first.  As to the query plan:

[Figure: graphical query plan after adding the index (2270-clip_image004.jpg)]

We see that while the INDEX we created is being used, it didn’t seem to help much.  If you repeat this query and hover your cursor over the Index Scan (NonClustered) and Table Spool operators in the graphical query plan, you will find that the same four million actual rows were being used in the query.

It is definitely time to rewrite this query and make it run faster by finding an alternative to the conditional join.

Instead of our conditional JOIN, we can partition the big table into four parts, doing successive JOINs on each of the partitions, and then recombining the parts like this (using UNION ALL):
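A sketch of that rewrite, using the same assumed table names as before, with one JOIN per range of Num and the pieces recombined by UNION ALL:

```sql
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N1 = l.Num
WHERE l.Num BETWEEN 1 AND 25
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N2 = l.Num
WHERE l.Num BETWEEN 26 AND 50
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N3 = l.Num
WHERE l.Num BETWEEN 51 AND 75
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N4 = l.Num
WHERE l.Num BETWEEN 76 AND 100;
```

Because each branch now compares Num against a single column with a simple equality predicate, the optimizer can handle each partition independently instead of evaluating a CASE expression against every row.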

While that appears to be a much more complex query (it certainly takes a lot longer to write even using copy and paste), we are pretty impressed by the timing results:

While this query is now running in less than half a second (with CPU time down by two orders of magnitude), it does produce a much more (seemingly) complex query plan.

[Figure: graphical query plan for the UNION ALL rewrite (2270-clip_image006.jpg)]

Even though SQL is doing four Clustered Index Scans of the big table, the results are overwhelmingly in favor of this query over the conditional JOIN.

SQL 2012 grumbled somewhat that an INDEX was missing that could improve the query a bit, so let’s go ahead and create the INDEX that it recommends and try again.

Now we get these timing results:

It looks to me like that recommended INDEX didn’t help one bit!  And I did in fact check the query plan and found that the optimizer used the recommended INDEX.

From my perspective, the rewrite worked well enough that I wouldn’t hesitate to use it in production without the recommended INDEX, thus saving the space that building that INDEX would entail, not to mention the overhead it would impose on INSERT, UPDATE and DELETE statements.

I should also point out that in our first (conditional JOIN) attempt, we were working with columns (N1, …, N4) that were all of the same data type.  If in your case they are not, the implicit conversions needed to bring them to a common data type are likely to add further performance penalties that you should watch for and consider.

Let’s drop that INDEX before we proceed with our next scenario.

Scenario 2: Conditional JOIN Based on an External Parameter

Sometimes you may have a SQL problem that makes you believe that you need to drive the conditional JOIN key column by a switch of some sort, perhaps one that is passed in as a parameter in a stored procedure.  Let’s look at a simplified example that does not employ a stored procedure:
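The article’s listing isn’t reproduced, but a simplified example along the described lines (table names again assumed) would be:

```sql
DECLARE @Switch INT = 3;    -- 1 to 4, selecting N1 to N4 respectively

SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b
  ON l.Num = CASE @Switch WHEN 1 THEN b.N1
                          WHEN 2 THEN b.N2
                          WHEN 3 THEN b.N3
                          WHEN 4 THEN b.N4
             END;
```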

Here the @Switch local variable controls the column we’ll JOIN on.  Timing results look suspiciously similar to the first case we explored.

Indeed, the query plan, which we won’t show here, looks remarkably similar to the first query plan we showed, right down to the four million actual row counts in the Clustered Index Scan and Table Spool operators.

Using the identical refactoring approach:
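That is, one branch per value of the switch, recombined with UNION ALL; a sketch under the same naming assumptions:

```sql
DECLARE @Switch INT = 3;    -- 1 to 4, selecting N1 to N4 respectively

SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N1 = l.Num
WHERE @Switch = 1
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N2 = l.Num
WHERE @Switch = 2
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N3 = l.Num
WHERE @Switch = 3
UNION ALL
SELECT l.Num, b.ID, b.N1, b.N2, b.N3, b.N4
FROM dbo.LeftTable AS l
JOIN dbo.BigTable AS b ON b.N4 = l.Num
WHERE @Switch = 4;
```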

We end up with a much more complicated looking query (with a similar, much more complicated query plan), that runs lickety-split!

Now, if you were passing the @Switch variable into a stored procedure, it is quite possible that you could run into parameter-sniffing issues in either or both of these cases.  But we won’t explore the details here (nor how to resolve them), as they’ve been covered pretty thoroughly by SQL MVP Erland Sommarskog in his excellent article on the subject, “Slow in the Application, Fast in SSMS?”

Conclusions

You can never be certain that code that looks elegant on-screen will be executed quickly. The query that uses a conditional JOIN may seem to solve the problem, but it is worth  exploring alternatives to make sure you don’t run into performance issues when your data grows large.

Some other lessons learned from the two scenarios we explored in this article are:

  •  The simplest looking query plan may not always perform the most efficiently.
  •  INDEXes recommended by SQL 2012 in the graphical query plan (and presumably by the Database Engine Tuning Advisor) may not always make your query perform faster, even when the optimizer chooses to use them.
  •  If performance counts, and it usually does, check your queries for performance issues and consider alternative ways of achieving the same result.

Finally, you may be wondering what happened in the case of my query that I mentioned early on.  It wasn’t quite as simple of course as the two scenarios we explored here.  It was using a big table generated by a schema-bound VIEW and there weren’t only four cases to consider (although there were four potential columns to JOIN upon).  The use of a conditional JOIN got the business solution working quickly, however when I tested the performance I found it wanting.  So I went that extra mile to improve upon it, and ultimately the extra effort yielded satisfactory results.

Thanks for reading,  folks!  I hope that one day you can use the lessons I learned here to improve the speed of at least one of your queries.


About the author

Dwain Camps

Dwain Camps has been a project manager for many years. Because performance of applications can be a critical success factor for projects, he has been evangelizing on the need to develop highly performing SQL. By mentoring and authoring articles on SQL, he hopes to train a future generation of software engineers on the right and wrong ways to deliver SQL code. He also has a special interest in developing solutions to complex, data intensive problems using high performance SQL because the declarative nature of SQL allows development of algorithmically unique solutions that procedural languages may not be capable of.