Query Optimizer and Cartesian Products

If you use the Query Optimizer (QO) enough, you will eventually run into this interesting observation: There are some situations where the QO just decides to remove a join instruction from a query, and that behavior, in turn, creates a Cartesian product (in other words, the same thing as a join but without the structured relationship – essentially a Cross Join). For instance, take the following example:

USE tempdb

DECLARE @tab1 TABLE(a INT)

INSERT INTO @tab1

SELECT TOP 1000 1

FROM sysobjects b, sysobjects a

SET STATISTICS io ON

SET STATISTICS time ON

SELECT *

FROM @tab1 a

INNER JOIN @tab1 b

ON a.a = b.a

SELECT *

FROM @tab1 a

INNER JOIN @tab1 b

ON a.a = b.a

WHERE a.a = 1

SET STATISTICS io OFF

SET STATISTICS time OFF

Both of the queries above return the same rows from the @tab1 table, but there is obviously a difference between the first instruction and the second. As we can easily see, the difference is the WHERE condition present in the second query (“a.a” must be equal to 1), and that is the key. The QO knows that there is a redundancy of filters – if “a.a” is equal to 1 and “a.a” is also equal to “b.a” (the JOIN condition), then obviously “b.a” will also be equal to 1. In that case, the QO removes the join condition and applies just the one “a = 1” filter to both “a” and “b” tables.

This behavior can be clearly seen when we look at the execution plan:

StmtText	Argument
select * from @tab1 a inner join @tab1 b on a.a = b.a where a.a = 1	NULL
\|–Nested Loops(Inner Join)	NULL
\|–Table Scan(OBJECT:(@tab1 AS [b]), WHERE:(@tab1.[a] as [b].[a]=(1)))	OBJECT:(@tab1 AS [b]), WHERE:(@tab1.[a] as [b].[a]=(1))
\|–Table Scan(OBJECT:(@tab1 AS [a]), WHERE:(@tab1.[a] as [a].[a]=(1)))	OBJECT:(@tab1 AS [a]), WHERE:(@tab1.[a] as [a].[a]=(1))

Take notice of the argument column; the QO has generated just one argument to read both tables, and there is no join condition in the Nested Loops. That characterizes a Cartesian product, in that for each row of the table ‘a’, the SQL will link with all rows of the table ‘b’.

Ok, but is that a bad thing? In this case, yes. To have that particular query use a hash algorithm to make the join will certainly be better than a Loop Join. If you’re not sure about that, just look at the first query, remembering that they are both semantically equivalent. If you then look at the plan for that first query (below), you can see the QO has generated a Hash Join to link the table:

StmtText	Argument
select * from @tab1 a inner join @tab1 b on a.a = b.a	NULL
\|–Hash Match(Inner Join, HASH:([b].[a])=([a].[a]), RESIDUAL:(@tab1.[a] as [b].[a]=@tab1.[a] as [a].[a]))	HASH:([b].[a])=([a].[a]), RESIDUAL:(@tab1.[a] as [b].[a]=@tab1.[a] as [a].[a])
\|–Table Scan(OBJECT:(@tab1 AS [b]))	OBJECT:(@tab1 AS [b])
\|–Table Scan(OBJECT:(@tab1 AS [a]))	OBJECT:(@tab1 AS [a])

If we then analyze the results of the IO and time statistics, we will see that the first query used less time and read fewer pages to return the data:

First query (Hash Join):
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. Table '#2E75B1C0'. Scan count 2, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times:    CPU time = 328 ms, elapsed time = 19976 ms.

Second query (Loop Join):
Table '#2E75B1C0'. Scan count 2, logical reads 2002, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0. SQL Server Execution Times:    CPU time = 407 ms, elapsed time = 20377 ms.

Now, if you’re clever, you’re probably thinking, “Hmmm, in that case we could just put a Hint in to force a Hash Join, yes?” Wrong (at least at this stage…). If you try this, you will receive the follow error message:

SELECT *

FROM @tab1 a

INNER JOIN @tab1 b

ON a.a = b.a

WHERE a.a = 1

OPTION(hash JOIN)

Msg 8622, Level 16, State 1, Line 20

Query processor could not produce a query plan because of the hints defined in this query. Resubmit the query without specifying any hints and without using SET FORCEPLAN.

When SQL removes the join condition, it needs to make a Cartesian product, and the QO can’t do this with a Hash or Merge Join, only with a Loop join. So in this case, the logic of the QO clashes with the hint.

What you can do to fix that? Wait.

Yes, that’s all you can do. Wait for the next version or point release of SQL Server, and trust the development team to imitate Oracle. To ‘fix’ this problem, Oracle has a handy parameter, which you can look here to learn more about. There is a Connect item talking about the problem, so we should keep watching this space.

That’s all folks.

Register for Simple Talk

Query Optimizer and Cartesian Products

Article tags

About the author

Fabiano Amorim

Fabiano Amorim's contributions

Articles

Books

Top topics

Fabiano Amorim's latest contributions:

SQL Compilations/sec is not what you think it is

Exploring errors to reveal unauthorized information

Tempdb – Here’s a Problem You Didn’t Know You Had

Article tags

Recommended

About the author

Fabiano Amorim

Fabiano Amorim's contributions

Articles

Books

Top topics

Fabiano Amorim's latest contributions:

SQL Compilations/sec is not what you think it is

Exploring errors to reveal unauthorized information

Tempdb – Here’s a Problem You Didn’t Know You Had