One of the things that separates a good programmer from a great one is a firm understanding of what is going on inside the computer. For some programming languages, it is very obvious what is going on inside the computer because you are working at a very low level. For example, if you are a C/C++ programmer writing an OS, you will know a lot about the hardware because you interact with it directly. As a .NET programmer you are more encapsulated from the hardware, making use of the .NET Framework.
None of the aforementioned programming languages comes anywhere close to the level of encapsulation that we SQL programmers work with. When you execute a statement like:
SELECT *
FROM Tablename
A firestorm of code is executed to optimize your query, find the data on disk, fetch that data, format it for presentation, and then send it to the client. And this is the super dumbed-down version. SQL is a declarative language: we basically format a question or task for the system to execute without telling it how. It is my favorite type of language because all of that pushing bits around gets tedious. However, what is important for the professional SQL programmer is to have some understanding of what is going on under the covers: understanding query plans, disk IO, CPU, etc. Not necessarily to the depth that Glenn Alan Berry (http://sqlserverperformance.wordpress.com/) does, but certainly a working knowledge.
Performance is the obvious reason, since it is clearly valuable to be able to optimize a query, but sometimes this knowledge can also come in handy to debug an issue you are having with a query. Today, I ran across an optimizer condition that, while perfectly understandable in functional terms, would have driven me closer to nuts if I hadn’t been able to read a query plan. The problem hinged on the number of rows returned: either the query worked perfectly, or it failed with an overflow condition. Each version of the query seemingly touched the exact same rows in the table where the overflow data exists… or did it?
The setup: the real query in which the problem was discovered was in our data warehouse, in a star schema configuration with 20+ joins. In the reproduction, I will use a simple table of numbers to serve as the primary table of the query.
create table ExampleTable -- It really doesn't matter what this table has. The dateValue column
                          -- will be used to join to the date table, which I will load from the
                          -- values I put in this table to make sure all data does exist.
(
    i int constraint PKExampleTable primary key,
    dateValue date
)
;with digits (i) as ( --(The code for this comes from my snippet page: http://www.drsql.org/Pages/Snippets.aspx)
    select 1 as i union all select 2 union all select 3 union all
    select 4 union all select 5 union all select 6 union all select 7 union all
    select 8 union all select 9 union all select 0
)
,sequence (i) as (
    select D1.i + (10*D2.i) + (100*D3.i) + (1000*D4.i) + (10000*D5.i) + (100000*D6.i)
    from digits as D1, digits as D2, digits as D3, digits as D4, digits as D5, digits as D6
)
insert into ExampleTable(i, dateValue)
select i, dateadd(day, i % 10, getdate()) -- Puts in 10 different date values
from sequence
where i > 0 and i < 1000
order by i
Next I will load the date table with all of the distinct dateValue values that we loaded into the ExampleTable, plus one more row holding the max value for the date datatype. In the “real world” case, this is one of the surrogate null values we use to indicate that a row has no end date. (Yes, we are ignoring the Y10K problem.)
create table date
(
    dateValue date constraint PKDate primary key
)
insert into date
select distinct dateValue
from ExampleTable
union all
select '99991231'
go
In typical usage, the number of rows is quite small. In our queries, we are adding 1 to the dateValue to establish a range of a day (in the real query it was actually a month). The following query, which returns 99 rows, executes successfully:
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
where i < 100
However, remove the where clause (causing the query to return 999 rows):
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
And you will see that this results in an overflow condition…
Msg 517, Level 16, State 3, Line 2
Adding a value to a 'date' column caused an overflow.
Hmmm, this could be one of those days where I don’t get a lot of sleep :). Next up, I check the max date value that can be returned.
--show that the top value that could be returned is < maxdate
select max(date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
At this point, I start feeling like I am going nuts. The value returned is 2013-01-30. So no data is actually returned that should be too large for our date column… So then I think, well, let’s add one to that value and take the max:
select max(date.dateValue), max(dateadd(day,1,date.dateValue))
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
This returns, mockingly:
Msg 517, Level 16, State 3, Line 2
Adding a value to a 'date' column caused an overflow.
So, since it worked with fewer rows earlier, I decide to try lowering the number of rows again, this time using a derived table, and it DOESN’T error out, even though it is obvious (because I stacked the deck… er, data) that the same data is just repeated for the dateValue, particularly since we get the same max dateValue as we did earlier.
select max(date.dateValue), max(dateadd(day,1,date.dateValue))
from (select top 100 * from ExampleTable order by i) as ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
Well, you are possibly thinking, this just doesn’t make sense. That is how I felt too after trying to do the logic in my head. I will admit that if I didn’t know about query plans, I would have been completely lost. But alas, the answer was fairly easily located in the plan. Taking a look at the plan for the query version that returns 99 rows:
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
where i < 100
We get the following estimated plan:
In this plan, it uses a Nested Loops operator, which basically does a seek into the date table for each of the 99 rows from the top input (the ExampleTable), fetching the matching date value, and then calculates the scalar value (dateadd(day,1,date.dateValue)) only for the rows that match. Since the 9999-12-31 date is never matched, there is no overflow.
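As an aside, if you want to examine the plan for a query without actually running it (handy here, since Msg 517 is a runtime error, not a compile-time one), you can ask SQL Server for the estimated plan only. A minimal sketch using SHOWPLAN_TEXT (SET SHOWPLAN statements must be alone in their batch, hence the GO separators):
--with showplan on, the statement is compiled but not executed,
--so even the failing 999-row version returns its plan instead of Msg 517
set showplan_text on
go
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
go
set showplan_text off
go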
However, when the size of the output reaches a certain tolerance (in this case, 999 rows instead of 99), as it does with the following query:
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
We get a different plan, one that is causing us issues:
Instead of nested loops, it uses a Hash Match join, which takes the entirety of the smaller input and builds an internal hash index on it (basically setting up buckets that can be scanned much faster than an entire table… in our case, probably just a single bucket), and then scans the other input, checking each row against the hash index.
It is in the process of building the hash index that our query runs into trouble. Since the date table is so much smaller, the plan builds the hash index on that table and pre-computes the scalar values as it does the scan; since there are only 11 rows in the date table, this is cheaper than calculating the value 999 times after the join. When it adds a day to the 9999-12-31 date, it fails.
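If you want to convince yourself that it really is the plan shape, and not the row count itself, that determines the failure, join hints make for a quick diagnostic experiment. This is only a sketch; I wouldn’t leave hints like these in production code, and the optimizer’s placement of the Compute Scalar could still vary:
--forcing nested loops on the full 999-row query: the scalar should only be
--computed for matching rows, so the 9999-12-31 sentinel is never touched
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
option (loop join)

--forcing a hash join on the 99-row query: if the Compute Scalar lands on the
--build input (the date table), as it did above, Msg 517 appears even for 99 rows
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
where i < 100
option (hash join)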
I know, the question of how practical this scenario is is bound to arise. I won’t lie to you and suggest that it is likely to happen to you as it did to me. However, the point of this blog isn’t that this one scenario is bound to happen to you, but rather that understanding how SQL Server executes queries will help give you insight to fix problems with your system: mostly performance, but sometimes even esoteric issues that won’t just leap out as being based on the query plan that was chosen. (For more reading on query plans, check out Grant Fritchey’s Simple-Talk book on query plans: http://www.amazon.com/Server-Execution-Plans-Grant-Fritchey/dp/1906434026.)
In the end, the fix to my problem was simple: make sure that the value that has meaning in the table, but not in the query, is filtered out:
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue
and date.dateValue < '9999-12-31'
Note: a commenter noted that in some cases, excluding the offensive data using the ON criteria/WHERE clause may not solve the issue. This is very true, and it will be made evident in the plan. I would expect the value to be more reliably excluded in the JOIN clause, but you really can’t guarantee what the optimizer might do without changing the source data (or representing the source data using a derived table, as follows):
select *, dateadd(day,1,date.dateValue)
from ExampleTable
join (select * from date where date.dateValue < '9999-12-31') as date
on date.dateValue = ExampleTable.dateValue
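Another option, if you don’t want to depend on where the optimizer places the filter at all, is to guard the scalar expression itself with CASE, so the dateadd simply never sees the sentinel value no matter which join algorithm is chosen. A sketch (CASE short-circuits for simple scalar expressions like this one, though be aware there are documented exceptions around aggregates):
select *,
       case when date.dateValue < '9999-12-31'
            then dateadd(day,1,date.dateValue)
            else date.dateValue --the sentinel passes through unchanged
       end as nextDateValue
from ExampleTable
join date
on date.dateValue = ExampleTable.dateValue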
Looking at the different variations in the plan, you should be able to diagnose a “hidden” problem such as I have described by following the flow of data and making sure that the filtering operation happens before the calculation of the scalar that causes the overflow error. This fix may harm performance even in the more “ideal” case where the query could have used indexes, so you may yet have more work to do… But this is what makes data programming fun, now isn’t it?