Assumptions are generally terrible things for programmers (and likely my next “What Counts” blog goes there), because what looks to be true when trying things out on your machine, may not be so at a larger scale. There are lots of examples out there, sorting when there’s a clustered index, not checking errors because you there aren’t going to be any to catch, not thinking that users will put in values larger than the column takes, etc, etc.
In this blog, I will outline a place where an assumption that a programmer made that was wrong was painfully wrong (and showed up less than .1% of the time, which in turn made the problem very hard to find!) I would love to blame this person who wrote the original code, but I believed the same thing, and would never have learned what I am going to show you without a tricky bug not showing itself in a weird way.
There was a query in our Data Warehouse code, used to set a start and end time for a Type 2 dimension, that essentially did the following:
1 2 |
SELECT CAST(SYSDATETIME() AS datetime2(3)) AS startTime, CAST(DATEADD(MILLISECOND,-1,SYSDATETIME()) AS datetime2(3)) AS previousEndTime |
It was attached to rows from a query, but suffice it to say that the goal output was something such as:
1 2 3 |
startTime previousEndTime --------------------------- --------------------------- 2018-06-18 20:43:17.002 2018-06-18 20:43:17.001 |
2 datetime values for the rows, with 1 millisecond difference. This code has executed many thousands of times since it was coded about 10 years ago. I have executed this in a single statement like this many, many times when testing the issue, and preparing for this blog, and have not witnessed any difference in a single execution. Occasionally, we saw cases where both values that originated in the rowset in SSIS were exactly the same. There is additional code that is called in the stack after this, so I never really thought it could be code such as this causing the issue.
I knew, of course that some functions worked by executing multiple times. For example, executing:
1 |
SELECT RAND(), RAND() |
Will almost certainly return two different values from two different executions. Of course, there is no guarantee that the values will be different, only the extreme likelihood of their difference (I didn’t see any duplication in millions of executions, but random is random, and someone generally wins the lottery too). Yet for some reason, I think the fact that some functions are guaranteed the same result, such as how the SEQUENCE object fetch value function NEXT VALUE FOR, will only fetch one per row, made me think that something like SYSDATETIME() would naturally return the same for multiple calls. The two executions of RAND() will return the same values over rows, but not in multiple columns.
Recently our team started a practice where every two weeks we stop and work on a small story that can be completed in a day, but never would make it to the top of the list. This bug never happened in our production environment, so it never bubbled up to critical. But man, a bug that happens randomly, only on one server… this was killing us. I did not believe that the aforementioned code could be the issue, but I couldn’t find any guarantees that the output would be the same, so I tried to see if I could find the issue that day. I was very confused by the result.
To try it out, created a temp table:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
DROP TABLE IF EXISTS #hold GO SET NOCOUNT ON; CREATE TABLE #hold ( --the entire sysdatetime value sysdatetime1 datetime2(7), sysdatetime2 datetime2(7), --cast to datetime2(3), 1 ms apart time1_3 datetime2(3), time2_3 datetime2(3) ); |
Then in a loop, I ran progressively greater sets of inserts, in this case 100000:
1 2 3 4 5 6 7 8 9 10 |
DECLARE @count int = 0; WHILE @count < 100000 BEGIN INSERT INTO #hold (sysdatetime1,sysdatetime2,time1_3,time2_3) VALUES (SYSDATETIME(),SYSDATETIME(),DATEADD(MILLISECOND,1,SYSDATETIME()), SYSDATETIME()); SET @count = @count + 1; END; |
After which, I compared the data. On my Surface Pro 4 i7, this query:
1 2 3 |
SELECT * FROM #hold WHERE sysdatetime1 <> sysdatetime2; |
On my last run for this blog, it returned 122 rows in my last execution, here are a couple of examples:
1 2 3 4 |
sysdatetime1 sysdatetime2 --------------------------- --------------------------- 2018-06-18 20:53:20.9017639 2018-06-18 20:53:20.9027165 2018-06-18 20:53:20.9296443 2018-06-18 20:53:20.9306442 |
The first was less than 1 millisecond, but still a difference. The second row was only .0000001 different. However, in the other two columns:
1 2 3 4 |
time1_3 time2_3 --------------------------- --------------------------- 2018-06-18 20:53:20.904 2018-06-18 20:53:20.903 2018-06-18 20:53:20.932 2018-06-18 20:53:20.931 |
The data is exactly as expected, even though the other two calls would have returned .902 and .903 if simply rounded off. On the other hand, looking for differences between the time1_3 and time2_3 columns:
1 2 3 |
SELECT * FROM #hold WHERE DATEDIFF(MILLISECOND,time1_3,time2_3) <> -1; |
Returns 133 rows. With the sysdatetime values being exactly the same:
1 2 3 4 |
sysdatetime1 sysdatetime2 --------------------------- --------------------------- 2018-06-18 20:53:21.0443399 2018-06-18 20:53:21.0443399 2018-06-18 20:53:21.0493653 2018-06-18 20:53:21.0493653 |
But the other columns, are incorrect for our needs, as the values are the same:
1 2 3 4 |
time1_3 time2_3 --------------------------- --------------------------- 2018-06-18 20:53:21.045 2018-06-18 20:53:21.045 2018-06-18 20:53:21.050 2018-06-18 20:53:21.050 |
This was the bug we experienced! Instead of being 1 millisecond different, the values were equal. And what looked like a random issue, was not actually random, but a matter of performance and luck. This test returns smaller number of rows on my development server than the Surface Pro 4, and fewer still in production. Since this happens less than .1% of the time, even on my Surface, it is amazing that it occurred at all, only happening just enough to make a person nuts.
Note: I tested this with GETDATE() as well, seeing the same issues. I did not try other functions, but if I ever need the column-wise values to the be same, I will not use this method.
So what is the correct way? Use a value stored in a variable instead of multiple function calls if you need to ensure that values are the same across columns, as well as rows:
1 2 3 4 5 6 7 8 9 10 |
DECLARE @count int = 0; WHILE @count < 100000 BEGIN DECLARE @sysdatetime datetime2 = SYSDATETIME(); INSERT INTO #hold (sysdatetime1,sysdatetime2,time1_3,time2_3) VALUES (@sysdatetime,@sysdatetime,DATEADD(MILLISECOND,1,@sysdatetime), @sysdatetime); SET @count = @count + 1; END; |
Data matches 100% of the time.
The moral of the story? Never assume something you can’t find documented as such. The biggest issue with this is that my test may have not returned any rows that didn’t match as I was testing it. Had it not, I may have never solved this issue.
Special thanks to Randolph West, Aaron Bertrand, Erland Sommarskog, and Paul White who helped me to realize that this was not something that was guaranteed, and it wasn’t the CAST operation that was causing it.
Load comments