Row-wise system function calls, always the same?

Assumptions are generally terrible things for programmers (and likely my next “What Counts” blog goes there), because what looks to be true when trying things out on your machine, may not be so at a larger scale. There are lots of examples out there, sorting when there’s a clustered index, not checking errors because you there aren’t going to be any to catch, not thinking that users will put in values larger than the column takes, etc, etc.

In this blog, I will outline a place where an assumption that a programmer made that was wrong was painfully wrong (and showed up less than .1% of the time, which in turn made the problem very hard to find!) I would love to blame this person who wrote the original code, but I believed the same thing, and would never have learned what I am going to show you without a tricky bug not showing itself in a weird way.

There was a query in our Data Warehouse code, used to set a start and end time for a Type 2 dimension, that essentially did the following:

It was attached to rows from a query, but suffice it to say that the goal output was something such as:

2 datetime values for the rows, with 1 millisecond difference. This code has executed many thousands of times since it was coded about 10 years ago. I have executed this in a single statement like this many, many times when testing the issue, and preparing for this blog, and have not witnessed any difference in a single execution. Occasionally, we saw cases where both values that originated in the rowset in SSIS were exactly the same. There is additional code that is called in the stack after this, so I never really thought it could be code such as this causing the issue.

I knew, of course that some functions worked by executing multiple times. For example, executing:

Will almost certainly return two different values from two different executions. Of course, there is no guarantee that the values will be different, only the extreme likelihood of their difference (I didn’t see any duplication in millions of executions, but random is random, and someone generally wins the lottery too). Yet for some reason, I think the fact that some functions are guaranteed the same result, such as how the SEQUENCE object fetch value function NEXT VALUE FOR, will only fetch one per row, made me think that something like SYSDATETIME() would naturally  return the same for multiple calls. The two executions of RAND() will return the same values over rows, but not in multiple columns.

Recently our team started a practice where every two weeks we stop and work on a small story that can be completed in a day, but never would make it to the top of the list. This bug never happened in our production environment, so it never bubbled up to critical. But man, a bug that happens randomly, only on one server… this was killing us. I did not believe that the aforementioned code could be the issue, but I couldn’t find any guarantees that the output would be the same, so I tried to see if I could find the issue that day. I was very confused by the result. 

To try it out, created a temp table:

Then in a loop, I ran progressively greater sets of inserts, in this case 100000:

After which, I compared the data. On my Surface Pro 4 i7, this query:

On my last run for this blog, it returned 122 rows in my last execution, here are a couple of examples:

The first was less than 1 millisecond, but still a difference. The second row was only .0000001 different. However, in the other two columns:

The data is exactly as expected, even though the other two calls would have returned .902 and .903 if simply rounded off. On the other hand, looking for differences between the time1_3 and time2_3 columns:

Returns 133 rows. With the sysdatetime values being exactly the same:

But the other columns, are incorrect for our needs, as the values are the same:

This was the bug we experienced! Instead of being 1 millisecond different, the values were equal. And what looked like a random issue, was not actually random, but a matter of performance and luck. This test returns smaller number of rows on my development server than the Surface Pro 4, and fewer still in production. Since this happens less than .1% of the time, even on my Surface, it is amazing that it occurred at all, only happening just enough to make a person nuts.

Note: I tested this with GETDATE() as well, seeing the same issues. I did not try other functions, but if I ever need the column-wise values to the be same, I will not use this method. 

So what is the correct way? Use a value stored in a variable instead of multiple function calls if you need to ensure that values are the same across columns, as well as rows:

Data matches 100% of the time. 

The moral of the story? Never assume something you can’t find documented as such. The biggest issue with this is that my test may have not returned any rows that didn’t match as I was testing it. Had it not, I may have never solved this issue.

Special thanks to Randolph West, Aaron Bertrand, Erland Sommarskog, and Paul White who helped me to realize that this was not something that was guaranteed, and it wasn’t the CAST operation that was causing it.