Filling In Missing Values Using the T-SQL Window Frame

Since SQL Server delivered the entire range of window functions, there has been far less justification for using the non-standard ex-Sybase 'Quirky Update' tricks to perform the many permutations of running totals in SQL Server. One of these related problems is the 'Data Smear'. Do window functions make this easier, and what is performance like? Dwain Camps investigates

SQL Server gets the full set of Window Aggregate functions.

When Microsoft updated the capabilities of the window aggregate functions in SQL 2012, they added the window frame syntax to the OVER clause.That got everybody talking about the cool new way that running totals could be calculated, for example there’s this blog by Microsoft Certified Master Wayne Sheffield called Running Totals in SQL “Denali” CTP3 (a pre-release version of SQL 2012).There’s also this whole series of articles on the window functions’ capability by noted author and SQL MVP Itzik Ben-Gan:

  1. How to Use Microsoft SQL Server 2012’s Window Functions, Part 1
  2. Microsoft SQL Server 2012: How to Write T-SQL Window Functions, Part 2
  3. SQL Server 2012: How to Write T-SQL Window Functions, Part 3

All of these articles are required reading, if only just to get an idea of the many ways that a window frame can be used.

The Data Smear Problem in SQL

Quirky Update

Unfortunately, the Quirky Update relies on behavior that results from implementation and this behavior is not guaranteed by Microsoft (meaning it could change in a future release of SQL Server). Others simply insist that there are too many rules for using it properly to ensure that you get the results you seek. Itzik Ben-Gan makes a pretty convincing case that this approach is not set-based and is simply a CURSOR in disguise (Ordered UPDATE and Set-Based Solutions to Running Aggregates). I have written about the Quirky Update in earlier articles on T-SQL, without prior reference to an authoritative source on the subject (my apologies to Mr. Ben-Gan for this), mainly because I really couldn’t find the details. The supporters of the Quirky Update mostly dismissed these arguments after a mere mention.

My position on the Quirky Update for the moment is sort of fence-sitting. I’ve always been bothered that it relies on implementation-based behavior, because if that changes when you upgrade your SQL Server, you’re going to have a devil of a time locating all the places where you used it. Nonetheless, it is likely that people who are knowledgeable about it and believe in it, will probably continue using it. So for that reason, I believe it is still fair game to write about it. Certainly, there’s probably little harm in using it for ad-hoc queries that never see the turnover to Production. But an open mind requires consideration of alternatives.

One of my favorite T-SQL challenges is to fill in missing (NULL) values.  This is a technique for cleaning data sets where a blank entry meant ‘continue with the value for this column that was in the previous non-blank row’: blanks being represented by NULLs.

Row Thickness
(mm)
width (metres) height (metres)
1 3 3 3
2 4 3
3 5 6
4 5 2 3
5 3 4

This was once engineering standard practice in printed lists and reports. Here, the first three rows refer to sheets that  have a thickness of three millimeters, and the next two rows refer to sheets of five millimeters. When transferred to the database, those blank entries appear as NULLS. How then do we clean up this data?

Prior to SQL 2012, the easiest and most performance-efficient method of doing this “data smear” was using a Quirky Update (QU), though much of this was done with cursors.  The Quirky Update method of updating column values using information from previous rows held in variables goes back to the earliest days of SQL Server, and was inherited from Sybase.

While the Quirky Update will still work in SQL 2012, the window frame option of the window functions now offers us a much safer, set-based and  standards-compliant  way of doing it. There is an opportunity for potentially simpler means of producing the data smear type of operation. Let’s first figure out what I’m talking about using a simple example.

Some Sample Data and the Quirky Update Solution

We’ll start by creating a temporary table and populating it with some data.

The results we get from the final SELECT look like this.

An important point to note about this sample data is that the number of NULL (unknown) values between known values is not consistently the same.

The idea is that we want to “smear” the value of v from id=”through” IDs=2 and 3.Then the value from id=”must” be smeared through IDs=5, 6 and 7, etc. so that in the end we are left with a data set with no NULL values for v.

This is quite simple to do with a Quirky Update, thusly:

The final SELECT gives us these results:

Note that we had to use two hints on that query: TABLOCKX and MAXDOP.That’s part of what I’m talking about when I say there’s lots of finicky rules to making the Quirky Update work properly (the other being the PRIMARY KEY/CLUSTERED INDEX on ID).For all of the rules, you can take a look at this article by SQL MVP Jeff Moden: “Solving the Running Total and Ordinal Rank Problems.” Incidentally, Jeff’s the same guy who introduced me to the terminology of the “data smear.”From this example, you can see just how apt that terminology is.Do take care though if you have other INDEXes on your table, because they can adversely affect this implementation-based behavior.

The Data Smear Using Microsoft SQL Server 2012

Before you can run this next example, you’ll want to repopulate your table with the original rows that contain the NULL values.

Let’s take a brief look at some code magic to help us understand how we can use a window frame to accomplish our data smear.

This produces the following results:

Prior to SQL 2012, the COUNT window aggregate function required a PARTITION within the OVER clause.Using ORDER BY instead, we are now forcing the COUNT to operate over a window frame.The result (c2) represents the full syntax for specifying the window frame.The result (c2) also explicitly introduces the window frame but the upper boundary of the set is omitted, defaulting to CURRENT ROW.The final result (c3) simply shows that the default window frame is the one we explicitly used in c1 and c2 (and is the most compact, equivalent syntax).

If we look at the results in any of the c columns, we immediately notice that within each value produced by the COUNT(v), we have only a single known value of v.We can use this calculated result to PARTITION our set.

In order to perform our data smear, now we just need to use a more traditional window aggregate (without the frame), and use one of our c columns as the PARTITION.We’ll use the MAX window aggregate, because we can always be sure (given default SQL Server settings) that the known value will always be greater than all the other unknown (NULL) values.

So from this we get the results we’d like to see in our smear.

In a previous article (The Performance of the T-SQL Window Functions), I found that sometimes performance using a window aggregate (like MAX/OVER/PARTITION) can be improved by pre-aggregation.Assuming the SQL Optimizer is smart enough to create a good query plan that doesn’t require multiple passes to perform the window frame (COUNT/OVER/ORDER BY), the following query which produces the same results might prove to be a little swifter.

Another approach would be to employ a ROW_NUMBER() OVER the c PARTITION to establish a variable jump-back, row increment to use with LAG.That probably requires an example to make it clear, so first we’ll create the variable jump-back.

Which produces this result:

For id=”we” aren’t really concerned with the value of rn because we already have a known value for v.For id=”rn=1″ and that gives us the jump-back value to use in LAG to pick up 121 out of the v column.Likewise for IDs 3, 4 and 5.We see rn drop back to 0 at id=”and” then begin incrementing again across each of the NULL following rows.

Now we can put this together with LAG and end up with something that is a bit more elaborate than either of the two prior techniques, but does produce the same results.

We’ll get around to showing the performance of these four alternatives in just a few minutes.

Another Simple Data Smear

Suppose we want something slightly different filling in the NULL values, for example a row number like 1, 2, 3, …The same PARTITION we constructed can be used in this case, and so can the ROW_NUMBER in the preceding example.

Which produces these results:

This could also be done with a Quirky Update, like this:

Suppose we wanted our smear to decrement through the smeared rows.That is a quite simple modification to the window function version (simply change to ORDER BY ID DESC in the ROW_NUMBER() OVER clause), but would be quite difficult using a Quirky Update.You’d need to reverse the order of the PRIMARY KEY/CLUSTERED INDEX.

A Different Data Smear we can do in SQL 2012

Let’s consider a slightly different case; one that cannot easily be accomplished with a Quirky Update.There are ways to accomplish it without using a window frame, and we’ll show one of those a little later.

Suppose we want to fill in our NULL values with the average of the previous and following values.So the value we’d like to see in rows 2 and 3 is the average of 121 and 312 (=216.5).

The SQL 2012 analytical function LEAD comes to mind.Just like LAG, LEAD requires that we know precisely how many rows we want to jump ahead.Unfortunately, the window frame concept doesn’t really help us here, because we can only jump ahead a fixed number of (or all) rows.

So let’s think some more about LEAD.We can construct a “forward jump value” that varies depending on which row we’re on, just like we created a backwards jump value in the previous example using LAG.We’ll use ROW_NUMBER() once again to accomplish this.

These results are:

In this result set, we can see that where the value of v is NULL, our value for n is precisely the number of rows we need to jump ahead (LEAD) to get the second of the two numbers we need (s is the other) to calculate our average.

Let’s do that.

This produces a result that smears our average through the NULL values.

Note that had our last row been a NULL value for v, the smear would have produced a NULL result for that last row also.But if you simply wanted to smear the last value through those trailing NULL rows, you could have specified s as the third argument to the LEAD function:

Smearing a Smoothed Result

Suppose we want a bit smoother transition from our trailing value to our following value.This could be particularly appealing if we have many NULL rows between sparse data points.Let’s illustrate this by looking at the first four rows of the above results set.

First let’s look at some intermediate results using ROW_NUMBER() and the COUNT window aggregate.

Our intermediate results look pretty useful:

You can see that all three of our calculated columns are what we need to produce the smoothing calculation shown prior.The first, s, is of course our original smear.The columns m and x, when m is divided by x, produce the multiplier we need.We have omitted the LEAD result, but in the end we’ll use that also (so n is likewise still required).

Putting that all together, we get this.

And this produces our nicely smoothed data smear:

We have just done a simple linear interpolation in T-SQL!

Performance Comparison of the Basic Data Smear

When I write an article, I always like to include a bit of a performance comparison.Good grief!Was that an expectation I just set? (Ed: No, we had the expectation already.)

Since we have four approaches to our basic data smear, that’s a good candidate.First we’ll construct a one million row test harness and NULL out the v column for approximately 525,000 rows.

The rest of the test harness, queries labelled 1-3 above plus the QU as query #4 can be found in the attached resources file, so you can run it yourself to see the benchmark on your system.Our results looked like this:

The QU was run last because once it updates the data smear into the existing column, you can’t go back.Obviously the QU is much faster for this case, which is not a surprise considering it has long held the reputation of being the fastest approach for the running totals problem.What was a bit of a surprise was that Query 2 was quite a bit slower than Query 1, with the LAG approach (Query 3) coming in around the middle of the three SQL 2012 solutions.

Let’s look at our smearing of the average value across the unknown rows.Here’s a traditional method that accomplishes the same thing.

Using the same basic test harness (a second resource file has been provided containing it), we get the followingresults.In this case, we’ve added some INDEXing to see what effect it would have, whereas we didn’t dare do so for the data smear above using the Quirky Update.

This isn’t looking very good at all for our window frame techniques! At least not for solving this particular problem in the way that I did. The traditional method for this more elaborate smear even outperforms the basic smear using the window function.

I think the most surprising part of these results was that the traditional approach bested the window frame approach by an order of magnitude.

Conclusions

While the window frame syntax of the OVER clause does offer new approaches to tackling traditional query problems, we have found (somewhat to our surprise) that they don’t always yield better performing results.

Even though the Quirky Update has its detractors, it is still the reigning performance champion on our simple data smears.Just please take care if you are one of its supporters, and take steps to ensure that you re-test it if you ever upgrade your SQL Server.

For the case of smearing our average, it was quite easy to come up with a solution that runs in SQL 2005 that outperformed the window frame alternative.

Finally, we’ll leave it as a challenge to our interested readers to come up with a good performing alternative to the linear interpolation data smear.Given the results we got, we’re thinking that might not be too hard.

It is always exhilarating to explore new techniques to traditional query problems.The learning experience of exploring these window frame variants, and reporting on the facts that we can conclude from them, is justification in itself for writing an article of this sort.

Just please remember, when you come up with new alternatives, you should always test, test and then test again to make sure your result meets the performance expectations you should always set for yourself.