SQL Server gets the full set of Window Aggregate functions.

When Microsoft updated the capabilities of the window aggregate functions in SQL 2012, they added the window frame syntax to the OVER clause.That got everybody talking about the cool new way that running totals could be calculated, for example there’s this blog by Microsoft Certified Master Wayne Sheffield called Running Totals in SQL “Denali” CTP3 (a pre-release version of SQL 2012).There’s also this whole series of articles on the window functions’ capability by noted author and SQL MVP Itzik Ben-Gan:

All of these articles are required reading, if only just to get an idea of the many ways that a window frame can be used.

The Data Smear Problem in SQL

Quirky Update

Unfortunately, the Quirky Update relies on behavior that results from implementation and this behavior is not guaranteed by Microsoft (meaning it could change in a future release of SQL Server). Others simply insist that there are too many rules for using it properly to ensure that you get the results you seek. Itzik Ben-Gan makes a pretty convincing case that this approach is not set-based and is simply a CURSOR in disguise (Ordered UPDATE and Set-Based Solutions to Running Aggregates). I have written about the Quirky Update in earlier articles on T-SQL, without prior reference to an authoritative source on the subject (my apologies to Mr. Ben-Gan for this), mainly because I really couldn’t find the details. The supporters of the Quirky Update mostly dismissed these arguments after a mere mention.

My position on the Quirky Update for the moment is sort of fence-sitting. I’ve always been bothered that it relies on implementation-based behavior, because if that changes when you upgrade your SQL Server, you’re going to have a devil of a time locating all the places where you used it. Nonetheless, it is likely that people who are knowledgeable about it and believe in it, will probably continue using it. So for that reason, I believe it is still fair game to write about it. Certainly, there’s probably little harm in using it for ad-hoc queries that never see the turnover to Production. But an open mind requires consideration of alternatives.

One of my favorite T-SQL challenges is to fill in missing (NULL) values. This is a technique for cleaning data sets where a blank entry meant ‘continue with the value for this column that was in the previous non-blank row’: blanks being represented by NULLs.

Row	Thickness (mm)	width (metres)	height (metres)
1	3	3	3
2		4	3
3		5	6
4	5	2	3
5		3	4

This was once engineering standard practice in printed lists and reports. Here, the first three rows refer to sheets that have a thickness of three millimeters, and the next two rows refer to sheets of five millimeters. When transferred to the database, those blank entries appear as NULLS. How then do we clean up this data?

Prior to SQL 2012, the easiest and most performance-efficient method of doing this “data smear” was using a Quirky Update (QU), though much of this was done with cursors. The Quirky Update method of updating column values using information from previous rows held in variables goes back to the earliest days of SQL Server, and was inherited from Sybase.

While the Quirky Update will still work in SQL 2012, the window frame option of the window functions now offers us a much safer, set-based and standards-compliant way of doing it. There is an opportunity for potentially simpler means of producing the data smear type of operation. Let’s first figure out what I’m talking about using a simple example.

Some Sample Data and the Quirky Update Solution

We’ll start by creating a temporary table and populating it with some data.

CREATE TABLE #X

(

IDINT IDENTITY PRIMARY KEY

,vINT

);

INSERT INTO #X

SELECT 121 UNION ALL SELECT NULL UNION ALL SELECT NULL

UNION ALL SELECT 312 UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL

UNION ALL SELECT 123 UNION ALL SELECT NULL UNION ALL SELECT NULL

UNION ALL SELECT 415 UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL UNION ALL SELECT NULL

UNION ALL SELECT 200;

SELECT * FROM #X;

The results we get from the final SELECT look like this.

ID v

1 121

2 NULL

3 NULL

4 312

5 NULL

6 NULL

7 NULL

8 123

9 NULL

10 NULL

11 415

12 NULL

13 NULL

14 NULL

15 NULL

16 200

An important point to note about this sample data is that the number of NULL (unknown) values between known values is not consistently the same.

The idea is that we want to “smear” the value of v from id=”through” IDs=2 and 3.Then the value from id=”must” be smeared through IDs=5, 6 and 7, etc. so that in the end we are left with a data set with no NULL values for v.

This is quite simple to do with a Quirky Update, thusly:

DECLARE @v INT;

UPDATE #x WITH(TABLOCKX)

SET @v = v = CASE WHEN v IS NULL THEN @v ELSE v END

OPTION(MAXDOP 1);

SELECT * FROM #x;

The final SELECT gives us these results:

ID v

1 121

2 121

3 121

4 312

5 312

6 312

7 312

8 123

9 123

10 123

11 415

12 415

13 415

14 415

15 415

16 200

Note that we had to use two hints on that query: TABLOCKX and MAXDOP.That’s part of what I’m talking about when I say there’s lots of finicky rules to making the Quirky Update work properly (the other being the PRIMARY KEY/CLUSTERED INDEX on ID).For all of the rules, you can take a look at this article by SQL MVP Jeff Moden: “Solving the Running Total and Ordinal Rank Problems.” Incidentally, Jeff’s the same guy who introduced me to the terminology of the “data smear.”From this example, you can see just how apt that terminology is.Do take care though if you have other INDEXes on your table, because they can adversely affect this implementation-based behavior.

The Data Smear Using Microsoft SQL Server 2012

Before you can run this next example, you’ll want to repopulate your table with the original rows that contain the NULL values.

Let’s take a brief look at some code magic to help us understand how we can use a window frame to accomplish our data smear.

SELECT ID, v

,c1=COUNT(v) OVER (ORDER BY ID ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW)

,c2=COUNT(v) OVER (ORDER BY ID ROWS UNBOUNDED PRECEDING)

,c3=COUNT(v) OVER (ORDER BY ID)

FROM #X

ORDER BY ID;

This produces the following results:

ID v c1 c2 c3

1 121 1 1 1

2 NULL 1 1 1

3 NULL 1 1 1

4 312 2 2 2

5 NULL 2 2 2

6 NULL 2 2 2

7 NULL 2 2 2

8 123 3 3 3

9 NULL 3 3 3

10 NULL 3 3 3

11 415 4 4 4

12 NULL 4 4 4

13 NULL 4 4 4

14 NULL 4 4 4

15 NULL 4 4 4

16 200 5 5 4

Prior to SQL 2012, the COUNT window aggregate function required a PARTITION within the OVER clause.Using ORDER BY instead, we are now forcing the COUNT to operate over a window frame.The result (c2) represents the full syntax for specifying the window frame.The result (c2) also explicitly introduces the window frame but the upper boundary of the set is omitted, defaulting to CURRENT ROW.The final result (c3) simply shows that the default window frame is the one we explicitly used in c1 and c2 (and is the most compact, equivalent syntax).

If we look at the results in any of the c columns, we immediately notice that within each value produced by the COUNT(v), we have only a single known value of v.We can use this calculated result to PARTITION our set.

In order to perform our data smear, now we just need to use a more traditional window aggregate (without the frame), and use one of our c columns as the PARTITION.We’ll use the MAX window aggregate, because we can always be sure (given default SQL Server settings) that the known value will always be greater than all the other unknown (NULL) values.

-- Query 1: Create Group Using Window Frame (GUWF)

SELECT ID, v, s=MAX(v) OVER (PARTITION BY c)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

So from this we get the results we’d like to see in our smear.

ID v s

1 121 121

2 NULL 121

3 NULL 121

4 312 312

5 NULL 312

6 NULL 312

7 NULL 312

8 123 123

9 NULL 123

10 NULL 123

11 415 415

12 NULL 415

13 NULL 415

14 NULL 415

15 NULL 415

16 200 200

In a previous article (The Performance of the T-SQL Window Functions), I found that sometimes performance using a window aggregate (like MAX/OVER/PARTITION) can be improved by pre-aggregation.Assuming the SQL Optimizer is smart enough to create a good query plan that doesn’t require multiple passes to perform the window frame (COUNT/OVER/ORDER BY), the following query which produces the same results might prove to be a little swifter.

-- Query 2: Create GUWF with pre-aggregation

WITH PartitionedDataPoints AS

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

GroupedDataPoints AS

(

SELECT c, s=MAX(v)

FROM PartitionedDataPoints

GROUP BY c

)

SELECT b.ID, b.v, a.s

FROM GroupedDataPoints a

JOIN PartitionedDataPoints b ON a.c = b.c

ORDER BY ID;

Another approach would be to employ a ROW_NUMBER() OVER the c PARTITION to establish a variable jump-back, row increment to use with LAG.That probably requires an example to make it clear, so first we’ll create the variable jump-back.

SELECT ID, v, rn=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID) - 1

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

Which produces this result:

ID v rn

1 121 121

2 NULL 1

3 NULL 2

4 312 312

5 NULL 1

6 NULL 2

7 NULL 3

8 123 123

9 NULL 1

10 NULL 2

11 415 415

12 NULL 1

13 NULL 2

14 NULL 3

15 NULL 4

16 200 200

For id=”we” aren’t really concerned with the value of rn because we already have a known value for v.For id=”rn=1″ and that gives us the jump-back value to use in LAG to pick up 121 out of the v column.Likewise for IDs 3, 4 and 5.We see rn drop back to 0 at id=”and” then begin incrementing again across each of the NULL following rows.

Now we can put this together with LAG and end up with something that is a bit more elaborate than either of the two prior techniques, but does produce the same results.

-- Query 3: Create GUWF with LAG

SELECT ID, v, s=ISNULL(v, LAG(v, rn) OVER (ORDER BY ID))

FROM

(

SELECT ID, v, rn=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID) - 1

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

We’ll get around to showing the performance of these four alternatives in just a few minutes.

Another Simple Data Smear

Suppose we want something slightly different filling in the NULL values, for example a row number like 1, 2, 3, …The same PARTITION we constructed can be used in this case, and so can the ROW_NUMBER in the preceding example.

SELECT ID, v, s=ISNULL(v, ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID) - 1)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

Which produces these results:

ID v s

1 121 121

2 NULL 1

3 NULL 2

4 312 312

5 NULL 1

6 NULL 2

7 NULL 3

8 123 123

9 NULL 1

10 NULL 2

11 415 415

12 NULL 1

13 NULL 2

14 NULL 3

15 NULL 4

16 200 200

This could also be done with a Quirky Update, like this:

DECLARE @v INT, @rn INT = 0;

UPDATE #x WITH(TABLOCKX)

SET @rn = CASE WHEN v IS NOT NULL THEN 0 ELSE @rn + 1 END

,v = ISNULL(v, @rn)

OPTION (MAXDOP 1);

SELECT * FROM #x;

Suppose we wanted our smear to decrement through the smeared rows.That is a quite simple modification to the window function version (simply change to ORDER BY ID DESC in the ROW_NUMBER() OVER clause), but would be quite difficult using a Quirky Update.You’d need to reverse the order of the PRIMARY KEY/CLUSTERED INDEX.

A Different Data Smear we can do in SQL 2012

Let’s consider a slightly different case; one that cannot easily be accomplished with a Quirky Update.There are ways to accomplish it without using a window frame, and we’ll show one of those a little later.

Suppose we want to fill in our NULL values with the average of the previous and following values.So the value we’d like to see in rows 2 and 3 is the average of 121 and 312 (=216.5).

The SQL 2012 analytical function LEAD comes to mind.Just like LAG, LEAD requires that we know precisely how many rows we want to jump ahead.Unfortunately, the window frame concept doesn’t really help us here, because we can only jump ahead a fixed number of (or all) rows.

So let’s think some more about LEAD.We can construct a “forward jump value” that varies depending on which row we’re on, just like we created a backwards jump value in the previous example using LAG.We’ll use ROW_NUMBER() once again to accomplish this.

SELECT ID, v, s=MAX(v) OVER (PARTITION BY c)

,n=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID DESC)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

These results are:

ID v s n

1 121 121 3

2 NULL 121 2

3 NULL 121 1

4 312 312 4

5 NULL 312 3

6 NULL 312 2

7 NULL 312 1

8 123 123 3

9 NULL 123 2

10 NULL 123 1

11 415 415 5

12 NULL 415 4

13 NULL 415 3

14 NULL 415 2

15 NULL 415 1

16 200 200 1

In this result set, we can see that where the value of v is NULL, our value for n is precisely the number of rows we need to jump ahead (LEAD) to get the second of the two numbers we need (s is the other) to calculate our average.

Let’s do that.

SELECT ID, v

,s=CASE

WHEN v IS NOT NULL THEN v

ELSE (s + LEAD(v, n) OVER (ORDER BY ID)) / 2.

END

FROM

(

SELECT ID, v, s=MAX(v) OVER (PARTITION BY c)

,n=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID DESC)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

This produces a result that smears our average through the NULL values.

ID v s

1 121 121.000000

2 NULL 216.500000

3 NULL 216.500000

4 312 312.000000

5 NULL 217.500000

6 NULL 217.500000

7 NULL 217.500000

8 123 123.000000

9 NULL 269.000000

10 NULL 269.000000

11 415 415.000000

12 NULL 307.500000

13 NULL 307.500000

14 NULL 307.500000

15 NULL 307.500000

16 200 200.000000

Note that had our last row been a NULL value for v, the smear would have produced a NULL result for that last row also.But if you simply wanted to smear the last value through those trailing NULL rows, you could have specified s as the third argument to the LEAD function:

1	LEAD(v, n, s) OVER (ORDER BY ID)

Smearing a Smoothed Result

Suppose we want a bit smoother transition from our trailing value to our following value.This could be particularly appealing if we have many NULL rows between sparse data points.Let’s illustrate this by looking at the first four rows of the above results set.

ID v s Instead we want to smooth our result

1 121 121.000000121.0000

2 NULL 216.500000121 + (1/3) * (312-121) = 184.6667

3 NULL 216.500000121 + (2/3) * (312-121) = 248.3333

4 312 312.000000312.0000

First let’s look at some intermediate results using ROW_NUMBER() and the COUNT window aggregate.

SELECT ID, v, s, m, x

FROM

(

SELECT ID, v, s=MAX(v) OVER (PARTITION BY c)

,n=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID DESC)

,m=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID) - 1

,x=1 + COUNT(CASE WHEN v IS NULL THEN 1 END) OVER (PARTITION BY c)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

Our intermediate results look pretty useful:

ID v s m x

1 121 121 0 3

2 NULL 121 1 3

3 NULL 121 2 3

4 312 312 0 4

5 NULL 312 1 4

6 NULL 312 2 4

7 NULL 312 3 4

8 123 123 0 3

9 NULL 123 1 3

10 NULL 123 2 3

11 415 415 0 5

12 NULL 415 1 5

13 NULL 415 2 5

14 NULL 415 3 5

15 NULL 415 4 5

16 200 200 0 1

You can see that all three of our calculated columns are what we need to produce the smoothing calculation shown prior.The first, s, is of course our original smear.The columns m and x, when m is divided by x, produce the multiplier we need.We have omitted the LEAD result, but in the end we’ll use that also (so n is likewise still required).

Putting that all together, we get this.

SELECT ID, v

,s=CASE

WHEN v IS NOT NULL THEN v

ELSE s + (1. * m / x) * (LEAD(v, n, s) OVER (ORDER BY ID) - s)

END

FROM

(

SELECT ID, v, s=MAX(v) OVER (PARTITION BY c)

,n=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID DESC)

,m=ROW_NUMBER() OVER (PARTITION BY c ORDER BY ID) - 1

,x=1 + COUNT(CASE WHEN v IS NULL THEN 1 END) OVER (PARTITION BY c)

FROM

(

SELECT ID, v

,c=COUNT(v) OVER (ORDER BY ID)

FROM #X

) a

ORDER BY ID;

And this produces our nicely smoothed data smear:

ID v s

1 121 121.000000

2 NULL 184.666667

3 NULL 248.333333

4 312 312.000000

5 NULL 264.750000

6 NULL 217.500000

7 NULL 170.250000

8 123 123.000000

9 NULL 220.333333

10 NULL 317.666667

11 415 415.000000

12 NULL 372.000000

13 NULL 329.000000

14 NULL 286.000000

15 NULL 243.000000

16 200 200.000000

We have just done a simple linear interpolation in T-SQL!

Performance Comparison of the Basic Data Smear

When I write an article, I always like to include a bit of a performance comparison.Good grief!Was that an expectation I just set? (Ed: No, we had the expectation already.)

Since we have four approaches to our basic data smear, that’s a good candidate.First we’ll construct a one million row test harness and NULL out the v column for approximately 525,000 rows.

CREATE TABLE #X

(

IDINT IDENTITY PRIMARY KEY

,vINT

);

SET NOCOUNT ON;

-- Insert 1,000,000 rows to the test harness

WITH Tally (n) AS

(

SELECT TOP 1000000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b

)

INSERT INTO #x

SELECT 1+ABS(CHECKSUM(NEWID()))%1000

FROM Tally;

PRINT 'Test harness Row Count: ' + CAST(@@ROWCOUNT AS VARCHAR(10));

-- NULL the v column out for approximately 525,000 rows (results will vary)

WITH Tally (n) AS

(

SELECT TOP 750000 1+ABS(CHECKSUM(NEWID()))%1000000

FROM sys.all_columns a CROSS JOIN sys.all_columns b

)

UPDATE #x

SET v = NULL

WHERE ID IN

(

SELECT n

FROM Tally

);

PRINT 'NULL Row Count: ' + CAST(@@ROWCOUNT AS VARCHAR(10));

The rest of the test harness, queries labelled 1-3 above plus the QU as query #4 can be found in the attached resources file, so you can run it yourself to see the benchmark on your system.Our results looked like this:

Test harness Row Count: 1000000

NULL Row Count: 527213

Query 1 Elapsed Time (ms): 17470

Query 2 Elapsed Time (ms): 41296

Query 3 Elapsed Time (ms): 29770

Query 4 (QU) Elapsed Time (ms): 1130

The QU was run last because once it updates the data smear into the existing column, you can’t go back.Obviously the QU is much faster for this case, which is not a surprise considering it has long held the reputation of being the fastest approach for the running totals problem.What was a bit of a surprise was that Query 2 was quite a bit slower than Query 1, with the LAG approach (Query 3) coming in around the middle of the three SQL 2012 solutions.

Let’s look at our smearing of the average value across the unknown rows.Here’s a traditional method that accomplishes the same thing.

SELECT ID, v, s=ISNULL(CAST(v AS DECIMAL(12,6)), (p+f)/2.)

FROM #x a

OUTER APPLY

(

SELECT TOP 1 p=v

FROM #x b

WHERE b.v IS NOT NULL AND b.ID < a.ID

ORDER BY ID DESC

) b6

OUTER APPLY

(

SELECT TOP 1 f=v

FROM #x c

WHERE c.v IS NOT NULL AND c.ID > a.ID

ORDER BY ID

) c

ORDER BY ID;

Using the same basic test harness (a second resource file has been provided containing it), we get the followingresults.In this case, we’ve added some INDEXing to see what effect it would have, whereas we didn’t dare do so for the data smear above using the Quirky Update.

Test harness Row Count: 1000000

NULL Row Count: 527143

Query 1: Smear average with a window frame (ms): 37073

Query 2: Smear average using 2 OUTER APPLYs (ms): 2750

CREATE INDEX x_ix1 ON #x(v);

Test harness Row Count: 1000000

NULL Row Count: 527815

Query 1: Smear average with a window frame (ms): 30170

Query 2: Smear average using 2 OUTER APPLYs (ms): 2620

CREATE INDEX x_ix1 ON #x(v) INCLUDE(ID);

Test harness Row Count: 1000000

NULL Row Count: 527557

Query 1: Smear average with a window frame (ms): 35983

Query 2: Smear average using 2 OUTER APPLYs (ms): 3033

This isn’t looking very good at all for our window frame techniques! At least not for solving this particular problem in the way that I did. The traditional method for this more elaborate smear even outperforms the basic smear using the window function.

I think the most surprising part of these results was that the traditional approach bested the window frame approach by an order of magnitude.

Conclusions

While the window frame syntax of the OVER clause does offer new approaches to tackling traditional query problems, we have found (somewhat to our surprise) that they don’t always yield better performing results.

Even though the Quirky Update has its detractors, it is still the reigning performance champion on our simple data smears.Just please take care if you are one of its supporters, and take steps to ensure that you re-test it if you ever upgrade your SQL Server.

For the case of smearing our average, it was quite easy to come up with a solution that runs in SQL 2005 that outperformed the window frame alternative.

Finally, we’ll leave it as a challenge to our interested readers to come up with a good performing alternative to the linear interpolation data smear.Given the results we got, we’re thinking that might not be too hard.

It is always exhilarating to explore new techniques to traditional query problems.The learning experience of exploring these window frame variants, and reporting on the facts that we can conclude from them, is justification in itself for writing an article of this sort.

Just please remember, when you come up with new alternatives, you should always test, test and then test again to make sure your result meets the performance expectations you should always set for yourself.

Register for Simple Talk

Filling In Missing Values Using the T-SQL Window Frame

SQL Server gets the full set of Window Aggregate functions.

The Data Smear Problem in SQL

Quirky Update

Some Sample Data and the Quirky Update Solution

The Data Smear Using Microsoft SQL Server 2012

Another Simple Data Smear

A Different Data Smear we can do in SQL 2012

Smearing a Smoothed Result

Performance Comparison of the Basic Data Smear

Conclusions

Article tags

About the author

Dwain Camps

Dwain Camps's contributions

Articles

Books

Top topics

Dwain Camps's latest contributions:

How to Avoid Conditional JOINs in T-SQL

Software Engineering: Just How Immature is it?

Using the T-SQL PERCENTILE Analytic Functions in SQL Server 2000, 2005 and 2008

SQL Server gets the full set of Window Aggregate functions.

The Data Smear Problem in SQL

Quirky Update

Some Sample Data and the Quirky Update Solution

The Data Smear Using Microsoft SQL Server 2012

Another Simple Data Smear

A Different Data Smear we can do in SQL 2012

Smearing a Smoothed Result

Performance Comparison of the Basic Data Smear

Conclusions

Article tags

Recommended

About the author

Dwain Camps

Dwain Camps's contributions

Articles

Books

Top topics

Dwain Camps's latest contributions:

How to Avoid Conditional JOINs in T-SQL

Software Engineering: Just How Immature is it?

Using the T-SQL PERCENTILE Analytic Functions in SQL Server 2000, 2005 and 2008