Introduction to T-SQL Window Functions

I’ve spent the past six years traveling around the US telling database professionals about T-SQL Window Functions at SQL Saturdays and other events. I’m amazed at how few people have heard about these functions and even fewer who are using them. At the end of each presentation, one or more people come up to say that they wished they learned about these functions years earlier because they could have been beneficial for so many queries.

These functions have been promoted to improve performance over other, more traditional methods. I partially agree. They make many queries easier to write, and, sometimes, they improve performance.

Nothing to do with the Windows OS

These functions are part of the ANSI SQL 2003 Standards and, in the case of SQL Server, are T-SQL functions used to write queries. They have nothing to do with the Windows operating system or any API calls. Other database systems, such as Oracle, have also included these as part of their own SQL language.

Window (also, windowing or windowed) functions perform a calculation over a set of rows. I like to think of “looking through the window” at the rows that are being returned and having one last chance to perform a calculation. The window is defined by the OVER clause which determines if the rows are partitioned into smaller sets and if they are ordered. In fact, if you use a window function you will always use an OVER clause. The OVER clause is also part of the NEXT VALUE FOR syntax required for the sequence object, but, otherwise it’s used with window functions.

The OVER clause may contain a PARTITION BY option. This breaks the rows into smaller sets. You might think that this is the same as GROUP BY, but it’s not. When grouping, one row per unique group is returned. When using PARTITION BY, all of the detail rows are returned along with the calculations. If you have a window in your home that is divided into panes, each pane is a window. When thinking about window functions, the entire set of results is a partition, but when using PARTITION BY, each partition can also be considered a window. PARTITION BY is supported – and optional – for all windowing functions.

The OVER clause may also contain an ORDER BY option. This is independent of the ORDER BY clause of the query. Some of the functions require ORDER BY, and it’s not supported by the others. When the order of the rows is important when applying the calculation, the ORDER BY is required.

Window functions may be used only in the SELECT and ORDER BY clauses of a query. They are applied after any joining, filtering, or grouping.

Ranking Functions

The most commonly used window functions, ranking functions, have been available since 2005. That’s when Microsoft introduced ROW_NUMBER, RANK, DENSE_RANK, and NTILE. ROW_NUMBER is used very frequently, to add unique row numbers to a partition or to the entire result set. Adding a row number, or one of the other ranking functions, is not usually the goal, but it is a step along the way to the solution.

ORDER BY is required in the OVER clause when using ROW_NUMBER and the other functions in this group. This tells the database engine the order in which the numbers should be applied. If the values of the columns or expressions used in the ORDER BY are not unique, then RANK and DENSE_RANK will deal with the ties, while ROW_NUMBER doesn’t care about ties. NTILE is used to divide the rows into buckets based on the ORDER BY.

One benefit of ROW_NUMBER is the ability to turn non-unique rows into unique rows. This could be used to eliminate duplicate rows, for example.

To show how this works, start with a temp table containing duplicate rows. The first step is to create the table and populate it.

CREATE TABLE #Duplicates(Col1 INT, Col2 CHAR(1));

INSERT INTO #Duplicates(Col1, Col2)

VALUES(1,'A'),(2,'B'),(2,'B'),(2,'B'),

(3,'C'),(4,'D'),(4,'D'),(5,'E'),

(5,'E'),(5,'E');

SELECT * FROM #Duplicates;

Adding ROW_NUMBER and partitioning by each column will restart the row numbers for each unique set of rows. You can identify the unique rows by finding those with a row number equal to one.

SELECT Col1, Col2,

ROW_NUMBER() OVER(PARTITION BY Col1, Col2 ORDER BY Col1) AS RowNum

FROM #Duplicates;

Now, all you have to do is to delete any rows that have a row number greater than one. The problem is that you cannot add window functions to the WHERE clause.

1 2	DELETE #Duplicates WHERE ROW_NUMBER() OVER(PARTITION BY Col1, Col2 ORDER BY Col1) <> 1;

You’ll see this error message:

The way around this problem is to separate the logic using a common table expression (CTE). You can then delete the rows right from the CTE.

WITH Dupes AS (

SELECT Col1, Col2,

ROW_NUMBER() OVER(PARTITION BY Col1, Col2 ORDER BY Col1) AS RowNum

FROM #Duplicates)

DELETE Dupes

WHERE RowNum <> 1;

SELECT * FROM #Duplicates;

Success! The extra rows were deleted, and a unique set of rows remains.

To see the difference between ROW_NUMBER, RANK, and DENSE_RANK, run this query:

USE Adventureworks2017; --Or whichever version you have

SELECT SalesOrderID, OrderDate, CustomerID,

ROW_NUMBER() OVER(ORDER BY OrderDate) As RowNum,

RANK() OVER(ORDER BY OrderDate) As Rnk,

DENSE_RANK() OVER(ORDER BY OrderDate) As DenseRnk

FROM Sales.SalesOrderHeader

WHERE CustomerID = 11330;

The ORDER BY for each OVER clause is OrderDate which is not unique. This customer placed two orders on 2013-10-24. ROW_NUMBER just continued assigning numbers and didn’t do anything different even though there is a duplicate date. RANK assigned 6 to both rows and then caught up to ROW_NUMBER with an 8 on the next row. DENSE_RANK also assigned 6 to the two rows but assigned 7 to the following row.

Two explain the difference, think of ROW_NUMBER as positional. RANK is both positional and logical. Those two rows are ranked logically the same, but the next row is ranked by the position in the set. DENSE_RANK ranks them logically. Order 2013-11-04 is the 7^th unique date.

The final function in this group is called NTILE. It assigns bucket numbers to the rows instead of row numbers or ranks. Here is an example:

SELECT SP.FirstName, SP.LastName,

SUM(SOH.TotalDue) AS TotalSales,

NTILE(4) OVER(ORDER BY SUM(SOH.TotalDue)) * 1000 AS Bonus

FROM [Sales].[vSalesPerson] SP

JOIN Sales.SalesOrderHeader SOH

ON SP.BusinessEntityID = SOH.SalesPersonID

WHERE SOH.OrderDate >= '2012-01-01' AND SOH.OrderDate < '2013-01-01'

GROUP BY FirstName, LastName;

NTILE has a parameter, in this case 4, which is the number of buckets you want to see in the results. The ORDER BY is applied to the sum of the sales. The rows with the lowest 25% are assigned 1, the rows with the highest 25% are assigned 4. Finally, the results of NTILE are multiplied by 1000 to come up with the bonus amount. Since 14 cannot be evenly divided by 4, an extra row goes into each of the first two buckets.

Window Aggregates

Window aggregates were also introduced with SQL Server 2005. These make writing some tricky queries easy but will often perform worse than older techniques. They allow you to add your favourite aggregate function to a non-aggregate query. Say, for example you would like to display all the customer orders along with the subtotal for each customer. By adding a SUM using the OVER clause, you can accomplish this very easily:

SELECT CustomerID, OrderDate, SalesOrderID, TotalDue,

SUM(TotalDue) OVER(PARTITION BY CustomerID) AS SubTotal

FROM Sales.SalesOrderHeader;

By adding the PARTITION BY, a subtotal is calculated for each customer. Any aggregate function can be used, and ORDER BY in the OVER clause is not supported.

Window Aggregate Enhancements in 2012

Beginning with 2012, you can add an ORDER BY to the OVER clause to window aggregates to produce running totals and moving averages, for example. At the same time, Microsoft introduced the concept of framing. Adding a PARTITION BY is like dividing a window into panes. Adding framing is like creating a stained-glass window. Each row has an individual window where the expression will be applied.

With this enhancement, you can create running totals even without adding the framing syntax. Here is an example that returns a running total by customer:

SELECT CustomerID, OrderDate, SalesOrderID, TotalDue,

SUM(TotalDue) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID)

AS RunningTotal

FROM Sales.SalesOrderHeader;

The default frame, which is used if a frame is not specified, is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. Unfortunately, this will not perform as well as if you specify this frame instead: ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. The difference is the word ROWS. RANGE is only partially implemented at this time, and it’s meant for working with periods of time, while ROWS is positional. The frame, ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, means that the window consists of the first row of the partition and all the rows up to the current row. Each calculation is done over a different set of rows. For example, when performing the calculation for row 4, the rows 1 to 4 are used.

When performing the calculation for row 5, the rows are 1 to 5. The window grows larger as you move from one row to the next.

You can also use the syntax ROWS BETWEEN N PRECEEDING AND CURRENT ROW or ROWS BETWEEN CURRENT ROW AND N FOLLOWING. This could be useful for calculating a three-month moving average, for example. The following figure represents ROWS BETWEEN 2 PRECEDING AND CURRENT ROW.

When 5 is the current row, the window moves; it doesn’t change size.

Here is the list of terms you need to know when writing the framing option:

I admit that this syntax is a bit confusing but using SQL Prompt helps makes writing the framing option easier!

Offset Functions

Also included with the release of SQL Server 2012 are four functions that allow you to include values from other rows – without doing a self-join. Microsoft calls these ‘analytic functions’, but I always refer to them as ‘offset functions’ when presenting on this topic. Two of the functions allow you to pull columns or expressions from a row before (LAG) or after (LEAD) the current row. The other two functions allow you to return values from the first row of the partition (FIRST_VALUE) or last row of the partition (LAST_VALUE). FIRST_VALUE and LAST_VALUE also require framing, so be sure to include the frame when using these functions. All four of the functions require the ORDER BY option of the OVER clause. That makes sense, because the database engine must know the order of the rows to figure out which row contains the value to return.

Some people have a favourite band; some people have a favourite movie. I have a favourite function – LAG. It’s easy to use (no frame!) and performs great. Here is an example:

SELECT CustomerID, OrderDate, SalesOrderID,

LAG(SalesOrderID) OVER(PARTITION BY CustomerID ORDER BY SalesOrderID

) AS PrevOrder

FROM Sales.SalesOrderHeader

ORDER BY CustomerID;

LAG and LEAD require an argument – the column or expression you want to return. By default, LAG returns the value from the previous row, and LEAD returns the value from the following row. You can modify that by supplying a value for the OFFSET parameter, which is 1 by default. Notice that the first row of the partition returns NULL. If you wish to override the NULLs, you can supply a DEFAULT value. Here is a similar query that goes back two rows and has a default value:

SELECT CustomerID, OrderDate, SalesOrderID,

LAG(SalesOrderID,2,0) OVER(PARTITION BY CustomerID

ORDER BY SalesOrderID) AS Back2Orders

FROM Sales.SalesOrderHeader;

FIRST_VALUE and LAST_VALUE can be used to find a value from the very first row or very last row of the partition. Be sure to specify the frame, not only for performance reasons, but because the default frame doesn’t work as you would expect with LAST_VALUE. The default frame, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW, only goes up to the current row. The last row of the partition is not included. To get the expected results, be sure to specify ROWS BETWEEN CURRENT ROW AND UNBOUNDED FOLLOWING when using LAST_VALUE. Here is an example using FIRST_VALUE:

SELECT CustomerID, OrderDate, SalesOrderID,

FIRST_VALUE(SalesOrderID) OVER(PARTITION BY CustomerID

ORDER BY SalesOrderID

ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS FirstOrder

FROM Sales.SalesOrderHeader;

Statistical Functions

Microsoft groups these four functions – PERCENT_RANK, CUME_DIST, PERCENTILE_DISC, PERCENTILE_CONT – along with the offset functions calling all eight the analytic functions. Since I like to distinguish these from the offset functions, I call these statistical.

PERCENT_RANK and CUME_DIST provide a ranking for each row over a partition. They differ slightly. PERCENT_RANK returns the percentage of rows that rank lower than the current row. “My score is higher than 90% of the scores.” CUME_DIST, or cumulative distribution, returns the exact rank. “My score is at 90% of the scores.” Here is an example using the average high temperature in St. Louis for each month. Note that the ranks were determined by the Fahrenheit temperature.

CREATE TABLE #MonthlyTempsStl(MName varchar(15), AvgHighTempF INT,

AvgHighTempC INT)

INSERT INTO #MonthlyTempsStl(MName, AvgHighTempF, AvgHighTempC)

VALUES('Jan',40,4),('Feb',45, 7),('Mar',56, 13),('Apr',67, 20),

('May',76,25),('Jun',85,30),('Jul',89,32),('Aug',88,31),

('Sep',80,27),('Oct',69,20),('Nov',56,13),('Dec',43,6);

SELECT MName, AvgHighTempF,AvgHighTempC,

RANK() OVER(ORDER BY AvgHighTempF) AS Rank,

PERCENT_RANK() OVER(ORDER BY AvgHighTempF) AS PercentRank,

CUME_DIST() OVER(ORDER BY AvgHighTempF) AS CumeDist

FROM #MonthlyTempsStl;

The ranks are not determined by the relative values, but by the positions of the rows. Notice that March and November have the same average high temp, so they were ranked the same.

You may be wondering how to calculate PERCENT_RANK and CUME_DIST. Here are the formulas:

1 2	PERCENT_RANK = (Rank -1)/(Row count -1) CUME_DIST = (Rank)/(Row count)

PERCENTILE_DISC and PERCENTILE_CONT work in the opposite way. Given a percent rank, find the value at that rank. They differ in that PERCENTILE_DISC will return a value that exists in the set while PERCENTILE_CONT will calculate an exact value if none of the values in the set falls precisely at that rank. You can use PERCENTILE_CONT to calculate a median by supplying 0.5 as the percent rank. For example, which temperature ranks at 50% in St. Louis?

SELECT MName, AvgHighTempF,AvgHighTempC,

RANK() OVER(ORDER BY AvgHighTempF) AS Rank,

PERCENTILE_CONT(0.5) WITHIN GROUP(ORDER BY AvgHighTempF)

OVER() AS PercentileCont,

PERCENTILE_DISC(0.5) WITHIN GROUP(ORDER BY AvgHighTempF)

OVER() AS PercentileDisc

FROM #MonthlyTempsStl;

The PERCENTILE_CONT function takes the average of the two values closest to the middle, 67 and 69, and averages them. PERCENTILE_DISC returns an exact value, 67. Also notice that these two functions have an extra clause not seen in the other functions, WITHIN GROUP, that contains the ORDER BY instead of within the OVER clause.

Summary

This article is a very quick overview of T-SQL window functions. Two types of functions were released with SQL Server 2005, the ranking functions and window aggregates. With 2012, you have enhanced window aggregate with framing and the analytic functions. I like to separate the analytic functions into two groups, the offset and statistical functions. Window functions make many queries easier to write, and I believe that is the main benefit. In some cases, the queries will perform better, too, but that is a discussion for another day.

I hope this article has inspired you to learn more about these fantastic functions!

Register for Simple Talk

Introduction to T-SQL Window Functions

Nothing to do with the Windows OS

Ranking Functions

Window Aggregates

Window Aggregate Enhancements in 2012

Offset Functions

Statistical Functions

Summary

Article tags

About the author

Kathi Kellenberger

Kathi's contributions

Articles

Books

Top topics

Kathi's latest contributions:

Introduction to GitHub Actions Part 3: Self-hosted runners

Introduction to GitHub Actions Part 2: GitHub-hosted runners

Introduction to GitHub Actions Part 1: Getting Started

Nothing to do with the Windows OS

Ranking Functions

Window Aggregates

Window Aggregate Enhancements in 2012

Offset Functions

Statistical Functions

Summary

Article tags

Recommended

About the author

Kathi Kellenberger

Kathi's contributions

Articles

Books

Top topics

Kathi's latest contributions:

Introduction to GitHub Actions Part 3: Self-hosted runners

Introduction to GitHub Actions Part 2: GitHub-hosted runners

Introduction to GitHub Actions Part 1: Getting Started