With the release of SQL Server 2005, Transact-SQL included limited support for a subset of set functions referred to as window functions. A set function is one that can be applied to a set of rows. By extension, a window function is one that can be applied to a partitioned set of rows (known as a window) in order to rank or aggregate values in that partition. Window functions have nothing to do with Microsoft Windows, but they do provide a way to work with data grouped into logical windows.
You implement window functions as part of a query’s SELECT expression. The window function can be a ranking function (described below) or aggregate function such as SUM or AVG. The function is then followed by an OVER clause that determines how data is partitioned or ordered. The way in which the OVER clause is implemented depends in part on whether you’re using a ranking window function or aggregate window function. The following syntax shows how both types of functions-along with their supporting OVER clause-are implemented in SQL Server:
1 2 3 4 5 6 7 8 9 |
<over_clause>::= ranking_function <over_clause_ranking> | aggregate_function <over_clause_aggregate> <over_clause_ranking>::= OVER([PARTITION BY column, ... [n]] <order_by_clause>) <over_clause_aggregate>::= OVER([PARTITION BY column, ... [n]]) |
As the syntax indicates, the OVER clause, when used with a ranking function, includes an optional PARTITION BY clause and a required ORDER BY clause. However, when used with an aggregate function, the OVER clause must include a PARTITION BY clause but cannot include an ORDER BY clause. The components that make up each type of clause and the differences between them will become clearer as we work through the article.
To that end, I’ve created several examples based on the AdventureWorks2008 database (implemented on a local instance of SQL Server 2008). If you want to run the examples against the AdventureWorks database in SQL Server 2005, be sure to change the database name wherever it’s mentioned in the code samples. In addition, you must change the name of the BusinessEntityID column to SalesPersonID where applicable.
Working with Ranking Window Functions
As the name suggests, ranking functions let you rank the rows in your result set based on specified values in those rows. SQL Server supports four ranking functions:
- ROW_NUMBER: Assigns a sequential number to each row in the result set.
- RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. However, the next number in the ranking sequence is skipped.
- DENSE_RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. The next number in the ranking sequence is then used to rank the row or rows that follow.
- NTILE: Divides the result set into the number of groups specified as an argument to the function. A group number is then assigned to each row identifying which group the row belongs to.
All ranking functions begin with the number 1 when assigning values and rank the data based on the column specified in the mandatory ORDER BY clause. As you saw in the syntax above, for each ranking function, you must include an OVER clause that contains the ORDER BY clause. In the ORDER BY clause, you identify the column on which to base the ranking. You also specify whether the rows should be sorted in ascending or descending order.
In the following SELECT statement, I use all four ranking functions to rank data based on the SalesLastYear column:
1 |
USE AdventureWorks2008; GOSELECT BusinessEntityID AS SalesID, FirstName + ' ' + LastName AS FullName, SalesLastYear, ROW_NUMBER() OVER(ORDER BY SalesLastYear ASC) AS RowNumber, RANK() OVER(ORDER BY SalesLastYear ASC) AS SalesRank, DENSE_RANK() OVER(ORDER BY SalesLastYear ASC) AS DenseRank, NTILE(4) OVER(ORDER BY SalesLastYear ASC) AS NTileRankFROM Sales.vSalesPerson; |
Notice that each ranking function is used as part of a column expression in the SELECT list and that only the NTILE function takes an argument (4), which in this case identifies the number of groups into which to divide the result set.
Following the function is the OVER clause. In each case, the OVER clause includes an ORDER BY clause that specifies the SalesLastYear column. In addition, the clause indicates that the sorting order is ascending (ASC). If you want the rows to be sorted in descending order, you must use DESC. The following table shows the results returned by the SELECT statement:
SalesID |
FullName |
SalesLastYear |
RowNumber |
SalesRank |
DenseRank |
NTileRank |
274 |
Stephen Jiang |
0.00 |
1 |
1 |
1 |
1 |
284 |
Tete Mensa-Annan |
0.00 |
2 |
1 |
1 |
1 |
285 |
Syed Abbas |
0.00 |
3 |
1 |
1 |
1 |
287 |
Amy Alberts |
0.00 |
4 |
1 |
1 |
1 |
288 |
Rachel Valdez |
1307949.7917 |
5 |
5 |
2 |
1 |
283 |
David Campbell |
1371635.3158 |
6 |
6 |
3 |
2 |
276 |
Linda Mitchell |
1439156.0291 |
7 |
7 |
4 |
2 |
278 |
Garrett Vargas |
1620276.8966 |
8 |
8 |
5 |
2 |
289 |
Jae Pak |
1635823.3967 |
9 |
9 |
6 |
2 |
275 |
Michael Blythe |
1750406.4785 |
10 |
10 |
7 |
3 |
279 |
Tsvi Reiter |
1849640.9418 |
11 |
11 |
8 |
3 |
280 |
Pamela Ansman-Wolfe |
1927059.178 |
12 |
12 |
9 |
3 |
277 |
Jillian Carson |
1997186.2037 |
13 |
13 |
10 |
3 |
282 |
José Saraiva |
2038234.6549 |
14 |
14 |
11 |
4 |
281 |
Shu Ito |
2073505.9999 |
15 |
15 |
12 |
4 |
286 |
Lynn Tsoflias |
2278548.9776 |
16 |
16 |
13 |
4 |
290 |
Ranjit Varkey Chudukatil |
2396539.7601 |
17 |
17 |
14 |
4 |
Each function ranks the results based on the values in SalesLastYear column (sorted in ascending order). The ROW_NUMBER function assigns a sequential number to each row according to that order. Because the result set returns 17 rows, the function assigns the values 1 through 17 to those rows, starting with the first SalesLastYear value (0.00).
The RANK function also starts with the number 1. However, because the first four rows are the same value (0.00), each row is ranked the same. The next value in SalesLastYear (1307949.7917) is located in the fifth row of the result set, so it receives a rank of 5. The function skips 2 through 4 because 1 is used four times. In addition, because none of the remaining values are repeated, the rest of the result set is ranked sequentially from 6 through 17.
The DENSE_RANK function, like the RANK function, also assigns the number 1 to the first four rows. However, on the fifth row, the function assigns a number 2, rather than skipping 5, and then ranks the rest of the result set sequentially. Consequently, the function assigns only the numbers 1 through 14 because 1 is repeated four times.
The NTILE function works differently from the other functions. Because 4 is specified as the argument to the function, the result set is divided into four groups. That division is based on the total number of rows divided by the number in the argument. That means, in this case, the first group includes five rows and the other three groups include four. The numbers 1 through 4 are assigned to the rows based on this division.
As you’ll recall from the syntax, the OVER clause can also include a PARTITION BY clause, which identifies how to partition the data into windows. In the following example, I include a PARTITION BY clause for each rank function in order to partition the data based on the values in the TerritoryGroup column:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 |
USE AdventureWorks2008; GO SELECT BusinessEntityID AS SalesID, TerritoryGroup, SalesLastYear, ROW_NUMBER() OVER(PARTITION BY TerritoryGroup ORDER BY SalesLastYear ASC) AS RowNumber, RANK() OVER(PARTITION BY TerritoryGroup ORDER BY SalesLastYear ASC) AS SalesRank, DENSE_RANK() OVER(PARTITION BY TerritoryGroup ORDER BY SalesLastYear ASC) AS DenseRank, NTILE(2) OVER(PARTITION BY TerritoryGroup ORDER BY SalesLastYear ASC) AS NTileRank FROM Sales.vSalesPerson; |
This example is nearly identical to the preceding example except for the addition of the PARTITION BY clause to each OVER clause. (I also changed the NTILE argument from 4 to 2.) The result set is now grouped according to values in the TerritoryGroup column. The ranking functions are then applied to each group, as shown in the following example:
SalesID |
TerritoryGroup |
SalesLastYear |
RowNumber |
SalesRank |
DenseRank |
NTileRank |
274 |
NULL |
0.00 |
1 |
1 |
1 |
1 |
285 |
NULL |
0.00 |
2 |
1 |
1 |
1 |
287 |
NULL |
0.00 |
3 |
1 |
1 |
2 |
288 |
Europe |
1307949.7917 |
1 |
1 |
1 |
1 |
289 |
Europe |
1635823.3967 |
2 |
2 |
2 |
1 |
290 |
Europe |
2396539.7601 |
3 |
3 |
3 |
2 |
284 |
North America |
0.00 |
1 |
1 |
1 |
1 |
283 |
North America |
1371635.3158 |
2 |
2 |
2 |
1 |
276 |
North America |
1439156.0291 |
3 |
3 |
3 |
1 |
278 |
North America |
1620276.8966 |
4 |
4 |
4 |
1 |
275 |
North America |
1750406.4785 |
5 |
5 |
5 |
1 |
279 |
North America |
1849640.9418 |
6 |
6 |
6 |
2 |
280 |
North America |
1927059.178 |
7 |
7 |
7 |
2 |
277 |
North America |
1997186.2037 |
8 |
8 |
8 |
2 |
282 |
North America |
2038234.6549 |
9 |
9 |
9 |
2 |
281 |
North America |
2073505.9999 |
10 |
10 |
10 |
2 |
286 |
Pacific |
2278548.9776 |
1 |
1 |
1 |
1 |
Because the TerritoryGroup column contains NULL values, those values form their own group, and the ranking functions are applied to that group. So ROW_NUMER assigns the values of 1 through 3 to the NULL rows, RANK and DENSE_RANK assign a value of 1 to each of the three rows (because the SalesLastYear values are the same), and NTILE splits the three rows into two groups, with two rows in the first group and one row in the second. This process is then repeated for each group, regardless of the number of rows.
Working with Aggregate Window Functions
In addition to applying ranking functions to partitioned data, you can also apply aggregate functions. The primary difference between the two-in terms of how you implement the functions-is that you cannot use the ORDER BY clause in the OVER clause associated with an aggregate function. A more important difference, however, is in the result set. An aggregate function is applied to all rows within the partition, rather than individual rows, with regard to calculating the aggregated data.
Let’s look at an example to demonstrate how this works. In the following SELECT statement, I partition the result set by the TerritoryName column and then apply the COUNT, SUM, and AVG aggregate functions to the SalesLastYear values in each partition:
1 2 3 4 5 6 7 8 9 10 11 12 |
USE AdventureWorks2008; GO SELECT TerritoryGroup, TerritoryName, SalesLastYear, COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt, SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl, AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg FROM Sales.vSalesPerson WHERE TerritoryGroup IS NOT NULL; |
As you can see, I pass the SalesLastYear column in as an argument to the aggregate functions. The functions will then be applied to each partition, as shown in the following results:
TerritoryGroup |
TerritoryName |
SalesLastYear |
SalesCnt |
SalesTtl |
SalesAvg |
Pacific |
Australia |
2278548.9776 |
1 |
2278548.9776 |
2278548.9776 |
North America |
Canada |
1620276.8966 |
2 |
3658511.5515 |
1829255.7757 |
North America |
Canada |
2038234.6549 |
2 |
3658511.5515 |
1829255.7757 |
North America |
Central |
1997186.2037 |
1 |
1997186.2037 |
1997186.2037 |
Europe |
France |
2396539.7601 |
1 |
2396539.7601 |
2396539.7601 |
Europe |
Germany |
1307949.7917 |
1 |
1307949.7917 |
1307949.7917 |
North America |
Northeast |
1750406.4785 |
1 |
1750406.4785 |
1750406.4785 |
North America |
Northwest |
1927059.178 |
3 |
3298694.4938 |
1099564.8312 |
North America |
Northwest |
1371635.3158 |
3 |
3298694.4938 |
1099564.8312 |
North America |
Northwest |
0.00 |
3 |
3298694.4938 |
1099564.8312 |
North America |
Southeast |
1849640.9418 |
1 |
1849640.9418 |
1849640.9418 |
North America |
Southwest |
1439156.0291 |
2 |
3512662.029 |
1756331.0145 |
North America |
Southwest |
2073505.9999 |
2 |
3512662.029 |
1756331.0145 |
Europe |
United Kingdom |
1635823.3967 |
1 |
1635823.3967 |
1635823.3967 |
The result set is, you’ll have noticed, grouped by the TerritoryName values. The aggregate functions are then applied to each row of the group, based on all the rows in the partition. For example, the Northwest territory group includes three rows. The SalesLastYear values in those rows range from 0.00 to 1927059.178. Not surprisingly, the COUNT function returns a value of 3. Notice, however, that this value appears in all three rows. The same is true for the SUM and AVG functions (3298694.4938 and 1099564.8312, respectively), with regard to the values being repeated in each row.
Although this is a lot of repetitive data, the structure of the SELECT statement itself is still much simpler than it would be if you could not take advantage of the aggregate window functions. However, I think the real strength in using window aggregate functions is when you include one of the functions in a larger column expression.
For instance, in the following example I’ve included an expression in the SELECT list that divides the SalesLastYear value by the total sales for each territory:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 |
USE AdventureWorks2008; GO SELECT FirstName + ' ' + LastName AS FullName, TerritoryName, SalesLastYear, COUNT(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesCnt, SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesTtl, AVG(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesAvg, SalesLastYear / SUM(SalesLastYear) OVER(PARTITION BY TerritoryName) AS SalesPct FROM Sales.vSalesPerson WHERE TerritoryGroup IS NOT NULL; |
Notice that I use the SUM aggregate function to calculate SalesLastYear values based on the TerritoryName partitions. That way, I can determine what percentage of a territory’s sales an individual salesperson generates, as shown in the following results:
FullName |
TerritoryName |
SalesLastYear |
SalesCnt |
SalesTtl |
SalesAvg |
SalesPct |
Lynn Tsoflias |
Australia |
2278548.9776 |
1 |
2278548.9776 |
2278548.9776 |
1.00 |
Garrett Vargas |
Canada |
1620276.8966 |
2 |
3658511.5515 |
1829255.7757 |
0.4428 |
José Saraiva |
Canada |
2038234.6549 |
2 |
3658511.5515 |
1829255.7757 |
0.5571 |
Jillian Carson |
Central |
1997186.2037 |
1 |
1997186.2037 |
1997186.2037 |
1.00 |
Ranjit Varkey Chudukatil |
France |
2396539.7601 |
1 |
2396539.7601 |
2396539.7601 |
1.00 |
Rachel Valdez |
Germany |
1307949.7917 |
1 |
1307949.7917 |
1307949.7917 |
1.00 |
Michael Blythe |
Northeast |
1750406.4785 |
1 |
1750406.4785 |
1750406.4785 |
1.00 |
Pamela Ansman-Wolfe |
Northwest |
1927059.178 |
3 |
3298694.4938 |
1099564.8312 |
0.5841 |
David Campbell |
Northwest |
1371635.3158 |
3 |
3298694.4938 |
1099564.8312 |
0.4158 |
Tete Mensa-Annan |
Northwest |
0.00 |
3 |
3298694.4938 |
1099564.8312 |
0.00 |
Tsvi Reiter |
Southeast |
1849640.9418 |
1 |
1849640.9418 |
1849640.9418 |
1.00 |
Linda Mitchell |
Southwest |
1439156.0291 |
2 |
3512662.029 |
1756331.0145 |
0.4097 |
Shu Ito |
Southwest |
2073505.9999 |
2 |
3512662.029 |
1756331.0145 |
0.5902 |
Jae Pak |
United Kingdom |
1635823.3967 |
1 |
1635823.3967 |
1635823.3967 |
1.00 |
As you can see, the results include a percentage of sales for each salesperson. For example, in the Southwest group, Linda Mitchell generated about 41% of the sales, and Shu Ito about 59%. On the other hand, Lynn Tsoflias in the Australia group generated 100% of the sales because she’s the only one in the group. Although the result set also shows the COUNT, SUM, and AVG totals, I’ve included them only to verify the data. Ultimately, you can create your statements to include only the calculated data you need.
Moving Forward
At the beginning of the article, I mentioned that SQL Server includes only limited support for window functions, which in this case are the ranking and aggregate functions. However, the window functions feature set, as defined in the ANSI SQL standards, is far more extensive than what SQL Server currently supports. Here’s a sample of the type of functionality that Microsoft has yet to implement:
- Support for the ORDER BY clause when working with aggregate functions
- Implementation of the DISTINCT clause when working with aggregate functions
- Implementation of the ROWS and RANGE framing functions to help define the window frame type
- Implementation of the LAG and LEAD offset functions to compare values based on relative positions
- Implementation of the FIRST_VALUE and LAST_VALUE offset functions to work with ranked rows
- Implementation of the WINDOW clause to reuse the window definition
A number of SQL Server developers have requested that Microsoft include these and other window function enhancements in their next release of SQL Server. For many, extending this feature set is one of their top priorities for the next implementation of Transact-SQL. Until then, we must be satisfied with the limited functionality that’s currently supported, which includes the ability to rank and aggregate partitioned data. That said, the features I’ve shown you can be useful tools when you need to return this type of data. For more information on how to implement window functions, particularly the OVER clause, check out the topic “OVER Clause (Transact-SQL)” in SQL Server Books Online.
Load comments