Working with Window Functions in SQL Server

With SQL Server 2005, SQL Server introduced some of SQL's window functions, that apply, not to the full set, but a partitioned 'window'. Although the ROW_NUMBER, RANK, NTILE and DENSE_RANK bring great power to TSQL, the full versatility will not be available until SQL Server delivers the full implementation. As usual, Robert Sheldon explains all.

With the release of SQL Server 2005, Transact-SQL included limited support for a subset of set functions referred to as window functions. A set function is one that can be applied to a set of rows. By extension, a window function is one that can be applied to a partitioned set of rows (known as a window) in order to rank or aggregate values in that partition. Window functions have nothing to do with Microsoft Windows, but they do provide a way to work with data grouped into logical windows.

You implement window functions as part of a query’s SELECT expression. The window function can be a ranking function (described below) or aggregate function such as SUM or AVG. The function is then followed by an OVER clause that determines how data is partitioned or ordered. The way in which the OVER clause is implemented depends in part on whether you’re using a ranking window function or aggregate window function. The following syntax shows how both types of functions-along with their supporting OVER clause-are implemented in SQL Server:

As the syntax indicates, the OVER clause, when used with a ranking function, includes an optional PARTITION BY clause and a required ORDER BY clause. However, when used with an aggregate function, the OVER clause must include a PARTITION BY clause but cannot include an ORDER BY clause. The components that make up each type of clause and the differences between them will become clearer as we work through the article.

To that end, I’ve created several examples based on the AdventureWorks2008 database (implemented on a local instance of SQL Server 2008). If you want to run the examples against the AdventureWorks database in SQL Server 2005, be sure to change the database name wherever it’s mentioned in the code samples. In addition, you must change the name of the BusinessEntityID column to SalesPersonID where applicable.

Working with Ranking Window Functions

As the name suggests, ranking functions let you rank the rows in your result set based on specified values in those rows. SQL Server supports four ranking functions:

  • ROW_NUMBER: Assigns a sequential number to each row in the result set.
  • RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. However, the next number in the ranking sequence is skipped.
  • DENSE_RANK: Ranks each row in the result set. If values in the ranking column are the same, they receive the same rank. The next number in the ranking sequence is then used to rank the row or rows that follow.
  • NTILE: Divides the result set into the number of groups specified as an argument to the function. A group number is then assigned to each row identifying which group the row belongs to.

All ranking functions begin with the number 1 when assigning values and rank the data based on the column specified in the mandatory ORDER BY clause. As you saw in the syntax above, for each ranking function, you must include an OVER clause that contains the ORDER BY clause. In the ORDER BY clause, you identify the column on which to base the ranking. You also specify whether the rows should be sorted in ascending or descending order.

In the following SELECT statement, I use all four ranking functions to rank data based on the SalesLastYear column:

Notice that each ranking function is used as part of a column expression in the SELECT list and that only the NTILE function takes an argument (4), which in this case identifies the number of groups into which to divide the result set.

Following the function is the OVER clause. In each case, the OVER clause includes an ORDER BY clause that specifies the SalesLastYear column. In addition, the clause indicates that the sorting order is ascending (ASC). If you want the rows to be sorted in descending order, you must use DESC. The following table shows the results returned by the SELECT statement:

SalesID

FullName

SalesLastYear

RowNumber

SalesRank

DenseRank

NTileRank

274

Stephen Jiang

0.00

1

1

1

1

284

Tete Mensa-Annan

0.00

2

1

1

1

285

Syed Abbas

0.00

3

1

1

1

287

Amy Alberts

0.00

4

1

1

1

288

Rachel Valdez

1307949.7917

5

5

2

1

283

David Campbell

1371635.3158

6

6

3

2

276

Linda Mitchell

1439156.0291

7

7

4

2

278

Garrett Vargas

1620276.8966

8

8

5

2

289

Jae Pak

1635823.3967

9

9

6

2

275

Michael Blythe

1750406.4785

10

10

7

3

279

Tsvi Reiter

1849640.9418

11

11

8

3

280

Pamela Ansman-Wolfe

1927059.178

12

12

9

3

277

Jillian Carson

1997186.2037

13

13

10

3

282

José Saraiva

2038234.6549

14

14

11

4

281

Shu Ito

2073505.9999

15

15

12

4

286

Lynn Tsoflias

2278548.9776

16

16

13

4

290

Ranjit Varkey Chudukatil

2396539.7601

17

17

14

4

Each function ranks the results based on the values in SalesLastYear column (sorted in ascending order). The ROW_NUMBER function assigns a sequential number to each row according to that order. Because the result set returns 17 rows, the function assigns the values 1 through 17 to those rows, starting with the first SalesLastYear value (0.00).

The RANK function also starts with the number 1. However, because the first four rows are the same value (0.00), each row is ranked the same. The next value in SalesLastYear (1307949.7917) is located in the fifth row of the result set, so it receives a rank of 5. The function skips 2 through 4 because 1 is used four times. In addition, because none of the remaining values are repeated, the rest of the result set is ranked sequentially from 6 through 17.

The DENSE_RANK function, like the RANK function, also assigns the number 1 to the first four rows. However, on the fifth row, the function assigns a number 2, rather than skipping 5, and then ranks the rest of the result set sequentially. Consequently, the function assigns only the numbers 1 through 14 because 1 is repeated four times.

The NTILE function works differently from the other functions. Because 4 is specified as the argument to the function, the result set is divided into four groups. That division is based on the total number of rows divided by the number in the argument. That means, in this case, the first group includes five rows and the other three groups include four. The numbers 1 through 4 are assigned to the rows based on this division.

As you’ll recall from the syntax, the OVER clause can also include a PARTITION BY clause, which identifies how to partition the data into windows. In the following example, I include a PARTITION BY clause for each rank function in order to partition the data based on the values in the TerritoryGroup column:

This example is nearly identical to the preceding example except for the addition of the PARTITION BY clause to each OVER clause. (I also changed the NTILE argument from 4 to 2.) The result set is now grouped according to values in the TerritoryGroup column. The ranking functions are then applied to each group, as shown in the following example:

SalesID

TerritoryGroup

SalesLastYear

RowNumber

SalesRank

DenseRank

NTileRank

274

NULL

0.00

1

1

1

1

285

NULL

0.00

2

1

1

1

287

NULL

0.00

3

1

1

2

288

Europe

1307949.7917

1

1

1

1

289

Europe

1635823.3967

2

2

2

1

290

Europe

2396539.7601

3

3

3

2

284

North America

0.00

1

1

1

1

283

North America

1371635.3158

2

2

2

1

276

North America

1439156.0291

3

3

3

1

278

North America

1620276.8966

4

4

4

1

275

North America

1750406.4785

5

5

5

1

279

North America

1849640.9418

6

6

6

2

280

North America

1927059.178

7

7

7

2

277

North America

1997186.2037

8

8

8

2

282

North America

2038234.6549

9

9

9

2

281

North America

2073505.9999

10

10

10

2

286

Pacific

2278548.9776

1

1

1

1

Because the TerritoryGroup column contains NULL values, those values form their own group, and the ranking functions are applied to that group. So ROW_NUMER assigns the values of 1 through 3 to the NULL rows, RANK and DENSE_RANK assign a value of 1 to each of the three rows (because the SalesLastYear values are the same), and NTILE splits the three rows into two groups, with two rows in the first group and one row in the second. This process is then repeated for each group, regardless of the number of rows.

Working with Aggregate Window Functions

In addition to applying ranking functions to partitioned data, you can also apply aggregate functions. The primary difference between the two-in terms of how you implement the functions-is that you cannot use the ORDER BY clause in the OVER clause associated with an aggregate function. A more important difference, however, is in the result set. An aggregate function is applied to all rows within the partition, rather than individual rows, with regard to calculating the aggregated data.

Let’s look at an example to demonstrate how this works. In the following SELECT statement, I partition the result set by the TerritoryName column and then apply the COUNT, SUM, and AVG aggregate functions to the SalesLastYear values in each partition:

As you can see, I pass the SalesLastYear column in as an argument to the aggregate functions. The functions will then be applied to each partition, as shown in the following results:

TerritoryGroup

TerritoryName

SalesLastYear

SalesCnt

SalesTtl

SalesAvg

Pacific

Australia

2278548.9776

1

2278548.9776

2278548.9776

North America

Canada

1620276.8966

2

3658511.5515

1829255.7757

North America

Canada

2038234.6549

2

3658511.5515

1829255.7757

North America

Central

1997186.2037

1

1997186.2037

1997186.2037

Europe

France

2396539.7601

1

2396539.7601

2396539.7601

Europe

Germany

1307949.7917

1

1307949.7917

1307949.7917

North America

Northeast

1750406.4785

1

1750406.4785

1750406.4785

North America

Northwest

1927059.178

3

3298694.4938

1099564.8312

North America

Northwest

1371635.3158

3

3298694.4938

1099564.8312

North America

Northwest

0.00

3

3298694.4938

1099564.8312

North America

Southeast

1849640.9418

1

1849640.9418

1849640.9418

North America

Southwest

1439156.0291

2

3512662.029

1756331.0145

North America

Southwest

2073505.9999

2

3512662.029

1756331.0145

Europe

United Kingdom

1635823.3967

1

1635823.3967

1635823.3967

The result set is, you’ll have noticed, grouped by the TerritoryName values. The aggregate functions are then applied to each row of the group, based on all the rows in the partition. For example, the Northwest territory group includes three rows. The SalesLastYear values in those rows range from 0.00 to 1927059.178. Not surprisingly, the COUNT function returns a value of 3. Notice, however, that this value appears in all three rows. The same is true for the SUM and AVG functions (3298694.4938 and 1099564.8312, respectively), with regard to the values being repeated in each row.

Although this is a lot of repetitive data, the structure of the SELECT statement itself is still much simpler than it would be if you could not take advantage of the aggregate window functions. However, I think the real strength in using window aggregate functions is when you include one of the functions in a larger column expression.

For instance, in the following example I’ve included an expression in the SELECT list that divides the SalesLastYear value by the total sales for each territory:

Notice that I use the SUM aggregate function to calculate SalesLastYear values based on the TerritoryName partitions. That way, I can determine what percentage of a territory’s sales an individual salesperson generates, as shown in the following results:

FullName

TerritoryName

SalesLastYear

SalesCnt

SalesTtl

SalesAvg

SalesPct

Lynn Tsoflias

Australia

2278548.9776

1

2278548.9776

2278548.9776

1.00

Garrett Vargas

Canada

1620276.8966

2

3658511.5515

1829255.7757

0.4428

José Saraiva

Canada

2038234.6549

2

3658511.5515

1829255.7757

0.5571

Jillian Carson

Central

1997186.2037

1

1997186.2037

1997186.2037

1.00

Ranjit Varkey Chudukatil

France

2396539.7601

1

2396539.7601

2396539.7601

1.00

Rachel Valdez

Germany

1307949.7917

1

1307949.7917

1307949.7917

1.00

Michael Blythe

Northeast

1750406.4785

1

1750406.4785

1750406.4785

1.00

Pamela Ansman-Wolfe

Northwest

1927059.178

3

3298694.4938

1099564.8312

0.5841

David Campbell

Northwest

1371635.3158

3

3298694.4938

1099564.8312

0.4158

Tete Mensa-Annan

Northwest

0.00

3

3298694.4938

1099564.8312

0.00

Tsvi Reiter

Southeast

1849640.9418

1

1849640.9418

1849640.9418

1.00

Linda Mitchell

Southwest

1439156.0291

2

3512662.029

1756331.0145

0.4097

Shu Ito

Southwest

2073505.9999

2

3512662.029

1756331.0145

0.5902

Jae Pak

United Kingdom

1635823.3967

1

1635823.3967

1635823.3967

1.00

As you can see, the results include a percentage of sales for each salesperson. For example, in the Southwest group, Linda Mitchell generated about 41% of the sales, and Shu Ito about 59%. On the other hand, Lynn Tsoflias in the Australia group generated 100% of the sales because she’s the only one in the group. Although the result set also shows the COUNT, SUM, and AVG totals, I’ve included them only to verify the data. Ultimately, you can create your statements to include only the calculated data you need.

Moving Forward

At the beginning of the article, I mentioned that SQL Server includes only limited support for window functions, which in this case are the ranking and aggregate functions. However, the window functions feature set, as defined in the ANSI SQL standards, is far more extensive than what SQL Server currently supports. Here’s a sample of the type of functionality that Microsoft has yet to implement:

  • Support for the ORDER BY clause when working with aggregate functions
  • Implementation of the DISTINCT clause when working with aggregate functions
  • Implementation of the ROWS and RANGE framing functions to help define the window frame type
  • Implementation of the LAG and LEAD offset functions to compare values based on relative positions
  • Implementation of the FIRST_VALUE and LAST_VALUE offset functions to work with ranked rows
  • Implementation of the WINDOW clause to reuse the window definition

A number of SQL Server developers have requested that Microsoft include these and other window function enhancements in their next release of SQL Server. For many, extending this feature set is one of their top priorities for the next implementation of Transact-SQL. Until then, we must be satisfied with the limited functionality that’s currently supported, which includes the ability to rank and aggregate partitioned data. That said, the features I’ve shown you can be useful tools when you need to return this type of data. For more information on how to implement window functions, particularly the OVER clause, check out the topic “OVER Clause (Transact-SQL)” in SQL Server Books Online.