DAX Statistical Functions

Following on from his first four articles on using Data Analysis Expressions (DAX) with tabular databases, Robert Sheldon dives into some of the DAX statistical functions available, demonstrating which are the most useful and examples of how they work.

The Data Analysis Expressions (DAX) language includes a wide range of functions that help you refine your queries when retrieving data from a SQL Server Analysis Services (SSAS) tabular database. You’ve seen numerous examples of DAX functions throughout this series on the tabular model. One set of functions that are particularly useful when working with tabular data are the statistical functions, which support various ways to aggregate and analyze data. Statistical functions include not only the typical standbys, such as Min and Max, but also those that help structure that data into tables, such as AddColumns and Summarize.

In this article, the fifth in our series, we explore many of the DAX statistical functions and provide numerous examples that demonstrate how they work. As with the previous articles (one | two | three | four), the examples retrieve data from the AdventureWorks Tabular Model SQL 2012 tabular database, available as a SQL Server Data Tools (SSDT) tabular project from the AdventureWorks CodePlex site. I implemented the database on a local instance of SSAS 2012 in tabular mode and developed the examples against that database.

To follow along in this article, you should have a basic understanding of how to write a DAX query. The first two articles provide a good starting point if you’re new to DAX and SSAS tabular databases. Once you’ve read through those, you should have no problem applying the statistical functions we cover here.

Given the analytical nature of tabular databases, it’s not surprising that the statistical functions play such a key role; learning them is well worth the effort.

The Row Function

Often when retrieving and aggregating tabular data, you’ll want to return only a single row in order to get a quick overview of your data. That’s where the Row function comes in. It returns a single-row table made up of one or more columns. You can use the function in conjunction with other functions in order to view different types of information. For example, the following query uses the Sum and Format functions in conjunction with the Row function in order to return the total amount of sales from the Sales Amount column in the Internet Sales table:

As you’ve seen in the previous articles, you must use the Evaluate statement to retrieve data from an SSAS tabular database. The statement takes one argument: a table or table expression. In this example, we use the Row function to create the table expression.

For each column you want the Row function to return, you must specify the column name in double quotes followed by an expression that defines the value to be returned, with a comma separating the two components. In the example above, the Row function returns only one column named Total Sales, and the column expression uses the Sum function to return the sales total from Sales Amount. The expression then uses the Format function to return the value in a standard currency format (in US dollars). The following table shows the results returned by the Evaluate statement:

Total Sales

$29,358,677.22

If you want to return more than one column, when using the Row function, you must include a name/expression pair for each one, separated by a comma, as shown in the following example:

Each column definition follows the same structure: column name followed by column expression. In this case, the column definitions differ only in the column name and the statistical function being applied. The Average function returns the average sales amount, the Max function return the highest sales amount, and the Min function returns the lowest. As the following results show, the Evaluate statement now returns four columns, but still one row:

Total Sales Average Sales Highest Sale Lowest Sale
$29,358,677.22 $486.09 $3,578.27 $2.29

When using a statistical function such as Average, it can also be useful to know how many rows are being evaluated. The following example includes both the Count and DistinctCount statistical functions:

The Count function returns the total number of values in the specified column, and the DistinctCount function returns the number of unique values in that column. The Count function does not work for string values, only number and date columns. Because we’re using the function on the Sales Amount column, it works fine because the column contains numerical data only. The following table shows the results now returned by the Evaluate statement:

Total Sales Average Sales Count Distinct Count
$29,358,677.22 $486.09 60398 42

When working with the Count function, be aware that it counts only existing values. If a row is missing a value, that row is not counted. To get a count of all the rows in a table, you must use the CountRows statistical function, as shown in the following example:

In this case, the StandardCost column in the Product table contains a number of empty cells, so a count based on that column returns a number less than the number of rows in the table, but the CountRows functions includes all rows, as the following results show:

Average Cost Count Distinct Count Row Count
$434.27 395 135 606

As you’ve seen, the Row function can be useful whenever you need to return only a single row, which is often the case when aggregating data. However, the other statistical functions we’ve looked at are by no means limited to the Row function. For example, you can use them with the Summarize function when grouping data. Nevertheless, more often than not you’ll find that if you use the Row function, you’ll be using the aggregate-type statistical functions as well.

The Topn Function

The Topn function limits the rows returned by a query to the top number of rows specified when you call the function, similar to how a TOP expression works in T-SQL. When you use the Topn function, you must include three parameters. The first specifies the number of rows to return. The second is the table or table expression from which you’re extracting the top rows. And the third is an expression that specifies the order in which the rows in the table should be sorted before returning the top rows. Usually this is one or more columns in the target table.

For example, suppose you want to return the first five rows of the Sales Territory table, with the data sorted by the SalesTerritoryKey column. The following example shows how to call the Topn function and pass in the parameter values:

Once again, we start with an Evaluate statement and specify the Topn function as the statement’s table expression. Into the function, we pass in the three required parameter values, which specify that the top five rows should be returned from the Sales Territory table, sorted by the SalesTerritoryKey column. The Evaluate statement returns the following data:

SalesTerritoryKey

Sales Territory Region

Sales Territory Country

Sales Territory Group

Sales Territory Image

7

France

France

Europe

System.Byte[]

9

Australia

Australia

Pacific

System.Byte[]

8

Germany

Germany

Europe

System.Byte[]

10

United Kingdom

United Kingdom

Europe

System.Byte[]

11

NA

NA

NA

System.Byte[]

One thing that might seem odd about the results is that they return rows with the SalesTerritoryKey values of 7 through 11. If you’re familiar with the data, you know that the first value is 1. We’ll get into more details about this in a bit, but for now, know that the Topn function, by default, sorts the data in descending order, unlike most sort operations you’ll come across.

Another odd aspect of these results is that they’re not ordered according to the SalesTerritoryKey column, as you might expect. Although you have to specify the order when calling the Topn function, that order applies only when determining which rows to return, not when ordering the result set. For that, you need to add an OrderBy clause that specifies the SalesTerritoryKey column, as shown in the following example:

Now our statement will return the data in the specified order, as the following results show:

SalesTerritoryKey

Sales Territory Region

Sales Territory Country

Sales Territory Group

Sales Territory Image

7

France

France

Europe

System.Byte[]

8

Germany

Germany

Europe

System.Byte[]

9

Australia

Australia

Pacific

System.Byte[]

10

United Kingdom

United Kingdom

Europe

System.Byte[]

11

NA

NA

NA

System.Byte[]

The Topn function actually supports a fourth parameter, which determines whether the rows are sorted in ascending or descending order. As mentioned above, by default, they’re sorted in descending order. However, you can specify ascending or descending my adding a 0 for descending or a 1 for ascending. For example, the following Evaluate statement returns the top five rows in ascending order:

The only difference between this example and the previous one is the addition of the fourth parameter for the Topn function. Now the statement returns the first five rows in the table, as shown in the following results:

SalesTerritoryKey

Sales Territory Region

Sales Territory Country

Sales Territory Group

Sales Territory Image

1

Northwest

United States

North America

System.Byte[]

2

Northeast

United States

North America

System.Byte[]

3

Central

United States

North America

System.Byte[]

4

Southwest

United States

North America

System.Byte[]

5

Southeast

United States

North America

System.Byte[]

You can also use the Topn function in conjunction with other functions. For example, the following Evaluate statement returns the top five rows from the table returned by the Summarize function:

In the previous examples, I specified the SalesTerritory table as the second parameter of the Topn function, but this time around I’m using the Summarize function as the table expression. The function returns the Product Name and Calendar Year columns, along with a calculated column that provides a sales total for each group of data. I then use the Topn function to return only the first five rows of the data returned by the Summarize function. Notice that I specify the calculated column (SalesAmount) as the sort column, in descending order. The following table shows the results returned by the Evaluate statement:

Product Name

Calendar Year

Sales Amount

Mountain-100 Silver, 38

2006

$98,599.71

Mountain-100 Silver, 38

2005

$98,599.71

Mountain-100 Black, 44

2005

$97,874.71

Mountain-100 Black, 38

2006

$94,499.72

Mountain-100 Black, 42

2006

$91,124.73

As you can see, the data now includes only the three columns specified in the Summarize function, with the data sorted in descending order, based on the SalesAmount column.

The Rollup Function

You’ve already gotten a taste of the Rollup function in the second article of this series. This time around, however, we’ll go into the function a little deeper and look at a couple other functions associated with Rollup.

The Rollup function is used in conjunction with the Summarize function. The Summarize function groups data together by the columns specified when you call the function. The function then returns a table that can be used by other functions, wherever a parameter requires a table value. You can also use the function as the table expression in an Evaluate statement.

The Summarize function supports the use of the Rollup function as part of the Summarize definition. The Rollup function adds roll-up rows to the returned table based on the columns used to group the data. The roll-up rows provide an additional layer of aggregation, above what you’ve already defined in your Summarize function. Let’s look at example to better understand how this works. In the following Evaluate statement, we use the Summarize function as the statement’s table expression:

The first parameter in the Summarize function requires its own table or table expression as its value. In this case, I’ve used a CalculateTable function to create that expression. The function returns the Internet Sales table, but limits the rows returned to those with a Product Category Name value equal to Accessories.

The second parameter value in the Summarize function specifies the Product Name column, which means the data will be grouped based on the values in that column.

The third value specifies the Calendar Year column. As a result, each product will be grouped by the year it was sold. However, I’ve also included the Rollup function in this expression. That means an additional row will be returned for each product that rolls up the data into a total for all years.

The last parameter value in the Summarize function is a calculated column named Total Sales. The column uses the Sum aggregate function to total the Sales Amount values for each grouping. I’ve also used the Format function to return the data in a currency format. The following table shows the results returned by the Evaluate statement:

Product Name

Calendar Year

Total Sales

All-Purpose Bike Stand

$39,591.00

All-Purpose Bike Stand

2007

$18,921.00

All-Purpose Bike Stand

2008

$20,670.00

Bike Wash – Dissolver

$7,218.60

Bike Wash – Dissolver

2007

$3,044.85

Bike Wash – Dissolver

2008

$4,173.75

Fender Set – Mountain

$46,619.58

Fender Set – Mountain

2007

$19,408.34

Fender Set – Mountain

2008

$27,211.24

Hitch Rack – 4-Bike

$39,360.00

Hitch Rack – 4-Bike

2007

$16,440.00

Hitch Rack – 4-Bike

2008

$22,920.00

HL Mountain Tire

$48,860.00

HL Mountain Tire

2007

$20,300.00

HL Mountain Tire

2008

$28,560.00

HL Road Tire

$27,970.80

HL Road Tire

2007

$12,877.00

HL Road Tire

2008

$15,093.80

Hydration Pack – 70 oz.

$40,307.67

Hydration Pack – 70 oz.

2007

$16,771.95

Hydration Pack – 70 oz.

2008

$23,535.72

As with the previous examples, the results shown here are only part of the returned data, but you can see that totals are returned for each product sold and year it was sold. You can also see the roll-up row for each product-the rows without a Calendar Year value.

There might be times when you want to mark in some way when a row is a roll-up, other than just depending on blank values in the roll-up column. DAX includes the IsSubtotal function that you can use in conjunction with the Rollup function to create a column that indicates when a row is a roll-up. The IsSubtotal function returns a value of True if the row is a roll-up. Otherwise, the function returns False. The following example demonstrates how to use the IsSubtotal function by creating an additional column named IsRollup:

The Evaluate statement is the same as the one shown in the preceding example, except for the addition of the Is Rollup column. Notice that the expression that defines the column includes the IsSubtotal function. When calling the function, you need to include the name of the column specified in the Rollup function, in this case the Calendar Year column. The Evaluate statement returns the following results:

Product Name

Calendar Year

Total Sales

Is Rollup

All-Purpose Bike Stand

$39,591.00

True

All-Purpose Bike Stand

2007

$18,921.00

False

All-Purpose Bike Stand

2008

$20,670.00

False

Bike Wash – Dissolver

$7,218.60

True

Bike Wash – Dissolver

2007

$3,044.85

False

Bike Wash – Dissolver

2008

$4,173.75

False

Fender Set – Mountain

$46,619.58

True

Fender Set – Mountain

2007

$19,408.34

False

Fender Set – Mountain

2008

$27,211.24

False

Hitch Rack – 4-Bike

$39,360.00

True

Hitch Rack – 4-Bike

2007

$16,440.00

False

Hitch Rack – 4-Bike

2008

$22,920.00

False

HL Mountain Tire

$48,860.00

True

HL Mountain Tire

2007

$20,300.00

False

HL Mountain Tire

2008

$28,560.00

False

HL Road Tire

$27,970.80

True

HL Road Tire

2007

$12,877.00

False

HL Road Tire

2008

$15,093.80

False

Hydration Pack – 70 oz.

$40,307.67

True

Hydration Pack – 70 oz.

2007

$16,771.95

False

Hydration Pack – 70 oz.

2008

$23,535.72

False

The result set now includes the IsRollup column and is populated with True and False values, with each roll-up row receiving a value of True.

Now suppose that instead of grouping our data by product and then year sold, we group it by product category, then product subcategory, and finally the year sold, as shown in the following example:

As in the previous examples, we’re rolling up our data based on the CalendarYear column. Consequently, the statement will provide a roll-up row for each category/subcategory grouping, as shown in the following results:

Product Category Name

Product Subcategory Name

Calendar Year

Total Sales

Is Rollup

Accessories

Bike Racks

$39,360.00

True

Accessories

Bike Racks

2007

$16,440.00

False

Accessories

Bike Racks

2008

$22,920.00

False

Accessories

Bike Stands

$39,591.00

True

Accessories

Bike Stands

2007

$18,921.00

False

Accessories

Bike Stands

2008

$20,670.00

False

Accessories

Bottles and Cages

$56,798.19

True

Accessories

Bottles and Cages

2007

$23,280.27

False

Accessories

Bottles and Cages

2008

$33,517.92

False

Accessories

Cleaners

$7,218.60

True

Accessories

Cleaners

2007

$3,044.85

False

Accessories

Cleaners

2008

$4,173.75

False

Accessories

Fenders

$46,619.58

True

Accessories

Fenders

2007

$19,408.34

False

Accessories

Fenders

2008

$27,211.24

False

Accessories

Helmets

$225,335.60

True

Accessories

Helmets

2007

$92,583.54

False

Accessories

Helmets

2008

$132,752.06

False

Accessories

Hydration Packs

$40,307.67

True

Accessories

Hydration Packs

2007

$16,771.95

False

Accessories

Hydration Packs

2008

$23,535.72

False

Accessories

Tires and Tubes

$245,529.32

True

Accessories

Tires and Tubes

2007

$103,259.76

False

Accessories

Tires and Tubes

2008

$142,269.56

False

Bikes

Mountain Bikes

$9,952,759.56

True

Bikes

Mountain Bikes

2005

$585,973.27

False

Bikes

Mountain Bikes

2006

$1,562,456.76

False

Bikes

Mountain Bikes

2007

$3,989,638.48

False

Bikes

Mountain Bikes

2008

$3,814,691.06

False

This level of roll-up is probably fine for some of your needs, but at times, you might find it handy to return a roll-up row for each category, as well as each category/subcategory grouping. The easiest way to do that is to add the ProductSubcategoryName column to the Rollup function, as shown in the following example:

When adding an additional column to the Rollup function, you must separate the column names with a comma, but that’s all you’ll need to do. As the following table shows, the results now include a roll-up row for each product category:

Product Category Name

Product Subcategory Name

Calendar Year

Total Sales

Is Rollup

Accessories

$700,759.96

True

Accessories

Bike Racks

$39,360.00

True

Accessories

Bike Racks

2007

$16,440.00

False

Accessories

Bike Racks

2008

$22,920.00

False

Accessories

Bike Stands

$39,591.00

True

Accessories

Bike Stands

2007

$18,921.00

False

Accessories

Bike Stands

2008

$20,670.00

False

Accessories

Bottles and Cages

$56,798.19

True

Accessories

Bottles and Cages

2007

$23,280.27

False

Accessories

Bottles and Cages

2008

$33,517.92

False

Accessories

Cleaners

$7,218.60

True

Accessories

Cleaners

2007

$3,044.85

False

Accessories

Cleaners

2008

$4,173.75

False

Accessories

Fenders

$46,619.58

True

Accessories

Fenders

2007

$19,408.34

False

Accessories

Fenders

2008

$27,211.24

False

Accessories

Helmets

$225,335.60

True

Accessories

Helmets

2007

$92,583.54

False

Accessories

Helmets

2008

$132,752.06

False

Accessories

Hydration Packs

$40,307.67

True

Accessories

Hydration Packs

2007

$16,771.95

False

Accessories

Hydration Packs

2008

$23,535.72

False

Accessories

Tires and Tubes

$245,529.32

True

Accessories

Tires and Tubes

2007

$103,259.76

False

Accessories

Tires and Tubes

2008

$142,269.56

False

Bikes

$28,318,144.65

True

Bikes

Mountain Bikes

$9,952,759.56

True

Bikes

Mountain Bikes

2005

$585,973.27

False

Bikes

Mountain Bikes

2006

$1,562,456.76

False

Bikes

Mountain Bikes

2007

$3,989,638.48

False

Bikes

Mountain Bikes

2008

$3,814,691.06

False

The Summarize function supports one other related function worth mentioning: RollupGroup. When used in conjunction with the Rollup function, the RollupGroup function prevents partial subtotals from being included in the result set. Let’s look at another example to demonstrate how this works. In the following Evaluate statement, I’ve added the RollupGroup function as part of the Rollup function:

The column names are passed into the RollupGroup function, and that function is passed to the Rollup function. As to be expected, the statement now returns slightly different results:

Product Category Name

Product Subcategory Name

Calendar Year

Total Sales

Is Rollup

Accessories

$700,759.96

True

Accessories

Bike Racks

2007

$16,440.00

False

Accessories

Bike Racks

2008

$22,920.00

False

Accessories

Bike Stands

2007

$18,921.00

False

Accessories

Bike Stands

2008

$20,670.00

False

Accessories

Bottles and Cages

2007

$23,280.27

False

Accessories

Bottles and Cages

2008

$33,517.92

False

Accessories

Cleaners

2007

$3,044.85

False

Accessories

Cleaners

2008

$4,173.75

False

Accessories

Fenders

2007

$19,408.34

False

Accessories

Fenders

2008

$27,211.24

False

Accessories

Helmets

2007

$92,583.54

False

Accessories

Helmets

2008

$132,752.06

False

Accessories

Hydration Packs

2007

$16,771.95

False

Accessories

Hydration Packs

2008

$23,535.72

False

Accessories

Tires and Tubes

2007

$103,259.76

False

Accessories

Tires and Tubes

2008

$142,269.56

False

Bikes

$28,318,144.65

True

Bikes

Mountain Bikes

2005

$585,973.27

False

Bikes

Mountain Bikes

2006

$1,562,456.76

False

Bikes

Mountain Bikes

2007

$3,989,638.48

False

Bikes

Mountain Bikes

2008

$3,814,691.06

False

A roll-up row is included for each category, but not for the subcategories. As you can see, DAX provides a number of ways to roll up your data so your results include exactly the information you need.

The Crossjoin Function

The last function that we’ll look at is Crossjoin. You use a Crossjoin function to join two or more tables in order to return a Cartesian product, a result set in which each row in each table is joined with each row in each other table. The result set is the product of the number of rows from each table. For example, if one table contains five rows and a second table contains seven rows, the Cartesian product for those two tables will contain 35 rows.

This will all become clearer as we work through our examples. But first, take a look at the following Evaluate statement, in which we retrieve data from the Internet Sales table:

The statement contains nothing that you haven’t seen before. It groups data based on the ProductCategory and CalendarYear columns and provides the sales totals for each group, as shown in the following results:

Product Category Name

Calendar Year

Total Sales

Accessories

2007

$293,709.71

Accessories

2008

$407,050.25

Bikes

2005

$3,266,373.66

Bikes

2006

$6,530,343.53

Bikes

2007

$9,359,102.62

Bikes

2008

$9,162,324.85

Clothing

2007

$138,247.97

Clothing

2008

$201,524.64

The Evaluate statement returns just what we’d expect. However, what’s important about these results are not what they include, but what is missing. For example, the Product Category column in the source table includes the value Components, but that value isn’t included in the results because no sales are associated with that category. In addition, years are displayed only if they have sales associated with them. For instance, the results include information about the Accessories category for 2007 and 2008, but no other years are listed.

Often, we’ll want our results to be more complete than what we’re able to generate with a simple Summarize function and for that we need the Crossjoin function. It returns a Cartesian product, and we can use that data to generate more complete results.

Let’s look at an example to give you an idea of how the Crossjoin function works. In the following Evaluate statement, I use the Crossjoin function as the statement’s table expression:

I pass two arguments into the Crossjoin function. Each argument must return a table. In this case, I use the Values function for each argument in order to return a one-column table of distinct values. The first Values function returns a list of values from the Product Category Name column, and the second Values function returns a list of values from the Calendar Year column. (These are the same two columns used to group data in the previous example.) The Evaluate statement returns the results shown in the following table:

Product Category Name

Calendar Year

2005

2006

2007

2008

2009

2010

Accessories

2005

Accessories

2006

Accessories

2007

Accessories

2008

Accessories

2009

Accessories

2010

Bikes

2005

Bikes

2006

Bikes

2007

Bikes

2008

Bikes

2009

Bikes

2010

Clothing

2005

Clothing

2006

Clothing

2007

Clothing

2008

Clothing

2009

Clothing

2010

Components

2005

Components

2006

Components

2007

Components

2008

Components

2009

Components

2010

Each row in each table has been joined with each row in the other table. For example, the Accessories value in the Product Category Name column is matched with each year in the Calendar Year column. The reason that there are blank values in the Product Category Name column is because the source data-the Product Category table in the tabular database-includes a blank member, and that blank member is matched with each year, just like the other Product Category Name values. Because the Product Category column contains five distinct values, and the Calendar Year column contains six distinct values, our query returns 30 rows.

Now that we know how to use the Crossjoin function to produce the Cartesian product, we can include it in our Evaluate statement to return more meaningful data. The following example uses the Crossjoin function as the table expression within the Summarize function:

The Summarize function is now working with a Cartesian product as its table. We can then group the data based on the ProductCategoryName column and the CalendarYear column, both of which are including in the Cartesian product. Now the Evaluate statement returns the following results:

Product Category Name

Calendar Year

Total Sales

2005

2006

2007

2008

2009

2010

Accessories

2005

Accessories

2006

Accessories

2007

$293,709.71

Accessories

2008

$407,050.25

Accessories

2009

Accessories

2010

Bikes

2005

$3,266,373.66

Bikes

2006

$6,530,343.53

Bikes

2007

$9,359,102.62

Bikes

2008

$9,162,324.85

Bikes

2009

Bikes

2010

Clothing

2005

Clothing

2006

Clothing

2007

$138,247.97

Clothing

2008

$201,524.64

Clothing

2009

Clothing

2010

Components

2005

Components

2006

Components

2007

Components

2008

Components

2009

Components

2010

As you can see, all the data is included in the results, whether or not any sales were made for a particular category in a particular year. This way, you know you’re returning the complete results, which can often give you a better picture of the data-and point to where potential problems might exist.

DAX Statistical Functions

You should now have a good sense of how to use the Row, Topn, Rollup, and Crossjoin functions, along with several of the aggregate statistical functions. Once you know the basics of how to use them, you should be able to apply them to other situations. DAX also supports a number of other statistical functions, as well as additional functions types, such as date and time functions, filter functions, and logical functions. Be sure to check out the TechNet article “DAX Function Reference” for information about those other functions as well as additional details about the functions we’ve covered.