DAX Statistical Functions

Share to social media

The Data Analysis Expressions (DAX) language includes a wide range of functions that help you refine your queries when retrieving data from a SQL Server Analysis Services (SSAS) tabular database. You’ve seen numerous examples of DAX functions throughout this series on the tabular model. One set of functions that are particularly useful when working with tabular data are the statistical functions, which support various ways to aggregate and analyze data. Statistical functions include not only the typical standbys, such as `Min` and `Max`, but also those that help structure that data into tables, such as `AddColumns` and `Summarize`.

In this article, the fifth in our series, we explore many of the DAX statistical functions and provide numerous examples that demonstrate how they work. As with the previous articles (one | two | three | four), the examples retrieve data from the `AdventureWorks Tabular Model SQL 2012` tabular database, available as a SQL Server Data Tools (SSDT) tabular project from the AdventureWorks CodePlex site. I implemented the database on a local instance of SSAS 2012 in tabular mode and developed the examples against that database.

To follow along in this article, you should have a basic understanding of how to write a DAX query. The first two articles provide a good starting point if you’re new to DAX and SSAS tabular databases. Once you’ve read through those, you should have no problem applying the statistical functions we cover here.

Given the analytical nature of tabular databases, it’s not surprising that the statistical functions play such a key role; learning them is well worth the effort.

The Row Function

Often when retrieving and aggregating tabular data, you’ll want to return only a single row in order to get a quick overview of your data. That’s where the `Row` function comes in. It returns a single-row table made up of one or more columns. You can use the function in conjunction with other functions in order to view different types of information. For example, the following query uses the `Sum` and `Format` functions in conjunction with the `Row` function in order to return the total amount of sales from the `Sales Amount` column in the `Internet Sales` table:

As you’ve seen in the previous articles, you must use the `Evaluate` statement to retrieve data from an SSAS tabular database. The statement takes one argument: a table or table expression. In this example, we use the `Row` function to create the table expression.

For each column you want the `Row` function to return, you must specify the column name in double quotes followed by an expression that defines the value to be returned, with a comma separating the two components. In the example above, the `Row` function returns only one column named `Total Sales`, and the column expression uses the `Sum` function to return the sales total from `Sales Amount`. The expression then uses the `Format` function to return the value in a standard currency format (in US dollars). The following table shows the results returned by the `Evaluate` statement:

 Total Sales \$29,358,677.22

If you want to return more than one column, when using the `Row` function, you must include a name/expression pair for each one, separated by a comma, as shown in the following example:

Each column definition follows the same structure: column name followed by column expression. In this case, the column definitions differ only in the column name and the statistical function being applied. The `Average` function returns the average sales amount, the `Max` function return the highest sales amount, and the `Min` function returns the lowest. As the following results show, the `Evaluate` statement now returns four columns, but still one row:

 Total Sales Average Sales Highest Sale Lowest Sale \$29,358,677.22 \$486.09 \$3,578.27 \$2.29

When using a statistical function such as `Average`, it can also be useful to know how many rows are being evaluated. The following example includes both the `Count` and `DistinctCount` statistical functions:

The `Count` function returns the total number of values in the specified column, and the `DistinctCount` function returns the number of unique values in that column. The `Count` function does not work for string values, only number and date columns. Because we’re using the function on the `Sales Amount` column, it works fine because the column contains numerical data only. The following table shows the results now returned by the `Evaluate` statement:

 Total Sales Average Sales Count Distinct Count \$29,358,677.22 \$486.09 60398 42

When working with the `Count` function, be aware that it counts only existing values. If a row is missing a value, that row is not counted. To get a count of all the rows in a table, you must use the `CountRows` statistical function, as shown in the following example:

In this case, the `Standard``Cost` column in the `Product` table contains a number of empty cells, so a count based on that column returns a number less than the number of rows in the table, but the `CountRows` functions includes all rows, as the following results show:

 Average Cost Count Distinct Count Row Count \$434.27 395 135 606

As you’ve seen, the `Row` function can be useful whenever you need to return only a single row, which is often the case when aggregating data. However, the other statistical functions we’ve looked at are by no means limited to the `Row` function. For example, you can use them with the `Summarize` function when grouping data. Nevertheless, more often than not you’ll find that if you use the `Row` function, you’ll be using the aggregate-type statistical functions as well.

The Topn Function

The `Topn` function limits the rows returned by a query to the top number of rows specified when you call the function, similar to how a `TOP` expression works in T-SQL. When you use the `Topn` function, you must include three parameters. The first specifies the number of rows to return. The second is the table or table expression from which you’re extracting the top rows. And the third is an expression that specifies the order in which the rows in the table should be sorted before returning the top rows. Usually this is one or more columns in the target table.

For example, suppose you want to return the first five rows of the `Sales Territory` table, with the data sorted by the `SalesTerritoryKey` column. The following example shows how to call the `Topn` function and pass in the parameter values:

Once again, we start with an `Evaluate` statement and specify the `Top``n` function as the statement’s table expression. Into the function, we pass in the three required parameter values, which specify that the top five rows should be returned from the `Sales Territory` table, sorted by the `SalesTerritoryKey` column. The `Evaluate` statement returns the following data:

 SalesTerritoryKey Sales Territory Region Sales Territory Country Sales Territory Group Sales Territory Image 7 France France Europe System.Byte[] 9 Australia Australia Pacific System.Byte[] 8 Germany Germany Europe System.Byte[] 10 United Kingdom United Kingdom Europe System.Byte[] 11 NA NA NA System.Byte[]

One thing that might seem odd about the results is that they return rows with the `SalesTerritoryKey` values of `7` through `11`. If you’re familiar with the data, you know that the first value is `1`. We’ll get into more details about this in a bit, but for now, know that the `Topn` function, by default, sorts the data in descending order, unlike most sort operations you’ll come across.

Another odd aspect of these results is that they’re not ordered according to the `SalesTerritoryKey` column, as you might expect. Although you have to specify the order when calling the `Topn` function, that order applies only when determining which rows to return, not when ordering the result set. For that, you need to add an `Order``By` clause that specifies the `SalesTerritoryKey` column, as shown in the following example:

Now our statement will return the data in the specified order, as the following results show:

 SalesTerritoryKey Sales Territory Region Sales Territory Country Sales Territory Group Sales Territory Image 7 France France Europe System.Byte[] 8 Germany Germany Europe System.Byte[] 9 Australia Australia Pacific System.Byte[] 10 United Kingdom United Kingdom Europe System.Byte[] 11 NA NA NA System.Byte[]

The `Topn` function actually supports a fourth parameter, which determines whether the rows are sorted in ascending or descending order. As mentioned above, by default, they’re sorted in descending order. However, you can specify ascending or descending my adding a `0` for descending or a `1` for ascending. For example, the following `Evaluate` statement returns the top five rows in ascending order:

The only difference between this example and the previous one is the addition of the fourth parameter for the `Topn` function. Now the statement returns the first five rows in the table, as shown in the following results:

 SalesTerritoryKey Sales Territory Region Sales Territory Country Sales Territory Group Sales Territory Image 1 Northwest United States North America System.Byte[] 2 Northeast United States North America System.Byte[] 3 Central United States North America System.Byte[] 4 Southwest United States North America System.Byte[] 5 Southeast United States North America System.Byte[]

You can also use the `Topn` function in conjunction with other functions. For example, the following `Evaluate` statement returns the top five rows from the table returned by the `Summarize` function:

In the previous examples, I specified the `Sales``Territory` table as the second parameter of the `Topn` function, but this time around I’m using the `Summarize` function as the table expression. The function returns the `Product Name` and `Calendar Year` columns, along with a calculated column that provides a sales total for each group of data. I then use the `Topn` function to return only the first five rows of the data returned by the `Summarize` function. Notice that I specify the calculated column (`Sales``Amount`) as the sort column, in descending order. The following table shows the results returned by the `Evaluate` statement:

 Product Name Calendar Year Sales Amount Mountain-100 Silver, 38 2006 \$98,599.71 Mountain-100 Silver, 38 2005 \$98,599.71 Mountain-100 Black, 44 2005 \$97,874.71 Mountain-100 Black, 38 2006 \$94,499.72 Mountain-100 Black, 42 2006 \$91,124.73

As you can see, the data now includes only the three columns specified in the `Summarize` function, with the data sorted in descending order, based on the `Sales``Amount` column.

The Rollup Function

You’ve already gotten a taste of the `Rollup` function in the second article of this series. This time around, however, we’ll go into the function a little deeper and look at a couple other functions associated with `Rollup`.

The `Rollup` function is used in conjunction with the `Summarize` function. The `Summarize` function groups data together by the columns specified when you call the function. The function then returns a table that can be used by other functions, wherever a parameter requires a table value. You can also use the function as the table expression in an `Evaluate` statement.

The `Summarize` function supports the use of the `Rollup` function as part of the `Summarize` definition. The `Rollup` function adds roll-up rows to the returned table based on the columns used to group the data. The roll-up rows provide an additional layer of aggregation, above what you’ve already defined in your `Summarize` function. Let’s look at example to better understand how this works. In the following `Evaluate` statement, we use the `Summarize` function as the statement’s table expression:

The first parameter in the `Summarize` function requires its own table or table expression as its value. In this case, I’ve used a `CalculateTable` function to create that expression. The function returns the `Internet Sales` table, but limits the rows returned to those with a `Product Category Name` value equal to `Accessories`.

The second parameter value in the `Summarize` function specifies the `Product Name` column, which means the data will be grouped based on the values in that column.

The third value specifies the `Calendar Year` column. As a result, each product will be grouped by the year it was sold. However, I’ve also included the `Rollup` function in this expression. That means an additional row will be returned for each product that rolls up the data into a total for all years.

The last parameter value in the `Summarize` function is a calculated column named `Total Sales`. The column uses the `Sum` aggregate function to total the `Sales Amount` values for each grouping. I’ve also used the `F``ormat` function to return the data in a currency format. The following table shows the results returned by the `Evaluate` statement:

 Product Name Calendar Year Total Sales All-Purpose Bike Stand \$39,591.00 All-Purpose Bike Stand 2007 \$18,921.00 All-Purpose Bike Stand 2008 \$20,670.00 Bike Wash – Dissolver \$7,218.60 Bike Wash – Dissolver 2007 \$3,044.85 Bike Wash – Dissolver 2008 \$4,173.75 Fender Set – Mountain \$46,619.58 Fender Set – Mountain 2007 \$19,408.34 Fender Set – Mountain 2008 \$27,211.24 Hitch Rack – 4-Bike \$39,360.00 Hitch Rack – 4-Bike 2007 \$16,440.00 Hitch Rack – 4-Bike 2008 \$22,920.00 HL Mountain Tire \$48,860.00 HL Mountain Tire 2007 \$20,300.00 HL Mountain Tire 2008 \$28,560.00 HL Road Tire \$27,970.80 HL Road Tire 2007 \$12,877.00 HL Road Tire 2008 \$15,093.80 Hydration Pack – 70 oz. \$40,307.67 Hydration Pack – 70 oz. 2007 \$16,771.95 Hydration Pack – 70 oz. 2008 \$23,535.72

As with the previous examples, the results shown here are only part of the returned data, but you can see that totals are returned for each product sold and year it was sold. You can also see the roll-up row for each product-the rows without a `Calendar Year` value.

There might be times when you want to mark in some way when a row is a roll-up, other than just depending on blank values in the roll-up column. DAX includes the `IsSubtotal` function that you can use in conjunction with the `Rollup` function to create a column that indicates when a row is a roll-up. The `IsSubtotal` function returns a value of `True` if the row is a roll-up. Otherwise, the function returns `False`. The following example demonstrates how to use the `IsSubtotal` function by creating an additional column named `Is``Rollup`:

The `Evaluate` statement is the same as the one shown in the preceding example, except for the addition of the `Is Rollup` column. Notice that the expression that defines the column includes the `IsSubtotal` function. When calling the function, you need to include the name of the column specified in the `Rollup` function, in this case the `Calendar Year` column. The `Evaluate` statement returns the following results:

 Product Name Calendar Year Total Sales Is Rollup All-Purpose Bike Stand \$39,591.00 True All-Purpose Bike Stand 2007 \$18,921.00 False All-Purpose Bike Stand 2008 \$20,670.00 False Bike Wash – Dissolver \$7,218.60 True Bike Wash – Dissolver 2007 \$3,044.85 False Bike Wash – Dissolver 2008 \$4,173.75 False Fender Set – Mountain \$46,619.58 True Fender Set – Mountain 2007 \$19,408.34 False Fender Set – Mountain 2008 \$27,211.24 False Hitch Rack – 4-Bike \$39,360.00 True Hitch Rack – 4-Bike 2007 \$16,440.00 False Hitch Rack – 4-Bike 2008 \$22,920.00 False HL Mountain Tire \$48,860.00 True HL Mountain Tire 2007 \$20,300.00 False HL Mountain Tire 2008 \$28,560.00 False HL Road Tire \$27,970.80 True HL Road Tire 2007 \$12,877.00 False HL Road Tire 2008 \$15,093.80 False Hydration Pack – 70 oz. \$40,307.67 True Hydration Pack – 70 oz. 2007 \$16,771.95 False Hydration Pack – 70 oz. 2008 \$23,535.72 False

The result set now includes the `Is``Rollup` column and is populated with `True` and `False` values, with each roll-up row receiving a value of `True`.

Now suppose that instead of grouping our data by product and then year sold, we group it by product category, then product subcategory, and finally the year sold, as shown in the following example:

As in the previous examples, we’re rolling up our data based on the `Calendar``Year` column. Consequently, the statement will provide a roll-up row for each category/subcategory grouping, as shown in the following results:

 Product Category Name Product Subcategory Name Calendar Year Total Sales Is Rollup Accessories Bike Racks \$39,360.00 True Accessories Bike Racks 2007 \$16,440.00 False Accessories Bike Racks 2008 \$22,920.00 False Accessories Bike Stands \$39,591.00 True Accessories Bike Stands 2007 \$18,921.00 False Accessories Bike Stands 2008 \$20,670.00 False Accessories Bottles and Cages \$56,798.19 True Accessories Bottles and Cages 2007 \$23,280.27 False Accessories Bottles and Cages 2008 \$33,517.92 False Accessories Cleaners \$7,218.60 True Accessories Cleaners 2007 \$3,044.85 False Accessories Cleaners 2008 \$4,173.75 False Accessories Fenders \$46,619.58 True Accessories Fenders 2007 \$19,408.34 False Accessories Fenders 2008 \$27,211.24 False Accessories Helmets \$225,335.60 True Accessories Helmets 2007 \$92,583.54 False Accessories Helmets 2008 \$132,752.06 False Accessories Hydration Packs \$40,307.67 True Accessories Hydration Packs 2007 \$16,771.95 False Accessories Hydration Packs 2008 \$23,535.72 False Accessories Tires and Tubes \$245,529.32 True Accessories Tires and Tubes 2007 \$103,259.76 False Accessories Tires and Tubes 2008 \$142,269.56 False Bikes Mountain Bikes \$9,952,759.56 True Bikes Mountain Bikes 2005 \$585,973.27 False Bikes Mountain Bikes 2006 \$1,562,456.76 False Bikes Mountain Bikes 2007 \$3,989,638.48 False Bikes Mountain Bikes 2008 \$3,814,691.06 False

This level of roll-up is probably fine for some of your needs, but at times, you might find it handy to return a roll-up row for each category, as well as each category/subcategory grouping. The easiest way to do that is to add the `Product``Subcategory``Name` column to the `Rollup` function, as shown in the following example:

When adding an additional column to the `Rollup` function, you must separate the column names with a comma, but that’s all you’ll need to do. As the following table shows, the results now include a roll-up row for each product category:

 Product Category Name Product Subcategory Name Calendar Year Total Sales Is Rollup Accessories \$700,759.96 True Accessories Bike Racks \$39,360.00 True Accessories Bike Racks 2007 \$16,440.00 False Accessories Bike Racks 2008 \$22,920.00 False Accessories Bike Stands \$39,591.00 True Accessories Bike Stands 2007 \$18,921.00 False Accessories Bike Stands 2008 \$20,670.00 False Accessories Bottles and Cages \$56,798.19 True Accessories Bottles and Cages 2007 \$23,280.27 False Accessories Bottles and Cages 2008 \$33,517.92 False Accessories Cleaners \$7,218.60 True Accessories Cleaners 2007 \$3,044.85 False Accessories Cleaners 2008 \$4,173.75 False Accessories Fenders \$46,619.58 True Accessories Fenders 2007 \$19,408.34 False Accessories Fenders 2008 \$27,211.24 False Accessories Helmets \$225,335.60 True Accessories Helmets 2007 \$92,583.54 False Accessories Helmets 2008 \$132,752.06 False Accessories Hydration Packs \$40,307.67 True Accessories Hydration Packs 2007 \$16,771.95 False Accessories Hydration Packs 2008 \$23,535.72 False Accessories Tires and Tubes \$245,529.32 True Accessories Tires and Tubes 2007 \$103,259.76 False Accessories Tires and Tubes 2008 \$142,269.56 False Bikes \$28,318,144.65 True Bikes Mountain Bikes \$9,952,759.56 True Bikes Mountain Bikes 2005 \$585,973.27 False Bikes Mountain Bikes 2006 \$1,562,456.76 False Bikes Mountain Bikes 2007 \$3,989,638.48 False Bikes Mountain Bikes 2008 \$3,814,691.06 False

The `Summarize` function supports one other related function worth mentioning: `RollupGroup`. When used in conjunction with the `Rollup` function, the `RollupGroup` function prevents partial subtotals from being included in the result set. Let’s look at another example to demonstrate how this works. In the following `Evaluate` statement, I’ve added the `RollupGroup` function as part of the `Rollup` function:

The column names are passed into the `RollupGroup` function, and that function is passed to the `Rollup` function. As to be expected, the statement now returns slightly different results:

 Product Category Name Product Subcategory Name Calendar Year Total Sales Is Rollup Accessories \$700,759.96 True Accessories Bike Racks 2007 \$16,440.00 False Accessories Bike Racks 2008 \$22,920.00 False Accessories Bike Stands 2007 \$18,921.00 False Accessories Bike Stands 2008 \$20,670.00 False Accessories Bottles and Cages 2007 \$23,280.27 False Accessories Bottles and Cages 2008 \$33,517.92 False Accessories Cleaners 2007 \$3,044.85 False Accessories Cleaners 2008 \$4,173.75 False Accessories Fenders 2007 \$19,408.34 False Accessories Fenders 2008 \$27,211.24 False Accessories Helmets 2007 \$92,583.54 False Accessories Helmets 2008 \$132,752.06 False Accessories Hydration Packs 2007 \$16,771.95 False Accessories Hydration Packs 2008 \$23,535.72 False Accessories Tires and Tubes 2007 \$103,259.76 False Accessories Tires and Tubes 2008 \$142,269.56 False Bikes \$28,318,144.65 True Bikes Mountain Bikes 2005 \$585,973.27 False Bikes Mountain Bikes 2006 \$1,562,456.76 False Bikes Mountain Bikes 2007 \$3,989,638.48 False Bikes Mountain Bikes 2008 \$3,814,691.06 False

A roll-up row is included for each category, but not for the subcategories. As you can see, DAX provides a number of ways to roll up your data so your results include exactly the information you need.

The Crossjoin Function

The last function that we’ll look at is `Crossjoin`. You use a `Crossjoin` function to join two or more tables in order to return a Cartesian product, a result set in which each row in each table is joined with each row in each other table. The result set is the product of the number of rows from each table. For example, if one table contains five rows and a second table contains seven rows, the Cartesian product for those two tables will contain 35 rows.

This will all become clearer as we work through our examples. But first, take a look at the following `Evaluate` statement, in which we retrieve data from the `Internet Sales` table:

The statement contains nothing that you haven’t seen before. It groups data based on the `Product``Category` and `Calendar``Year` columns and provides the sales totals for each group, as shown in the following results:

 Product Category Name Calendar Year Total Sales Accessories 2007 \$293,709.71 Accessories 2008 \$407,050.25 Bikes 2005 \$3,266,373.66 Bikes 2006 \$6,530,343.53 Bikes 2007 \$9,359,102.62 Bikes 2008 \$9,162,324.85 Clothing 2007 \$138,247.97 Clothing 2008 \$201,524.64

The `Evaluate` statement returns just what we’d expect. However, what’s important about these results are not what they include, but what is missing. For example, the `Product Category` column in the source table includes the value `Components`, but that value isn’t included in the results because no sales are associated with that category. In addition, years are displayed only if they have sales associated with them. For instance, the results include information about the Accessories category for 2007 and 2008, but no other years are listed.

Often, we’ll want our results to be more complete than what we’re able to generate with a simple `Summarize` function and for that we need the `Crossjoin` function. It returns a Cartesian product, and we can use that data to generate more complete results.

Let’s look at an example to give you an idea of how the `Crossjoin` function works. In the following `Evaluate` statement, I use the `Crossjoin` function as the statement’s table expression:

I pass two arguments into the `Crossjoin` function. Each argument must return a table. In this case, I use the `Values` function for each argument in order to return a one-column table of distinct values. The first `Values` function returns a list of values from the `Product Category Name` column, and the second `Values` function returns a list of values from the `Calendar Year` column. (These are the same two columns used to group data in the previous example.) The `Evaluate` statement returns the results shown in the following table:

 Product Category Name Calendar Year 2005 2006 2007 2008 2009 2010 Accessories 2005 Accessories 2006 Accessories 2007 Accessories 2008 Accessories 2009 Accessories 2010 Bikes 2005 Bikes 2006 Bikes 2007 Bikes 2008 Bikes 2009 Bikes 2010 Clothing 2005 Clothing 2006 Clothing 2007 Clothing 2008 Clothing 2009 Clothing 2010 Components 2005 Components 2006 Components 2007 Components 2008 Components 2009 Components 2010

Each row in each table has been joined with each row in the other table. For example, the `Accessories` value in the `Product Category Name` column is matched with each year in the `Calendar Year` column. The reason that there are blank values in the `Product Category Name` column is because the source data-the `Product Category` table in the tabular database-includes a blank member, and that blank member is matched with each year, just like the other `Product Category Name` values. Because the `Product Category` column contains five distinct values, and the `Calendar Year` column contains six distinct values, our query returns 30 rows.

Now that we know how to use the `Crossjoin` function to produce the Cartesian product, we can include it in our `Evaluate` statement to return more meaningful data. The following example uses the `Crossjoin` function as the table expression within the `Summarize` function:

The `Summarize` function is now working with a `Cartesian` product as its table. We can then group the data based on the `Product``Category``Name` column and the `Calendar``Year` column, both of which are including in the Cartesian product. Now the `Evaluate` statement returns the following results:

 Product Category Name Calendar Year Total Sales 2005 2006 2007 2008 2009 2010 Accessories 2005 Accessories 2006 Accessories 2007 \$293,709.71 Accessories 2008 \$407,050.25 Accessories 2009 Accessories 2010 Bikes 2005 \$3,266,373.66 Bikes 2006 \$6,530,343.53 Bikes 2007 \$9,359,102.62 Bikes 2008 \$9,162,324.85 Bikes 2009 Bikes 2010 Clothing 2005 Clothing 2006 Clothing 2007 \$138,247.97 Clothing 2008 \$201,524.64 Clothing 2009 Clothing 2010 Components 2005 Components 2006 Components 2007 Components 2008 Components 2009 Components 2010

As you can see, all the data is included in the results, whether or not any sales were made for a particular category in a particular year. This way, you know you’re returning the complete results, which can often give you a better picture of the data-and point to where potential problems might exist.

DAX Statistical Functions

You should now have a good sense of how to use the `Row`, `Topn`, `Rollup`, and `Crossjoin` functions, along with several of the aggregate statistical functions. Once you know the basics of how to use them, you should be able to apply them to other situations. DAX also supports a number of other statistical functions, as well as additional functions types, such as date and time functions, filter functions, and logical functions. Be sure to check out the TechNet article “DAX Function Reference” for information about those other functions as well as additional details about the functions we’ve covered.

Robert Sheldon

See Profile

Robert is a freelance technology writer based in the Pacific Northwest. He’s worked as a technical consultant and has written hundreds of articles about technology for both print and online publications, with topics ranging from predictive analytics to 5D storage to the dark web. He’s also contributed to over a dozen books on technology, developed courseware for Microsoft’s training program, and served as a developmental editor on Microsoft certification exams. When not writing about technology, he’s working on a novel or venturing out into the spectacular Northwest woods.

228

6