SQL Server Statistics Basics

Distribution statistics are used by SQL Server's Query Optimiser to determine a good execution plan for your SQL query. You don't need to know about them to execute queries, but the better you understand them, the better you can optimise your queries and sort out performance problems. Robert Sheldon once more provides a simple guide.

The SQL Server query optimizer uses distribution statistics when creating a query plan. The statistics provide information about the distribution of column values across participating rows, helping the optimizer better estimate the number of rows, or cardinality, of the query results.

The statistics provide the optimizer with clues about the best ways to process the query and, consequently, maximize performance. For example, the statistics might lead the optimizer to choose an index seek rather than an index scan, a potentially more resource-intensive operation. Without statistics, it would be difficult for the optimizer to determine the best processing strategy in advance.

The query optimizer, for the most part, takes care of generating and updating the distribution statistics, creating them on certain columns within a table or indexed view. Each set of statistics is stored as a table-level object that tracks the distribution of values across the first column in any given set. If a statistics object is defined on multiple columns, the object also stores details about the correlation of values across the columns. These correlation statistics, referred to as densities, are based on the number of distinct combinations of column values.

Although densities play an important role in query optimization, this article is concerned primarily with distribution statistics as they are accessed through the statistics object. SQL Server provides the statistics in the form of a histogram that shows how data is distributed within a column.

When the query optimizer is doing its job and your queries are performing as you would expect, you don’t need to worry too much about histograms, statistics objects, or distribution statistics in general. On the other hand, if you’re trying to maximize query performance or track down performance-related issues, it can be worth investigating how the statistics are being used and generated and perhaps making adjustments accordingly.

Before you can do that, however, you must have a basic understanding of how statistics are implemented in SQL Server, how to track down their related histograms, and how to make sense of the information, which is what we address in this article.

Getting started

As with many areas of SQL Server, distribution statistics can be easier to understand if you see them in action, rather than simply reading about them in the abstract. To help with that process, this article includes a number of examples that demonstrate how distribution statistics get generated and how to access information about them.

For these examples, I used the following T-SQL script to create the AWSales table and populate it with data from the AdventureWorks2014 sample database:
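-- Reconstruction of the elided setup script. The table and column names come
-- from the article; the data types, the dbo schema, and the use of
-- Sales.SalesOrderDetail as the source data are my assumptions.
USE AdventureWorks2014;
GO

CREATE TABLE dbo.AWSales (
  OrderID INT IDENTITY PRIMARY KEY,
  SalesOrderID INT NOT NULL,
  OrderDetailID INT NOT NULL,
  ProductID INT NOT NULL,
  LineTotal MONEY NOT NULL
);
GO

INSERT INTO dbo.AWSales (SalesOrderID, OrderDetailID, ProductID, LineTotal)
SELECT SalesOrderID, SalesOrderDetailID, ProductID, LineTotal
FROM Sales.SalesOrderDetail;
GO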

That’s all the set-up you need to do to try out the examples for yourself. I created the examples on a local instance of SQL Server 2014, but you should be able to apply the principles we cover here to any version of SQL Server from 2008 onward, as well as to Azure SQL Database, Azure SQL Data Warehouse, and Parallel Data Warehouse.

Retrieving information about a statistics object

SQL Server generates a statistics object automatically when you create an index on a table or indexed view. For example, when we created the AWSales table, SQL Server generated a statistics object based on the OrderID column. Because the column is defined as the primary key, SQL Server automatically created a clustered index on that column and generated the associated statistics object.

SQL Server provides the following two system views for retrieving high-level information about the statistics objects associated with a table or indexed view:

  • sys.stats: Returns the ID of the table or indexed view, name and ID of the statistics object, filter expression (if applicable), and a number of bit flags that indicate such information as whether the statistics object was auto-generated or user-defined.

  • sys.stats_columns: Returns the parent object ID, statistics object ID, ordinal position of each column included in the statistics object, and the column ID, as it is listed within sys.columns.

You can join the two views based on the object and statistic IDs in order to retrieve specific types of information. You can also add the sys.columns view to retrieve the actual column names, as shown in the following example:
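-- A sketch of the join described above (the column aliases and ORDER BY
-- clause are my own):
SELECT
  s.object_id AS TableID,
  s.stats_id AS StatsID,
  s.name AS StatsName,
  sc.stats_column_id AS StatsColID,
  c.name AS ColumnName
FROM sys.stats s
  INNER JOIN sys.stats_columns sc
    ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id
  INNER JOIN sys.columns c
    ON sc.object_id = c.object_id AND sc.column_id = c.column_id
WHERE s.object_id = OBJECT_ID('dbo.AWSales')
ORDER BY s.stats_id, sc.stats_column_id;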

In this case, all I’m interested in is the table ID and the statistics ID, along with the columns on which the statistics are based. As the following results show, the table currently includes only one statistics object, and that object is based on only one column, OrderID.

StatsID  StatsName                      StatsColID  ColumnName
-------  -----------------------------  ----------  ----------
1        PK__AWSales__C3905BAF9E58F2FD  1           OrderID

When a statistics object is generated as a result of creating an index, the object receives the index name. In addition, if the statistics object had been based on multiple columns, the SELECT statement would have returned a row for each column.

Column-based statistics objects

SQL Server also generates a statistics object when you include a column in a query predicate such as a WHERE clause. For example, suppose we query the AWSales table based on a ProductID value:
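-- A sketch of the elided query; the ProductID value 776 is hypothetical
-- (on the author's system, the query returned nine rows).
SELECT * FROM dbo.AWSales
WHERE ProductID = 776
OPTION (RECOMPILE);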

I tagged on the RECOMPILE query hint to force the optimizer to discard the query plan and ensure fresh results each time I ran the SELECT statement and viewed the execution plan, which we’ll discuss shortly.

On my system, the query returned nine rows. I then reran the earlier SELECT statement against the system views to see an updated list of statistics objects.

The SELECT statement is the same as in the preceding section, only now we have a new statistics object, as shown in the following results.

StatsID  StatsName                      StatsColID  ColumnName
-------  -----------------------------  ----------  ----------
1        PK__AWSales__C3905BAF9E58F2FD  1           OrderID
2        _WA_Sys_00000004_6C5905DD      1           ProductID

SQL Server automatically created the _WA_Sys_00000004_6C5905DD statistics object based on the ProductID column. The name of an auto-generated object is prefixed with _WA, the two-letter code for Washington state in the US; the middle portion reflects the ID of the column on which the statistics are based, and the final portion is the table’s object ID in hexadecimal. For SQL Server to generate these column-based objects, the AUTO_CREATE_STATISTICS database option must be enabled, which it is by default.

Index-based statistics objects

As we discussed earlier, when we defined a primary key on our table, SQL Server generated a statistics object based on the clustered index key (the OrderID column). SQL Server will also generate a statistics object if we create an index manually, such as the one shown in the following script:
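-- Reconstructed from the description that follows; the dbo schema is assumed.
CREATE NONCLUSTERED INDEX ixOrders
ON dbo.AWSales (SalesOrderID, OrderDetailID);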

In this case, we’ve created the ixOrders nonclustered index on the SalesOrderID and OrderDetailID columns. If we were to now query the sys.stats and sys.stats_columns system views, using the same SELECT statement as before, we would find that a third statistics object has been added to the AWSales table, as shown in the following results.

StatsID  StatsName                      StatsColID  ColumnName
-------  -----------------------------  ----------  ----------
1        PK__AWSales__C3905BAF9E58F2FD  1           OrderID
2        _WA_Sys_00000004_6C5905DD      1           ProductID
3        ixOrders                       1           SalesOrderID
3        ixOrders                       2           OrderDetailID

The results include two rows for the new statistics object, one for each column in the ixOrders index. Notice that the StatsColID values are incremented for each column, but the name and ID of the statistics object are the same in each row, along with the object name being the same as the index name.

Statistics histograms

Up to this point, we’ve seen only how statistics objects are added. Although the sys.stats and sys.stats_columns system views are useful for retrieving high-level information, they provide us with no insight into the actual distribution statistics. For that, we must turn to the DBCC SHOW_STATISTICS statement.

By default, the DBCC SHOW_STATISTICS statement returns the following three types of information:

  • Header: Name of the statistics object, date and time it was last updated, number of rows in the participating columns, actual number of sampled rows, number of steps within the histogram, and other information

  • Densities: Density vector of the indexed columns, based on the formula 1/number_of_distinct_values. The number of distinct values helps determine the selectivity of the column values: the more unique values, the higher the selectivity and the more effective the index. The number of rows in this section depends on the number of participating columns.

  • Histogram: Value distribution across the column data, incremented in steps based on the number of rows, up to 200 steps per statistics object.

At a minimum, when running the DBCC SHOW_STATISTICS statement, you must specify the target table or indexed view and the target index, column, or statistics object. For example, the following DBCC SHOW_STATISTICS statement specifies the AWSales table and the ixOrders statistics object:
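-- Reconstruction of the elided statement:
DBCC SHOW_STATISTICS ('dbo.AWSales', ixOrders);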

The DBCC SHOW_STATISTICS statement supports several other options as well, such as the ability to specify that the results include only the histogram, but for now, we’ll stick with the most basic form of the statement, which returns the results shown in the following figure.

[Figure: DBCC SHOW_STATISTICS output for the ixOrders statistics object, showing the header, density, and histogram sections]

The first section provides the header information, the second section shows the densities, and the third section gives us our histogram.

In this case, the histogram is broken into 184 steps, based on ranges of data in the first indexed column, SalesOrderID. Each step is listed in its own row in the histogram and represents a range of values. For example, the value in the RANGE_HI_KEY column of row 7 (the highlighted row) is 44133. This value represents the highest value in a data range that includes all values from 44101 (one more than the RANGE_HI_KEY of row 6) through 44133.

The histogram also includes the RANGE_ROWS column, which provides the number of rows within the range, excluding the rows associated with the RANGE_HI_KEY value, the upper end of the range. In this case, there are 343 rows in the range, excluding the upper range value. The EQ_ROWS value, by contrast, indicates the number of rows associated with the upper end value, which means that 17 rows have a SalesOrderID value of 44133.

The DISTINCT_RANGE_ROWS column shows the number of unique values within the current range, once again excluding the upper range value. As a result, the range of values from 44101 through 44132 includes 32 unique values.

Finally, the AVG_RANGE_ROWS column shows the average number of rows for each distinct value, based on the formula range_rows/distinct_range_rows, once again excluding the upper-end rows. This gives us an average of just over 10 rows for each distinct value within the range (343/32, or about 10.72).

The query optimizer uses these values in various ways to estimate the number of rows that will be in the result set. For example, suppose we run the following SELECT statement, which limits the result set to rows with a SalesOrderID value of 44133 (row 7 in our histogram).
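-- A sketch of the elided query:
SELECT * FROM dbo.AWSales
WHERE SalesOrderID = 44133
OPTION (RECOMPILE);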

Now let’s look at the execution plan used for the query. The following figure shows the pop-up details for the SELECT node of the execution plan, as it appeared in my local installation of SQL Server Management Studio (SSMS).

[Figure: Execution plan pop-up for the SELECT node, showing an Estimated Number of Rows of 17]

Notice that the Estimated Number of Rows amount shows 17. This figure corresponds exactly to the EQ_ROWS value in the statistics histogram for row 7. This is also the number of rows that the query returned.

Now suppose we specify the SalesOrderID value of 44134 in our query:
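-- A sketch of the elided query:
SELECT * FROM dbo.AWSales
WHERE SalesOrderID = 44134
OPTION (RECOMPILE);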

This time, the query returns only one row, but the execution plan indicates something other than 1, as shown in the following figure.

[Figure: Execution plan pop-up for the SELECT node, showing an Estimated Number of Rows of 1.49351]

Because we’ve referenced a SalesOrderID value that falls within the next range but is not itself a RANGE_HI_KEY value, the execution plan uses the AVG_RANGE_ROWS value from row 8, which in this case is 1.493507, the average number of rows per distinct value within that range. (The optimizer rounds the figure up to 1.49351.)

Although this is not a big deal when returning one row, it could make a difference when returning multiple rows from a large data set, leading you to consider creating statistics or updating them. (More on that in a bit.)

Let’s look at one more example. This time, we’ll specify a range of SalesOrderID values:
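-- A sketch of the elided query; the predicate comes from the text below.
SELECT * FROM dbo.AWSales
WHERE SalesOrderID < 43861
OPTION (RECOMPILE);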

The statement returns 634 rows, which are all the rows with a SalesOrderID value less than 43861 (row 3 in the histogram). The execution plan also comes up with an Estimated Number of Rows value of 634, as shown in the following figure.

[Figure: Execution plan pop-up for the SELECT node, showing an Estimated Number of Rows of 634]

To arrive at 634, the optimizer added together statistics from rows 1, 2, and 3, giving us the following formula:

12 + 282 + 28 + 312 = 634

As this last example demonstrates, the query optimizer can use the distribution statistics in various ways to estimate the number of rows to expect in the result set. Be aware, however, that the logic can become much more complex than what we’ve shown here, depending on how the data is being queried. The point is, the distribution statistics play a pivotal role in calculating the query plan, and knowing how to access those statistics is vital to making sense of them.

Creating statistics

At times, the statistics automatically generated by SQL Server won’t be optimal for the type of queries you’re performing, in which case, you can create your own statistics. Microsoft recommends that you create statistics under the following circumstances:

  • The Database Engine Tuning Advisor recommends it.

  • A query retrieves only a subset of data from a targeted column or index.

  • A query’s predicate references multiple correlated columns in separate indexes.

  • The query optimizer is missing the statistics it needs to generate an effective query plan.

For example, suppose we want to query data in the AWSales table based on the values in the LineTotal column, but our queries generally target only rows with a LineTotal value greater than $100. We can create a filtered statistics object based on those values, as shown in the following example:
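-- Reconstructed from the description that follows:
CREATE STATISTICS TotalStats
ON dbo.AWSales (LineTotal)
WHERE LineTotal > 100
WITH FULLSCAN;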

The CREATE STATISTICS statement lets us create a statistics object on one or more columns in a table or indexed view. In this case, we’re targeting the LineTotal column in the AWSales table, but only for values over 100, as indicated in the WHERE clause. This is our filter, thus the term filtered statistics.

The CREATE STATISTICS statement also includes the WITH FULLSCAN clause, which specifies that all rows should be scanned when creating the statistics. However, you can instead specify that a percentage of the data be sampled, rather than every row.

If we were to now query the sys.stats and sys.stats_columns system views, using the same SELECT statement as before, we would find that a fourth statistics object, TotalStats, has been added to the AWSales table, as shown in the following results.

StatsID  StatsName                      StatsColID  ColumnName
-------  -----------------------------  ----------  ----------
1        PK__AWSales__C3905BAF9E58F2FD  1           OrderID
2        _WA_Sys_00000004_6C5905DD      1           ProductID
3        ixOrders                       1           SalesOrderID
3        ixOrders                       2           OrderDetailID
4        TotalStats                     1           LineTotal

We can also run a DBCC SHOW_STATISTICS statement against the new statistics object to see how the histogram has been mapped out:
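-- Reconstruction of the elided statement:
DBCC SHOW_STATISTICS ('dbo.AWSales', TotalStats);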

The following figure shows the statement results. Notice that row 7, highlighted in the histogram, has a RANGE_HI_KEY value of 120 and an EQ_ROWS value of 99.

[Figure: DBCC SHOW_STATISTICS output for the TotalStats statistics object, with row 7 of the histogram highlighted]

Now let’s query the AWSales table once again, this time including a WHERE clause that specifies that the LineTotal value should equal 120:
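-- A sketch of the elided query:
SELECT * FROM dbo.AWSales
WHERE LineTotal = 120
OPTION (RECOMPILE);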

The statement returns 99 rows, which is the same value in the EQ_ROWS column of the histogram and the same value as the Estimated Number of Rows value in the execution plan, as shown in the following figure.

[Figure: Execution plan pop-up for the SELECT node, showing an Estimated Number of Rows of 99]

You can, of course, play with your queries and compare the execution plan estimates with the histogram. In fact, this is a good way to get a better sense of how execution plans relate to the statistics. Just know that some of the optimizer’s calculations can be a bit complex, with the logic of how it got from point A to point B less than apparent.

Updating statistics

As with generating statistics, SQL Server usually does a good job keeping them up-to-date. In some cases, however, you will need to intervene and update the statistics manually. Microsoft recommends that you update statistics under the following circumstances:

  • Your queries are running slower than you would expect.

  • You insert data into ascending or descending key columns such as IDENTITY columns.

  • You perform certain maintenance operations, such as truncating a table or performing a bulk insert.

For example, suppose we add another chunk of data to the AWSales table, using the following INSERT statement:
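-- Sketch only: the original INSERT statement is elided. The source table and
-- the SalesOrderID filter below are hypothetical; per the text, the statement
-- added over 60,000 rows.
INSERT INTO dbo.AWSales (SalesOrderID, OrderDetailID, ProductID, LineTotal)
SELECT SalesOrderID, SalesOrderDetailID, ProductID, LineTotal
FROM Sales.SalesOrderDetail
WHERE SalesOrderID > 57000;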

If we were to run the following DBCC SHOW_STATISTICS statement again, we would receive the same results as when we ran it in the previous section:
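-- Same statement as in the previous section:
DBCC SHOW_STATISTICS ('dbo.AWSales', TotalStats);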

On my system, the Updated column in the header results showed the date as July 9, 2016 and the time as 2:02 PM, a time before I ran the preceding INSERT statement, indicating that the histogram remained unchanged, despite the addition of over 60,000 rows.

To make certain our queries have the most up-to-date statistics, we can run an UPDATE STATISTICS statement against the AWSales table:
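-- Reconstructed; the WITH FULLSCAN clause is named in the text that follows.
UPDATE STATISTICS dbo.AWSales WITH FULLSCAN;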

I’ve included the WITH FULLSCAN clause to tell SQL Server to scan the entire column in each statistics object when updating the statistics. By default, SQL Server uses only a sample of the data, rather than calculating the statistics based on all data. A full scan normally provides more accurate statistics, but it can result in more resource-intensive operations.

After updating the statistics, we can re-run the DBCC SHOW_STATISTICS statement, which gives us the results shown in the following figure.

[Figure: Updated DBCC SHOW_STATISTICS output for TotalStats, showing a new Updated time and slightly different histogram ranges]

Notice that the time has been updated and that the histogram steps define slightly different ranges. For a small data set, these differences might not seem like much, but when you start talking billions of rows, updating statistics after certain operations can be integral to maintaining your query’s performance.

More on statistics

There’s a lot more to statistics than what we’ve covered here, but this should give you a general sense of how you can start working with them if you find your query performance lagging. The key is in knowing where to look and how to correlate what you find to the optimizer’s execution plans.

To this end, the DBCC SHOW_STATISTICS statement will likely be one of your best tools, along with the ability to view execution plans in SSMS. However, also be sure to refer to SQL Server documentation for more details about the DBCC SHOW_STATISTICS statement, the sys.stats and sys.stats_columns system views, and distribution statistics in general. The better you understand how statistics work, the better you can optimize your queries to ensure their maximum performance.