{"id":54563,"date":"2016-07-22T00:00:00","date_gmt":"2016-07-22T00:00:00","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/sql-server-statistics-basics\/"},"modified":"2022-05-06T17:32:41","modified_gmt":"2022-05-06T17:32:41","slug":"sql-server-statistics-basics","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/performance-sql-server\/sql-server-statistics-basics\/","title":{"rendered":"SQL Server Statistics Basics"},"content":{"rendered":"<div class=\"article-content\">\n<p>The SQL Server query optimizer uses distribution statistics when creating a query plan. The statistics provide information about the distribution of column values across participating rows, helping the optimizer better estimate the number of rows, or <i>cardinality,<\/i> of the query results.<\/p>\n<p>The statistics provide the optimizer with clues on the best ways to process the query and consequently maximize performance. For example, the statistics might lead the optimizer to choose an index seek over an index scan, a potentially more resource-intensive operation. Without statics, it would be difficult for the optimizer to determine the best processing strategy in advance.<\/p>\n<p>The query optimizer, for the most part, takes care of generating and updating the distribution statistics, creating them on certain columns within a table or indexed view. Each set of statistics is stored as a table-level object that tracks the distribution of values across the first column in any given set. If a statistics object is defined on multiple columns, the object also stores details about the correlation of values across the columns. These correlation statistics, referred to as <i>densities,<\/i> are based on the number of distinct rows of column values.<\/p>\n<p>Although densities play an important role in query optimization, this article is concerned primarily with distribution statistics as they are accessed through the statistics object. SQL Server provides the statistics in the form of a histogram that shows how data is distributed within a column.<\/p>\n<p>When the query optimizer is doing its job and your queries are performing as you would expect, you don&#8217;t need to worry too much about histograms, statistics objects, or distribution statistics in general. On the other hand, if you&#8217;re trying to maximize query performance or track down performance-related issues, it can be worth investigating how the statistics are being used and generated and perhaps making adjustments accordingly.<\/p>\n<p>Before you can do that, however, you must have a basic understanding of how statistics are implemented in SQL Server, how to track down their related histograms, and how to make sense of the information, which is what we address in this article.<\/p>\n<h1>Getting started<\/h1>\n<p>As with many areas of SQL Server, distribution statistics can be easier to understand if you see them in action, rather than simply reading about them in the abstract. To help with that process, this article includes a number of examples that demonstrate how distribution statistics get generated and how to access information about them.<\/p>\n<p>For these examples, I used the following T-SQL script to create the <code>AWSales<\/code> table and populate it with data from the <code>AdventureWorks2014<\/code> sample database:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">IF OBJECT_ID (N'dbo.AWSales', N'U') IS NOT NULL\r\nDROP TABLE dbo.AWSales;\r\nGO\r\n\u00a0\r\nCREATE TABLE dbo.AWSales\r\n(\r\n\u00a0 OrderID INT PRIMARY KEY IDENTITY(101, 1),\r\n\u00a0 SalesOrderID INT NOT NULL,\r\n\u00a0 OrderDetailID INT NOT NULL,\r\n\u00a0 ProductID INT NOT NULL,\r\n\u00a0 LineTotal MONEY NOT NULL\r\n);\r\n\r\nINSERT INTO dbo.AWSales \r\n\u00a0 (SalesOrderID, OrderDetailID, ProductID, LineTotal)\r\nSELECT SalesOrderID, SalesOrderDetailID, ProductID, LineTotal\r\nFROM AdventureWorks2014.Sales.SalesOrderDetail\r\nWHERE SalesOrderDetailID &lt; 60000;\r\n\t<\/pre>\n<p>That&#8217;s all the set-up you need to do to try out the examples for yourself. I created the examples on a local instance of SQL Server 2014, but you should be able to apply the principles we cover here to any version of SQL Server from 2008 onward, as well as to Azure SQL Database, Azure SQL Data Warehouse, and Parallel Data Warehouse.<\/p>\n<h1>Retrieving information about a statistics object<\/h1>\n<p>SQL Server generates a statistics object automatically when you create an index on a table or indexed view. For example, when we created the <code>AWSales<\/code> table, SQL Server generated a statistics object based on the <code>OrderID<\/code> column. Because the column is defined as the primary key, SQL Server automatically created a clustered index on that column and generated the associated statistics object.<\/p>\n<p>SQL Server provides the following two system views for retrieving high-level information about the statistics objects associated with a table or indexed view:<\/p>\n<ul>\n<li>\n<p class=\"MsoListParagraphCxSpFirst\">\u00a0<code>sys.stats:<\/code> Returns the ID of the table or indexed view, name and ID of the statistics object, filter expression (if applicable), and a number of bit flags that indicate such information as whether the statistics object was auto-generated or user-defined.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpLast\"><code>sys.stats_columns:<\/code> Returns the parent object ID, statistics object ID, ordinal position of each column included in the statistics object, and the column ID, as it is listed within <code>sys.columns<\/code>.<\/p>\n<\/li>\n<\/ul>\n<p>You can join the two views based on the object and statistic IDs in order to retrieve specific types of information. You can also add the <code>sys.columns<\/code> view to retrieve the actual column names, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT s.stats_id StatsID,\r\n\u00a0 s.name StatsName,\r\n\u00a0 sc.stats_column_id StatsColID,\r\n\u00a0 c.name ColumnName \r\nFROM sys.stats s \r\n\u00a0 INNER JOIN sys.stats_columns sc\r\n\u00a0\u00a0\u00a0 ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id\r\n\u00a0 INNER JOIN sys.columns c\r\n\u00a0\u00a0\u00a0 ON sc.object_id = c.object_id AND sc.column_id = c.column_id\r\nWHERE OBJECT_NAME(s.object_id) = 'awsales'\r\nORDER BY s.stats_id, sc.column_id;\r\n\r\n<\/pre>\n<p>In this case, all I&#8217;m interested in the table ID and statistics ID, along with the columns on which the statistics are based. As the following results show, the table currently includes only one statistics object and that object is based on only one column, <code>OrderID<\/code>.<\/p>\n<table class=\"MsoTableGrid\">\n<tbody>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsName<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsColID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> ColumnName<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">PK__AWSales__C3905BAF9E58F2FD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderID<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>When a statistics object is generated as a result of creating an index, the object receives the index name. In addition, if the statistics object had been based on multiple columns, the <code>SELECT<\/code> statement would have returned a row for each column.<\/p>\n<h1>Column-based statistics objects<\/h1>\n<p>SQL Server also generates a statistics object when you include a column in a query predicate such as a <code>WHERE<\/code> clause. For example, suppose we query the <code>AWSales<\/code> table based on a <code>ProductID<\/code> value:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT *\r\nFROM AWSales \r\nWHERE ProductID = 898\r\nOPTION(RECOMPILE);\r\n\r\n<\/pre>\n<p>I tagged on the <code>RECOMPILE<\/code> query hint to force the optimizer to discard the query plan and ensure fresh results each time I ran the <code>SELECT<\/code> statement and viewed the execution plan, which we&#8217;ll discuss shortly.<\/p>\n<p>On my system, the query returned nine rows. I then reran the following <code>SELECT<\/code> statement to see an updated list of statistics objects:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT s.stats_id StatsID,\r\n\u00a0 s.name StatsName,\r\n\u00a0 sc.stats_column_id StatsColID,\r\n\u00a0 c.name ColumnName \r\nFROM sys.stats s \r\n\u00a0 INNER JOIN sys.stats_columns sc\r\n\u00a0\u00a0\u00a0 ON s.object_id = sc.object_id AND s.stats_id = sc.stats_id\r\n\u00a0 INNER JOIN sys.columns c\r\n\u00a0\u00a0\u00a0 ON sc.object_id = c.object_id AND sc.column_id = c.column_id\r\nWHERE OBJECT_NAME(s.object_id) = 'awsales'\r\nORDER BY s.stats_id, sc.column_id;\r\n<\/pre>\n<p>The <code>SELECT<\/code> statement is the same as in the preceding section, only now we have a new statistics object, as shown in the following results.<\/p>\n<table class=\"MsoTableGrid\">\n<tbody>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsName<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsColID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> ColumnName<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">PK__AWSales__C3905BAF9E58F2FD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">2<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">_WA_Sys_00000004_6C5905DD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ProductID<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>SQL Server automatically created the <code>_WA_Sys_00000004_6C5905DD<\/code> statistics object based on the <code>ProductID<\/code> column. In this case, the name of the auto-generated object is prefixed with <code>_WA<\/code>, the two-letter code for Washington state in the US. For SQL Server to generate the column-based objects, the <code>AUTO_CREATE_STATISTICS<\/code> database option must be enabled, which it is by default.<\/p>\n<h1>Index-based statistics objects<\/h1>\n<p>As we discussed earlier, when we defined a primary key on our table, SQL Server generated a statistics object based on the clustered index key (the <code>OrderID<\/code> column). SQL Server will also generate a statistics object if we create an index manually, such as the one shown in the following script:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">IF EXISTS (SELECT name FROM sys.indexes WHERE name = N'ixOrders')\u00a0 \r\nDROP INDEX ixOrders ON dbo.AWSales; \r\nGO\r\n\r\nCREATE INDEX ixOrders ON dbo.AWSales(SalesOrderID, OrderDetailID);\r\n\r\n<\/pre>\n<p>In this case, we&#8217;ve created the <code>ixOrders<\/code> nonclustered index on the <code>SalesOrderID<\/code> and <code>OrderDetailID<\/code> columns. If we were to now query the <code>sys.stats<\/code> and <code>sys.stats_columns<\/code> system views, using the same <code>SELECT<\/code> statement as before, we would find that a third statistics object has been added to the <code>AWSales<\/code> table, as shown in the following results.<\/p>\n<table class=\"MsoTableGrid\">\n<tbody>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsName<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsColID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> ColumnName<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">PK__AWSales__C3905BAF9E58F2FD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">2<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">_WA_Sys_00000004_6C5905DD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ProductID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">3<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ixOrders<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">SalesOrderID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">3<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ixOrders<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">2<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderDetailID<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>The results include two rows for the new statistics object, one for each column in the <code>ixOrders<\/code> index. Notice that the <code>StatsColID<\/code> values are incremented for each column, but the name and ID of the statistics object are the same in each row, along with the object name being the same as the index name.<\/p>\n<h1>Statistics histograms<\/h1>\n<p>Up to this point, we&#8217;ve seen only how statistics objects are added. Although the <code>sys.stats<\/code> and <code>sys.stats_columns<\/code> system views are useful for retrieving high-level information, they provide us with no insight into the actual distribution statistics. For that, we must turn to the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement.<\/p>\n<p>By default, the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement returns the following three types of information:<\/p>\n<ul>\n<li>\n<p class=\"MsoListParagraphCxSpFirst\"><b>Header:<\/b> Name of the statistics object, date and time it was last updated, number of rows in the participating columns, actual number of sampled rows, number of steps within the histogram, and other information<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpMiddle\">\u00a0<b>Densities:<\/b> Density vector of the indexed columns based on the formula <i>1\/#_distinct_values<\/i>. The number of distinct values help to determine the selectivity of the column values. The more unique values, the higher the selectivity and the more effective the index. The number of rows in this section depends on the number of participating columns.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpLast\"><b>Histogram:<\/b> Value distribution across the column data, incremented in steps based on the number of rows, up to 200 steps per statistics object.<\/p>\n<\/li>\n<\/ul>\n<p>At a minimum, when running the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement, you must specify the target table or indexed view and the target index, column, or statistics object. For example, the following <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement specifies the <code>AWSales<\/code> table and the <code>ixOrders<\/code> statistics object:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DBCC SHOW_STATISTICS (AWSales, ixOrders);\r\n\t<\/pre>\n<p>The <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement supports several other options as well, such as the ability to specify that the results include only the histogram, but for now, we&#8217;ll stick with the most basic form of the statement, which returns the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image002.gif\" alt=\"2444-clip_image002.gif\" width=\"623\" height=\"267\" \/><\/p>\n<p>The first section provides the header information, the second section shows the densities, and the third section gives us our histogram.<\/p>\n<p>In this case, the histogram is broken into 184 steps, based on ranges of data in the first indexed column, <code>SalesOrderID<\/code>. Each step is listed in its own row in the histogram and represents a range of values. For example, the value in the <code>RANGE_HI_KEY<\/code> column of row 7 (the highlighted row) is <code>44133<\/code>. This value represents the highest value in a data range that includes all values from <code>44101<\/code> (row 6 + 1) through <code>44113<\/code>.<\/p>\n<p>The histogram also includes the <code>RANGE_ROWS<\/code> column, which provides the number of rows within the range, excluding the rows associated with the <code>RANGE_HI_KEY<\/code> value, the upper end of the range. In this case, there are 343 rows in the range, excluding the upper range value. However, the <code>EQ_ROWS<\/code> value indicates the number of rows that <i>are<\/i> associated with the upper end value. That means 17 rows have a the <code>SalesOrderID<\/code> value of <code>44133<\/code>.<\/p>\n<p>The <code>DISTINCT_RANGE_ROWS<\/code> column shows the number of unique values within the current range, once again excluding the upper range value. As a result, the range of values from <code>44101<\/code> through <code>44132<\/code> includes 32 unique values.<\/p>\n<p>Finally, the <code>AVG_RANGE_ROWS<\/code> column shows the average number of rows for each distinct value, based on the formula <i>range_rows\/distinct_range_rows,<\/i> once again excluding the upper end rows. This gives us an average of over 10 rows for each distinct value within the range.<\/p>\n<p>The query optimizer uses these values in various ways to estimate the number of rows that will be in the result set. For example, suppose we run the following <code>SELECT<\/code> statement, which limits the result set to rows with a <code>SalesOrderID<\/code> value of <code>44133<\/code> (row 7 in our histogram).<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT *\r\nFROM AWSales \r\nWHERE SalesOrderID = 44133\r\nOPTION(RECOMPILE);\r\n<\/pre>\n<p>Now let&#8217;s look at the execution plan used for the query. The following figure shows the pop-up details for the <code>SELECT<\/code> node of the execution plan, as it appeared in my local installation of SQL Server Management Studio (SSMS).<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image004.gif\" alt=\"2444-clip_image004.gif\" width=\"452\" height=\"252\" \/><\/p>\n<p>Notice that the <code>Estimated<\/code> <code>Number<\/code> <code>of<\/code> <code>Rows<\/code> amount shows <code>17<\/code>. This figure corresponds exactly to the <code>EQ_ROWS<\/code> value in the statistics histogram for row 7. This is also the number of rows that the query returned.<\/p>\n<p>Now suppose we specify the <code>SalesOrderID<\/code> value of <code>44134<\/code> in our query:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT *\r\nFROM AWSales\r\nWHERE SalesOrderID = 44134\r\nOPTION(RECOMPILE);\r\n\r\n<\/pre>\n<p>This time, the query returns only one row, but the execution plan indicates something other than 1, as shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image006.gif\" alt=\"2444-clip_image006.gif\" width=\"450\" height=\"277\" \/><\/p>\n<p>Because we&#8217;ve reference a <code>SalesOrderID<\/code> value that is within the next range, but is not its own row, the execution plan uses the <code>AVG_RANGE_ROWS<\/code> value from row 8, which in this case is <code>1.493507<\/code>, the average number of rows per distinct values with that range. (The optimizer rounds the figure up to 1.49351.)<\/p>\n<p>Although this is not a big deal when returning one row, it could make a difference when returning multiple rows from a large data set, leading you to consider creating statistics or updating them. (More on that in a bit.)<\/p>\n<p>Let&#8217;s look at one more example. This time, we&#8217;ll specify a range of <code>SalesOrderID<\/code> values:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT *\r\nFROM AWSales\r\nWHERE SalesOrderID &lt; 43861\r\nOPTION(RECOMPILE);\r\n\r\n<\/pre>\n<p>The statement returns 634 rows, which are all the rows with a <code>SalesOrderID<\/code> value less than <code>43861<\/code> (row 3 in the histogram). The execution plan also comes up with an <code>Estimated<\/code> <code>Number<\/code> <code>of<\/code> <code>Rows<\/code> value of <code>634<\/code>, as shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image008.gif\" alt=\"2444-clip_image008.gif\" width=\"462\" height=\"316\" \/><\/p>\n<p>To arrive at 634, optimizer added together statistics from rows 1, 2, and 3, giving us the following formula:<\/p>\n<p><code>12 + 282 + 28 + 312 = 634<\/code><\/p>\n<p>As this last example demonstrates, the query optimizer can use the distribution statistics in various ways to estimate the number of rows to expect in the result set. Be aware, however, that the logic can become much more complex than what we&#8217;ve shown here, depending on how the data is being queried. The point is, the distribution statistics play a pivotal row in calculating the query plan, and knowing how to access those statistics is vital in order to make sense of them.<\/p>\n<h1>Creating statistics<\/h1>\n<p>At times, the statistics automatically generated by SQL Server won&#8217;t be optimal for the type of queries you&#8217;re performing, in which case, you can create your own statistics. Microsoft recommends that you create statistics under the following circumstances:<\/p>\n<ul>\n<li>\n<p class=\"MsoListParagraphCxSpFirst\">The Database Engine Tuning Advisor recommends it.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpMiddle\">A query retrieves only a subset of data from a targeted column or index.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpMiddle\">A query&#8217;s predicate references multiple correlated columns in separate indexes.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpLast\">The query optimizer is missing the statistics it needs to generate an effective query plan.<\/p>\n<\/li>\n<\/ul>\n<p>For example, suppose we want to query data in the <code>AWSales<\/code> table based on the values in the <code>LineTotal<\/code> columns, but our queries generally target only rows with a <code>LineTotal<\/code> value greater than $100. We can create a filtered statistics object based on those values, as shown in the following example:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">IF EXISTS(SELECT name FROM sys.stats\u00a0 \r\n\u00a0 WHERE name = N'TotalStats'\u00a0 \r\n\u00a0\u00a0\u00a0 AND object_id = OBJECT_ID(N'dbo.AWSales'))\r\nDROP STATISTICS dbo.AWSales.TotalStats;\u00a0 \r\nGO \r\n\u00a0\r\nCREATE STATISTICS TotalStats ON dbo.AWSales(LineTotal)\u00a0 \r\nWHERE LineTotal &gt; 100\u00a0 \r\nWITH FULLSCAN;\r\n\r\n<\/pre>\n<p>The <code>CREATE<\/code> <code>STATISTICS<\/code> statement lets us create a statistics object on one or more columns in a table or indexed view. In this case, we&#8217;re targeting the <code>LineTotal<\/code> column in the <code>AWSales<\/code> table, but only for values over 100, as indicated in the <code>WHERE<\/code> clause. This is our filter, thus the term <i> filtered statistics<\/i>.<\/p>\n<p>The <code>CREATE<\/code> <code>STATISTICS<\/code> statement also includes the <code>WITH<\/code> <code>FULLSCAN<\/code> clause, which specifies that all rows should be scanned when creating the statistics. However, you can instead specify that a percentage of the data be sampled, rather than every row.<\/p>\n<p>If we were to now query the <code>sys.stats<\/code> and <code>sys.stats_columns<\/code> system views, using the same <code>SELECT<\/code> statement as before, we would find that a fourth statistics object, <code>TotalStats<\/code>, has been added to the <code>AWSales<\/code> table, as shown in the following results.<\/p>\n<table class=\"MsoTableGrid\">\n<tbody>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsName<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> StatsColID<\/b><\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\"><b> ColumnName<\/b><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">PK__AWSales__C3905BAF9E58F2FD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">2<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">_WA_Sys_00000004_6C5905DD<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ProductID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">3<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ixOrders<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">SalesOrderID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">3<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">ixOrders<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">2<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">OrderDetailID<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td valign=\"top\">\n<p class=\"MsoNormal\">4<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">TotalStats<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">1<\/p>\n<\/td>\n<td valign=\"top\">\n<p class=\"MsoNormal\">LineTotal<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We can also run a <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement against the new statistics object to see how the histogram has been mapped out:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DBCC SHOW_STATISTICS (AWSales, TotalStats);\r\n\t<\/pre>\n<p>The following table shows the statement results. Notice that line 7, highlighted in the histogram, has a <code>RANGE_HI_KEY<\/code> value of <code>120<\/code> and an <code>EQ_ROWS<\/code> value of <code>99<\/code>.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image010.gif\" alt=\"2444-clip_image010.gif\" width=\"624\" height=\"265\" \/><\/p>\n<p>Now let&#8217;s query the <code>AWSales<\/code> once again, this time including a <code>WHERE<\/code> clause that specifies that the <code>LineTotal<\/code> value should equal <code>120<\/code>:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">SELECT *\r\nFROM AWSales\r\nWHERE LineTotal = 120\r\nOPTION(RECOMPILE);\r\n\t<\/pre>\n<p>The statement returns 99 rows, which is the same value in the <code>EQ_ROWS<\/code> column of the histogram and the same value as the <code>Estimated<\/code> <code>Number<\/code> <code>of<\/code> <code>Rows<\/code> value in the execution plan, as shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image012.gif\" alt=\"2444-clip_image012.gif\" width=\"622\" height=\"218\" \/><\/p>\n<p>You can, of course, play with your queries and compare the execution plan estimates with the histogram. In fact, this is a good way to get a better sense of how execution plans relate to the statistics. Just know that some of the optimizer&#8217;s calculations can be a bit complex, with the logic of how it got from point A to point B less than apparent.<\/p>\n<h1>Updating statistics<\/h1>\n<p>As with generating statistics, SQL Server usually does a good job keeping them up-to-date. In some cases, however, you will need to intervene and update the statistics manually. Microsoft recommends that you update statistics under the following circumstances:<\/p>\n<ul>\n<li>\n<p class=\"MsoListParagraphCxSpFirst\">Your queries are running slower than you would expect.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpMiddle\">You insert data into ascending or descending key columns such as <code>IDENTITY<\/code> columns.<\/p>\n<\/li>\n<li>\n<p class=\"MsoListParagraphCxSpLast\">You perform certain maintenance operations, such as truncating a table or performing a bulk insert.<\/p>\n<\/li>\n<\/ul>\n<p>For example, suppose we add another chunk of data to the <code>AWSales<\/code> table, using the following <code>INSERT<\/code> statement:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">INSERT INTO dbo.AWSales \r\n\u00a0 (SalesOrderID, OrderDetailID, ProductID, LineTotal)\r\nSELECT SalesOrderID, SalesOrderDetailID, ProductID, LineTotal\r\nFROM AdventureWorks2014.Sales.SalesOrderDetail\r\nWHERE SalesOrderDetailID &gt;= 60000;\r\n\r\n<\/pre>\n<p>If we were to run the following <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement again, we would receive the same results as when we ran it in the previous section:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">DBCC SHOW_STATISTICS (AWSales, TotalStats);\r\n<\/pre>\n<p>On my system, the <code>Updated<\/code> column in the header results showed the date as July 9, 2016 and the time as 2:02 PM, a time before I ran the preceding <code>INSERT<\/code> statement, indicating that the histogram remained unchanged, despite the addition of over 60,000 rows.<\/p>\n<p>To make certain our queries have the most up-to-date statistics, we can run an <code>UPDATE<\/code> <code>STATISTICS<\/code> statement against the <code>AWSales<\/code> table:<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">UPDATE STATISTICS dbo.AWSales\r\nWITH FULLSCAN;\r\n\r\n<\/pre>\n<p>I&#8217;ve included the <code>WITH<\/code> <code>FULLSCAN<\/code> clause to tell SQL Server to scan the entire column in each statistics object when updating the statistics. By default, SQL Server uses only a sample of the data, rather than calculating the statistics based on all data. A full scan normally provides more accurate statistics, but it can result in more resource-intensive operations.<\/p>\n<p>After updating the statistics, we can re-run the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement, which gives us the results shown in the following figure.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/imported\/2444-clip_image014.gif\" alt=\"2444-clip_image014.gif\" width=\"623\" height=\"262\" \/><\/p>\n<p>Notice that the time has been updated and that the histogram steps define slightly different ranges. For a small data set, these differences might not seem like much, but when you start talking billions of rows, updating statistics after certain operations can be integral to maintaining your query&#8217;s performance.<\/p>\n<h1>More on statistics<\/h1>\n<p>There&#8217;s a lot more to statistics than what we&#8217;ve covered here, but this should give you a general sense of how you can start working with them if you find your query performance lagging. The key is in knowing where to look and how to correlate what you find to the optimizer&#8217;s execution plans.<\/p>\n<p>To this end, the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement will likely be one of your best tools, along with the ability to view execution plans in SSMS. However, also be sure to refer to SQL Server documentation for more details about the <code>DBCC<\/code> <code>SHOW_STATISTICS<\/code> statement, the <code>sys.stats<\/code> and <code>sys.stats_columns<\/code> system views, and distribution statistics in general. The better you understand how statistics work, the better you can optimize your queries to ensure their maximum performance.<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Distribution statistics are used by SQL Server&#8217;s Query Optimiser to determine a good execution plan for your SQL query. You don&#8217;t need to know about them to execute queries, but the better you understand them, the better you can optimise your queries and sort out performance problems. Robert Sheldon once more provides a simple guide.&hellip;<\/p>\n","protected":false},"author":221841,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143529],"tags":[5842],"coauthors":[6779],"class_list":["post-54563","post","type-post","status-publish","format-standard","hentry","category-performance-sql-server","tag-sql-monitor"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/54563","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/221841"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=54563"}],"version-history":[{"count":6,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/54563\/revisions"}],"predecessor-version":[{"id":94199,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/54563\/revisions\/94199"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=54563"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=54563"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=54563"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=54563"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}