We have all been using the
COUNT(DISTINCT) function to return row counts for a distinct list of column values in a table for ages. With the introduction of SQL Server 2019, there is a new way to get an estimate of distinct row values and counts for a table. It is by using the new
APPROX_COUNT_DISTINCT() function. This new function doesn’t return the actual number of rows for each distinct value in a table, but instead returns an approximate count for each distinct value. This new function uses fewer resources than the tried and true
COUNT(DISTINCT) function. Let’s take a little closer look at this new function.
What Problem is APPROX_COUNT_DISTINCT() Trying to Solve?
APPROX_COUNT DISTINCT() function is trying to solve the “Count Distinct Problem.” More information about this problem can be found here. Basically, the problem is that counting distinct values requires more and more memory as the number of distinct values increases. At some point, SQL Server can no longer manage to maintain the count of distinct values in memory and must spill to tempdb. Spilling to tempdb causes overhead and leads to increased execution time.
The implementation of
APPROX_COUNT_DISTINCT() has a much smaller memory requirement than the
COUNT(DISTINCT) function. The algorithm used for this new function is HyperLogLog. This algorithm can estimate the number of distinct values of greater than 1,000,000,000, where the accuracy of the calculated approximate distinct count value is within 2% of the actual distinct count value. It can do this while using less than 1.5 KB of memory.
More Information About the APPROX_COUNT_DISTINCT Function
APPROX_COUNT_DISTINCT function has been available in other vendor software packages for a while now, but it is new in SQL Server. It first became available in Azure SQL Database, and now has been released for on premises use with SQL Server 2019 Community Technical Preview (CTP) 2.0. This new function is design to quickly return the approximate number of unique non-null values in a group, when your data set has millions, or billions of rows, and many distinct values.
APPROX_COUNT_DISTINCT() function is designed to give you the approximate aggregated counts more quickly than using the
COUNT (DISTINCT) method. It does this by using the HyperLogLog algorithm, which reduces the memory footprint required to maintain a list of distinct values. Because less memory is needed,
APPROX_COUNT(DISTINCT) is, therefore, less likely to spill to tempdb when aggregating millions or billions of rows of data. This equates to shorter execution times compared to using the
The documentation states that this new function “guarantees up to a 2% error rate within a 97% accuracy.” Therefore, if speed is more important than absolute accuracy of the distinct count, then you might want to consider using this new function.
Invoking this function is similar to using the
COUNT(DISTINCT) function. Here is the syntax:
APPROX_COUNT_DISTINCT ( expression )
The expression can be any type, except for
Testing APPROX_COUNT_DISTINCT() for Performance
In order to test out the performance of this new function, I’ll use a specific use case to determine out how fast
APPROX_COUNT_DISTINCT() runs as compared to
COUNT(DISTINCT. In addition to verifying the speed, I’ll also look to see if I can detect if this new function uses a smaller memory footprint than the old
To gather and compare these metrics, I’m going to use this specific hypothetical use case and situation:
Develop a dashboard indicator that shows whether the number of unique IP addresses that have accessed our site for the current month is more, less or the same as the prior month.
I work at an internet search engine, and I am responsible for building queries to run against some of the data we have collected. One of the things we track are the IP addresses for every visit to our site. We collect billions of rows of data every month. My boss has asked me to develop a query that can be used as a dashboard trend indicator that will show whether the number of unique IP Addresses coming to our site has gone up (+) or down (-) in the current month, as compared to the prior month. My boss requires this new query to run as fast as possible across our IP Address tracking table that contains billions, and billions of rows. My boss is willing to accept some level of lesser precision in the accuracy of the number, provide the indicator and percentage of change can be produced more quickly than a method that has absolute accuracy.
To create a testing environment, I created a database and populated a table named Visits2 that has a billion rows of randomly generated data. You can find the code I used to create my sample database in Listing 3, at the end of this article.
First, I want to test how fast the
APPROX_COUNT_DISTINCT() function runs as compared to the
COUNT(DISTINCT) function. For this test, I ran two
SELECT queries that supported my use case. The first
SELECT statement uses the
COUNT(DISTINCT)function, and the second
SELECT statement uses the
APPROX_COUNT_DISTINCT() function. This code, found in Listing 1, generates a count of the number of unique IP Address that have accessed our site for July and August. You can also turn on “live query statistics” if you wish to monitor the progress of each
SELECT statement while they are running.
SET STATISTICS TIME ON;
SELECT COUNT(DISTINCT IP_Address) AS NumOfIP_Addresses, DATEPART (month,VisitDate) AS MonthNum
WHERE DATEPART (month,VisitDate) IN (7, 8)
GROUP BY DATEPART (month,VisitDate);
SELECT APPROX_COUNT_DISTINCT(IP_Address) AS NumOfIP_Addresses, DATEPART (month,VisitDate) AS MonthNum
WHERE DATEPART (month,VisitDate) IN (7, 8)
GROUP BY DATEPART (month,VisitDate);
This code turns on time statistics and runs each
SELECT statement. I will use the time statistics produced by running this code to determine how much CPU and elapsed time it takes to get distinct IP address counts for July and August. In Results 1, you can see the CPU and elapsed times I got when I ran the code in Listing 1.
In Results 1, I show the timing results of the output produced by setting the
STATISTICS TIME ON, for the two
SELECT statements. The first set of times are from the
COUNT(DISTINCT) query, whereas the second set of times are from the
APPROX_COUNT_DISTINCT query. As you can see,
APPROX_COUNT_DISTINCT used less CPU and elapsed time compared to the
APPROX_COUNT_DISTINCT() function ran a little over 6% faster than the
COUNT(DISTINCT) function and uses almost 24% less CPU.
To determine the number of memory grants used by each SELECT statement, I used the output of the execution plan. In Results 2 you can see the execution plan from these two SELECT queries.
The first thing you might notice when comparing these two executions is that the Cost of the Hash Match operation for the
COUNT(DISTINCT) function takes more than twice as long to run as the Hash Match operation for the
When I hover over the Hash Match operator for the
COUNT(DISTINCT) function, I can see it spilled to disk (see Results 3), but when I hover over the Hash Match operator for the
APPROX_COUNT_DISTINCT function, it did not spill to disk (see Results 4).
To display the number of memory grants between the two different functions, I hover over the SELECT icon in the execution plan. When I hover over the COUNT(DISTINCT) SELECT icon I get the results in Results 5, and when I hover over the APPROX_COUNT_DISTINCT SELECT icon I get the results in Result 6.
As you can see, the
COUNT(DISTINCT) function required 1,847,736 memory grants, whereas the
APPROX_COUNT_DISTINCT() function only required 24,776 memory grants. This means the
COUNT(DISTINCT) took 74 times as many memory grants over the new
APPROX_COUNT_DISTINCT() function. Based on this, you can see the
APPROX_COUNT_DISTINCT() function had a significantly smaller number of additional memory grants required versus the
Testing APPROX_COUNT_DISTINCT() for Accuracy
In addition to performance testing, I also wanted to test the accuracy/precision of distinct values produced between the
COUNT(DISTINCT) and the
APPROX_COUNT_DISTINCT() functions. To determine the precision difference between the two functions, I compared the number of distinct values produced by the two different
SELECT statements in Listing 1. The
COUNT(DISTINCT) function returned a count of 16,581,375 distinct IP address for July, whereas the
APPROX_COUNT_DISTINCT() function returns a count of 17,075,480 for July. By comparing these two numbers, I can see that the
APPROX_COUNT_DISTINCT() function is off by a little over 2.97%. Since this number was not within 2% error rate as documented, I contact Microsoft about this.
An engineer at Microsoft told me that the SQL Server engine uses a “sophisticated analytic sampling method” to produce an estimated count, and because of this, it is possible to build a contrived table that throws off the sampling method and produces dramatically inaccurate results.
From this first test, I can see that the
APPROX_COUNT_DISTINCT() function does, in fact, run faster and requires fewer resources over the
COUNT(DISTINCT) function. The only problem I ran across with this first test, was that I had more than a 2% precision error when generating the approximate count, over the real distinct count value. In my case, I found the precision error was 2.97%, which according to documentation can occur.
Since my initial testing found the
APPROX_COUNT_DISTINCT() function returned a distinct count that was more than 2% different than the actual distinct count, I decided to run some additional precision testing. The additional
SELECT statements I will be testing can be found in Listing 2.
SELECT COUNT(DISTINCT ID) AS COUNT_ID,
COUNT(DISTINCT I100) AS COUNT_I100,
COUNT(DISTINCT I1000) AS COUNT_I1000,
COUNT(DISTINCT I10000) AS COUNT_I10000,
COUNT(DISTINCT I1000000) AS COUNT_I1000000,
COUNT(DISTINCT I10000000) AS COUNT_I10000000
SELECT APPROX_COUNT_DISTINCT(ID) AS COUNT_ID,
APPROX_COUNT_DISTINCT(I100) AS COUNT_I100,
APPROX_COUNT_DISTINCT(I1000) AS COUNT_I1000,
APPROX_COUNT_DISTINCT(I10000) AS COUNT_I10000,
APPROX_COUNT_DISTINCT(I1000000) AS COUNT_I1000000,
APPROX_COUNT_DISTINCT(I10000000) AS COUNT_I10000000
The code in Listing 2 will identity the number of unique counts for columns that have a different number of unique values. I will use this code in Listing 2 to determine the number of different values it takes for the
APPROX_COUNT_DISTINCT function to produce a count which is different than the actual number of unique values produced by the
COUNT(DISTINCT) function. The code from Listing 2 produces the output found in Results 7.
As you can see by looking at Results 7, the
APPROX_COUNT_DISTINCT() function produces the same number as compared to the COUNT(DISTINCT) function when the number of unique values is low. In my case less than 1,000 unique values. But for each column that had more than 100 values, the
APPROX_COUNT_DISTINCT() function produced an approximate number was more or less than the actual number generated by the
COUNT(DISTINCT) function. This test shows that when the actual number of distinct values in a table increases, the
APPROX_COUNT_DISTINCT() functions don’t produce the same number of distinct values as the COUNT() function, however, it is close. Therefore, if your table contains a large number of distinct values, then expect this new function to produce a count value that is somewhat different than the actual number of distinct values.
Conclusions Based on My Initial Testing
From the few simple tests I ran here against my generated sample data, I have made the following observations:
APPROX_COUNT_DISTINCT()function does return a distinct count number faster than the
COUNT(DISTINCT)function requires more memory grants than the
- In one situation, I able to returns a unique count value that had more than a 2% error rate.
- If the number of distinct values for a given column is low, then the
APPROX_COUNT_DISTINCT()function returns the same count value as the
Should you start using the
APPROX_COUNT_DISTINCT()function? That’s a good question. I suppose it depends on your situation. I was pleasantly surprised at how much I was able to improve the runtime and reduce the memory footprint using the
APPROX_COUNT_DISTINCT() function over the
COUNT(DISTINCT) function. But I was little concerned how easy it was for me to retrieve an aggregated count value that had a precision error of more than 2%. My Visits2 table was very contrived but did expose the improvement and drawbacks of using the
APPROX_COUNT_DISTINCT() function. Therefore, before using this new function in a production environment, I would suggest developing a more exhaustive set of tests against real data to verify that any precision errors are within an acceptable range for each situation. As with any new feature, I suggest you identify what the new feature does and then test the heck out of it to verify it provides the improvements you desire in your environment.
Code Used to Create Sample Test Data
In Listing 3, you will find the code I used to create my contrived sample database (SampleData) and the table (Visits2) I used to support my testing. Keep in mind that the code in this listing is not very efficient at creating my sample data. It took over an hour on my laptop to create the billion-row table I used for testing.
-- Create Sample DB with SIMPLE recovery
CREATE DATABASE SampleData
CONTAINMENT = NONE
( NAME = N'SampleData', FILENAME = N'D:\SQL Server 2019\SampleData.mdf' , SIZE = 65GB, FILEGROWTH = 5GB)
( NAME = N'SampleData_log', FILENAME = N'D:\SQL Server 2019\SampleData_log.ldf' , SIZE = 5GB, FILEGROWTH = 5GB);
ALTER DATABASE SampleData SET RECOVERY SIMPLE;
-- Set database context
-- Create table to hold sample data
CREATE TABLE Visits2
-- Create Tally Table
CREATE VIEW vw_Tally AS
--Itzik style tally table
WITH lv0 AS (SELECT 0 g UNION ALL SELECT 0)
,lv1 AS (SELECT 0 g FROM lv0 a CROSS JOIN lv0 b) -- 4
,lv2 AS (SELECT 0 g FROM lv1 a CROSS JOIN lv1 b) -- 16
,lv3 AS (SELECT 0 g FROM lv2 a CROSS JOIN lv2 b) -- 256
,lv4 AS (SELECT 0 g FROM lv3 a CROSS JOIN lv3 b) -- 65,536
,lv5 AS (SELECT 0 g FROM lv4 a CROSS JOIN lv4 b) -- 4,294,967,296
,Tally (n) AS (SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM lv5)
SELECT TOP (1000000) n
ORDER BY n;
-- Populate Visits2 with sample data
SET NOCOUNT ON;
DECLARE @Max bigint = (select ISNULL(max(ID),0) From Visits2);
WHILE @Max < 1000000000 BEGIN
WITH TallyTable AS (
SELECT n + @Max as N,
CAST(RAND(CHECKSUM(NEWID())) * 255 as INT) + 1 AS A4,
CAST(RAND(CHECKSUM(NEWID())) * 255 as INT) + 1 AS A3,
CAST(RAND(CHECKSUM(NEWID())) * 255 as INT) + 1 AS A2,
1.0 + floor(1 * RAND(convert(varbinary, newid()))) AS A1,
DATEADD(DD, 1.0 + floor(62 * RAND(convert(varbinary, newid()))),'2018-07-01') AS VisitDate
INSERT INTO Visits2 (ID, I100, I1000, I10000, I100000, I1000000, I10000000, IP_Address, VisitDate)
SELECT n, n%100, n%1000, n%10000, n%100000, n%1000000, n%10000000,
CAST(A1 AS VARCHAR) + '.' + CAST(A2 AS VARCHAR) +
'.' + CAST(A3 AS VARCHAR) +
'.' + CAST(A4 AS VARCHAR), VisitDate
set @Max = (select ISNULL(max(ID),0) From Visits2);
-- CREATE UNIQUE CLUSTERED INDEX
CREATE CLUSTERED INDEX IX_SampleData_Visits2_ID ON dbo.Visits2(ID);