Summarizing data using GROUP BY and HAVING clauses

You can summarize data to get counts, averages, sums, and more using GROUP BY in T-SQL queries. Greg Larsen shows you how.

SQL Server provides the GROUP BY clause as a means to take a set of detailed values and summarize them into aggregated values. The detailed records can be aggregated based on one column, as well as expressions, which are specified in the GROUP BY clause. This article explores how to return summarized data using the GROUP BY and HAVING clauses.

Test data

In order to demonstrate how to use the GROUP BY and HAVING clauses, some test data will be needed. All of the examples in this article will use a table called Invoice that is created in tempdb, using the code in Listing 1.

Listing 1: Code to create test data

The code in Listing 1 created nine different invoice records in the Invoice table. Each row contains a single invoice record for one of three different ProductName’s: Thingamajig, Watchamacallit, and Dohicky. Each row has an InvDT, and an InvAmt, where the InvDT identifies the date the product was sold, and the InvAmt identifies the amount the product sold for. Each product sells for a different price, but each sale of a specific product sells for the same amount. The PaymentDT and PaymentAmt identify the date and amount each payment was made. A couple of the invoices have not been paid, hence why their PaymentDT and PaymentAmt are null.

A few different examples will be provided that use the test data in the Invoice table to demonstrate how the GROUP BY and HAVING clauses work. If you would like to run the example code provided in this article, you can use the code in Listing 1 to create the Invoice table on your instance of SQL Server.

Summarizing based on a single column

This first example shows how to write a GROUP BY clause that aggregates the rows in the Invoice table based on a single column grouping specification. This example calculates the total and average invoice amount for each product identified in the Invoice table by using the code in Listing 2.

Listing 2: Grouping based on a single column

Report 1 shows the results when Listing 2 is executed.

Report 1: Results when Listing 2 is run

Image showing 3 rows. ProductName, TotalInvAmt, AvgInvAmt. Dohicky, 226.77, 75.599999; Thingamajig, 363.33, 121.110000, Watchamacallit, 477.48, 159.160000

In Listing 2, the GROUP BY clause identifies only a single column named ProductName. This told SQL Server to group the rows by ProductName when calculating the TotalInvAmt and the AvgInvAmt. The TotalInvAmt for each of the different products was calculated using the SUM function, whereas the AvgInvAmt was calculated using the AVG function. Since there were only three different products sold in the Invoice table, only three summarized amounts were returned, one for each ProductName.

When a SELECT statement contains a GROUP BY clause, the columns in the select list can only be those columns identified in the GROUP BY clause, or column values calculated using an aggregate function. If there is no GROUP BY clause, then only aggregate functions can be added to the SELECT list and ORDER BY clause. In this case, the query will return one row and provide summary calculations over the entire result set. If a non-group by column is identified in the select list or ORDER BY clause, and it is not contained in an aggregate function, an error similar to the error in Report 2 will be produced.

Report 2: Error received when a column listed that is not contained in an aggregate function

The code Listing 3 was run to generate the error in Report 2.

Listing 3: Code containing selection list column not contained in aggregate function

Returning a subset of summarized values

If only a subset of the summarized rows needs to be returned, then a WHERE constraint can be used to identify the subset. The code in Listing 4 uses the WHERE clause to only return the summarized values for the ProductName Dohicky.

Listing 4: Constraining result set using WHERE clause

When adding the WHERE constraint to a SQL statement that contains a GROUP BY clause, the WHERE clause needs to be placed prior to the GROUP BY clause. The results shown in Report 3 are produced when the code in Listing 4 is executed.

Report 3: Output produced when Listing 4 is executed

An image showing 1 ros. ProductName, TotalInvAmt, AvgInvAmt, Dohicky, 226.77, 75.590000

Grouping based on multiple columns

There are times when data summarization needs to produce summarized values with more granularity than a single column. When this is needed, multiple columns can be specified in the GROUP BY criteria, as shown in Listing 5.

Listing 5: Summarizing based on ProductName and InvDT

In Listing 5, two columns, ProductName and InvDT, were identified in the GROUP BY clause. This tells SQL Server to summarize the test data based on each unique combination of these two column values. Report 4 contains the output created when Listing 5 is executed.

Report 4: Output when Listing 5 is run

An image showing 6 rows. ProductName, InvDT, TotalInvAmt, AvgInvAmt;  Dohicky, 2021-02-14, 151.18, 75.590000; Thingamajig, 2021-02-14, 363.33, 121.110000; Watchamacallit, 2021-02-14, 159.16, 159.160000;  Watchamacallit, 2021-03-03, 159.16, 159.160000; Dohicky, 2021-03-11, 75.59, 75.590000;  Watchamacallit, 2021-03-11, 159.16,	159.160000

Sorting summarized data

You may have noticed that in the prior example, the output in Report 4 was not sorted by ProductName and InvDT, even though the data was grouped and summarized by ProductName and InvDT. The ordering of the output is based on the unique combination of the grouped columns, and not the individual values of the columns in the GROUP BY clause. The only way to guarantee the output’s order is to have an ORDER BY clause. The code in Listing 6 groups the data the same as in Listing 5, but the output is produced in a different order because an ORDER BY clause has been added in Listing 6.

Listing 6: Sorting summarized data

When using the ORDER BY clause, it must be specified after the GROUP BY clause. Report 5 shows the output produced when the code in Listing 6 is executed.

Report 5: Output with Listing 6 is run

Same rows returned as last image, but this time they are sorted by ProductName and InvDt

By using the ORDER BY clause, the output in Report 5 is sorted by ProductName and InvDT.

Using an expression in the GROUP BY clause

Expressions can also be used in the GROUP BY clause to identify the values to summarize on. To show how an expression can be used in the GROUP BY clause, look at the code in Listing 7.

Listing 7: Using Expression in GROUP BY clause

The code in Listing 7 summarizes the rows in the Invoice table based on the year and month of the invoice. The year and month values are calculated from the InvDT using the following expression:

CAST(YEAR(InvDT) as CHAR(4)) + ‘-‘ + RIGHT(MONTH(InvDT) + 100,2);

This expression was specified on the GROUP BY clause to identify the values for summarizing. Remember that any column specified in the selection list either needs to be used with an aggregate function or by one of the GROUP BY values. Therefore, the expression identified can be included in the selection list, so the year and month values could be displayed in the result set, as shown in Report 6.

Report 6: Output when Listing 7 is run

An image showing 2 rows returned. YearMonth, TotalInvAmt, AvgInvAmt; 2021-02, 673.67, 112.278333; 2021-03, 393.91, 131.303333

Grouping NULL values

When comparing two different null values by design are not equal, but when grouping on a column that contain null values, the nulls are grouped together as if they are equal. To demonstrate, run the code in Listing 8.

Listing 8: GROUP ON column with null values

Report 7 shows results when Listing 8 is run.

Report 7: Output when Listing 8 is executed

An image showing 6 rows returned. PaymentDT,	MinPaymentAmt, NULL,	NULL; 2021-02-21,	75.16; 2021-03-01,	121.11; 2021-03-03, 75.16; 2021-03-13, 159.16; 2021-03-30, 75.16

There is a row in Report 7 that has NULL for PaymentDT column. This row was created when grouping together the two rows that don’t have a PaymentDT value. The value of NULL under the MinPaymentAmt column occurred because null values are ignored for all aggregate functions, except the COUNT(*) function.

Because nulls are grouped together, and most aggregate functions ignore null values, summarizing results using a GROUP BY clause might create some summarized values that some might consider inaccurate. To see how aggregate functions ignore nulls, review the rows with null values in the sample data by running the code in Listing 9.

Listing 9: Displaying sample data

Report 8 shows the output created when Listing 9 is executed.

Report 8: Output when Listing 9 is run

An image showing 3 rows returned. ProductName, PaymentAmt; Watchmacallit, 159.16; Watchamacallit, NULL; Watchamacallit, NULL

In Report 8, the PaymentAmt is null for two of the rows in the result set. To demonstrate how aggregate functions ignore nulls, consider the code in Listing 10.

Listing 10: Calculating average

When Listing 10 is executed, it returns the value 159.160000 for the AvgPaymentAmt. Is that the correct average amount? If you expected the average to be calculated based on three rows with a ProductName of Watchamacallit, then it is not correct. Because SQL Server ignores nulls when using the AVG function, it doesn’t consider the two rows with nulls when determining the average amount. Since nulls are ignored when summarizing the data, the average value of 159.16 is correct.

Using the GROUP BY clause in conjunction with the AVG function to calculate the average for each ProductName also ignores the null values, as can be seen by reviewing the output created when the code in Listing 11 is executed.

Listing 11: Calculating the average for each ProductName

Report 9 shows out output when Listing 11 is run.

Report 9: Output when Listing 10 is executed

An image showing 3 rows returned. ProductName, AvgPaymentAmt; Dohicky, 75.160000; Thingamajig, 121.110000; Watchmacallit, 159.160000

If the rows with null values need to be considered when calculating the average, then the nulls need to be converted to 0.00 prior to the AVG function being processed, as is shown in Listing 12.

Listing 12: Considering rows with a null value when calculating average

Report 10 shows the output when Listing 12 is executed.

Report 10: Output when Listing 12 is run

An image showing 3 rows returned. ProductName, AvgPaymentAmt; Dohicky, 75.160000; Thingamajig, 121.110000; Watchamacallit, 53.053333

The code in Listing 12 used the ISNULL function to convert null values for the PaymentAmt column to 0.00 prior to the AVG function being applied to calculate the average amount. By doing this, the AVG function considered all rows with ProductName Watchamacallit rows when calculating the AvgPaymentAmt.

HAVING clause

The HAVING clause is used to filter out summarized rows created by the GROUP BY clause. The WHERE clause is similar to the HAVING clause. The difference is that the WHERE clause filters out rows prior to them being summarized, whereas the HAVING clause filters the rows after they have been summarized into the groups identified in the GROUP BY clause. To demonstrate how the HAVING clause filters out summarized rows, consider the code in Listing 13.

Listing 13: Using the HAVING clause

Report 11 shows output when Listing 13 is run.

Report 11: Output when Listing 13 is executed

An image showing 2 rows returned. ProductName, AvgPaymentAmt; Dohicky, 75.16000; Thingamajig, 121.110000

The HAVING clause in Listing 13 only returned summarized rows that had an AvgPaymentAmt greater than 60. By comparing Report 11 to Report 10, you can see the Whatchamacallit row was removed by the HAVING clause.

Summarizing data using GROUP BY and HAVING clauses

This article covered how to take detail data and roll it up to produce summarized data using the GROUP BY and HAVING clauses. Data can be summarized based on a single column or multiple columns, and even expressions can be used. Null values cannot be compared because they are unknown values. But when using the GROUP BY clause, null values are treated as if they are the same value for aggregation purposes. Using the GROUP BY clause does cause SQL Server to sort the detail rows to perform aggregation, but only by including the ORDER BY clause can the output order be guaranteed. Next time you need to summarize, filter, and display aggregated data, you will know how to use the GROUP BY and HAVING clauses in conjunction with the WHERE and ORDER BY clauses to produce the rolled-up data you require.

If you liked this article, you might also like  The ins and outs of joining tables in SQL Server .