SQL Server Grouping Workbench

This Workbench is about using the GROUP BY clause in a SQL Statement. It is designed so it can be pasted straight into SSMS or the Query Analyser, though to run the examples, you’ll need the data file too. Just click the “CODE DOWNLOAD” link at the bottom of this article. Also included in the download bundle is the original SQL file, if you need it.

Introduction
Removing Duplicates
Simple Reporting
Grouping and Pivot Tables
Enumerating the Group members
Ranging
Grouping as a utility
Questions

Introduction

Like all these workbench articles, the whole point is to try different things out, improve on the code, and learn in a practical way. If you’re like us, and find you can take in information only after real practice doing things for real then this article is for you.

If you think that it is too elementary, move to the end of the article and test out your knowledge by answering all the questions (Yes, the editor is giving away prizes for the best answers he receives. Insist on the glorious USB Red-Gate pendant).

Aggregations in SQL are very simple to do, but can cause a lot of confusion due to the word ‘Group’. The result you get from a grouping does not consist of the individual records but a collection or ‘aggregation’ of records.

A pack of playing cards is a group of 52 playing cards. If you were to group a pack of cards by colour, you’d have two ‘rows’ of 26 cards; if you grouped by suit, there would be four ‘rows’ with 13 in each. You could group them by their rank/court attribute, or in other ways. The point is that the result is always an aggregate, even if there is only one member in the collection.

Each row in the result of a GROUP BY represents a collection, or grouping. You can get the values for the number of records in each collection, or grouping, or the sum, average, and so on of numeric values, or the max or min of strings, but to ask for individual values makes no sense if you haven’t grouped by that value.

Sometimes, you will use an aggregate function in a SELECT Statement without using a GROUP BY clause. There is an implicit grouping of the entire result into one row, which is why you can only use aggregate functions in such a case, and will get one row returned.

GROUP BY statements are the bedrock of reporting in SQL.

The built-in aggregate functions are…

Common ones

sum([all|distinct] expr.) Total of the (distinct) values in the expression

avg([all|distinct] expr.) Average of the (distinct) values in the expression

count([all|distinct] expr.) Number of (distinct) non-null values in the expression

count(*) Number of selected rows

max(expr.) Highest value in the expression

min(expr.) Lowest value in the expression

Statistical functions

stdev(expr.) The standard deviation of the values in the expression

Stdevp(expr.) Population standard deviation of the expression values

Var(expr.) The Variance of the expression values

Varp(expr.) The population Variance of the expression values

Checksum functions

Binary_checksum(expr.) Returns the binary checksum value

Checksum(expr.) Returns the checksum value

Checksum(*) checksum is over all the columns of the table

Checksum_agg

([all|distinct] expr.) Returns the checksum of the NON-NULL values in a group.

(You can use sum and avg with numeric columns only – int, smallint, TinyIntBigint decimal, numeric, float, SmallMoney and money. You cannot use min and max with bit datatypes. You cannot use aggregate functions other than count(*) with text and image datatypes.)

We’ll start with some simple manoevres and then finish off with a few wheelies and handbrake-turns.

As sample data, we’ll take a list of all oil-producing countries in the world and their oil-production.

Create a new database or use your ‘practice’ development Database and then create this.

CREATE TABLE [dbo].[OilProducers](

[[OilProducer_ID] [int] IDENTITY(1,1) NOT NULL,

[country] [varchar](30) NOT NULL,

[BarrelsPerDay] [int] NOT NULL,

[Continent] [varchar](80) NOT NULL,

[Region] [varchar](80) NOT NULL

) ON [PRIMARY]

And we’ll get our data in. Download the data file, and load the data as follows:

-- substitute the full path to your file for the name 'OilProducers.txt'

BULK INSERT dbo.OilProducers

FROM ' OilProducers.txt'

/* or

Declare @Command varchar(8000)

select @Command='bcp '

+db_name()

+'.dbo.OilProducers in OilProducers.txt -c -S '+@@Servername+' -E -T'

execute xp_cmdshell @command--assumes Windows Security

Removing Duplicates

Before we get stuck into doing reporting, I’ll illustrate how simple it is to remove duplicates from tables using grouping.

Of course, if you are going to remove duplicates, you will really want to move them to another table for forensic analysis to find out how it happened and whether they are valid entries. Here is our Duplicates table:

CREATE TABLE [dbo].[Duplicates](

[OilProducer_ID] [int] IDENTITY(1,1) NOT NULL,

[Duplicate_ID] [int],

[country] [varchar](30) NOT NULL,

[BarrelsPerDay ] [int] NOT NULL,

[Continent] [varchar](80) NOT NULL,

[Region] [varchar](80) NOT NULL

) ON [PRIMARY]

Let’s maliciously create some duplicates!

INSERT INTO OilProducers(Country,BarrelsPerDay,continent, region)

SELECT Country,BarrelsPerDay,continent, region

FROM OilProducers WHERE country LIKE 'A%'

INSERT INTO OilProducers(Country,BarrelsPerDay ,continent, region)

SELECT Country,BarrelsPerDay,continent, region

FROM OilProducers WHERE country BETWEEN 'A' AND 'E'

You can then easily see these duplicates by:

SELECT 'First_ID'=MAX(OilProducer_ID), 'No.InTable'=COUNT(*)

FROM OilProducers

GROUP BY country

HAVING COUNT(*)>1

Notice that we use a HAVING clause. This is because we are selecting only those rows that contain more than one row in the original table (we could have checked only part of the original table by using the WHERE clause. A WHERE clause cannot contain an aggregate function of course, but the HAVING clause can.

So we move them to another table to check them to make sure. We wrap this up in a transaction just in case anything goes wrong, as we don’t want to lose a record that might turn out to be legit.

-- start of routine

DECLARE @found INT

DECLARE @Errorhappened INT

DECLARE @OtherErrorhappened INT

BEGIN TRANSACTION

WHILE COALESCE(@found,1)>0

AND COALESCE(@ErrorHappened,0) =0

AND COALESCE(@OtherErrorHappened,0)=0

BEGIN

INSERT INTO Duplicates (Duplicate_ID,country,BarrelsPerDay,continent,region)

SELECT OilProducer_ID,country,BarrelsPerDay,continent,region

FROM OilProducers WHERE oilProducer_ID IN (

SELECT MAX(OilProducer_ID)

FROM OilProducers

GROUP BY country

HAVING COUNT(*)>1)

SELECT @Errorhappened=@@Error, @found=@@Rowcount

DELETE FROM OilProducers WHERE oilProducer_ID IN (

SELECT MAX(OilProducer_ID)

FROM OilProducers

GROUP BY country

HAVING COUNT( *)>1)--the HAVING clause has to be used as

--we have to select rows after the

--aggregation

SELECT @OtherErrorHappened=@@Error

END

IF @errorHappened<>0 or @OtherErrorHappened <>0 ROLLBACK TRANSACTION

ELSE COMMIT TRANSACTION

--end of routine

Now look in the duplicate table:

1	SELECT * FROM Duplicates

And check the main table for duplicates with the SQL I’ve already given.

Simple Reporting

What about getting a summary of oil production per continent?

SELECT [Barrels per day ]=SUM(BarrelsPerDay), continent

FROM OilProducers

GROUP BY continent --each row should be a continent

ORDER BY SUM(BarrelsPerDay) DESC --in descending order

We are grouping by continent and get back seven records, each of which represents a collection of oil producers in the same continent.

…or by region ordered by production?

SELECT [production]=SUM(BarrelsPerDay),

[average]=AVG(BarrelsPerDay),

Region,

[ Producers in region]=COUNT(*)

FROM OilProducers

GROUP BY region --each row is a collection of all the records for a region

ORDER BY SUM(BarrelsPerDay) DESC --order by production

Grand totals are easy:

SELECT

[continent]=COALESCE(continent,'Total (world production)'),

[Barrels per day]=SUM(BarrelsPerDay)

FROM OilProducers

GROUP BY continent WITH rollup--each row should be a continent

ORDER BY grouping(continent) ASC,SUM(BarrelsPerDay) DESC

We used the grouping() function to order the total at the end we used WITH ROLLUP to compute the total. It can be very useful for running sums and running averages.

…but someone is bound to ask for just the top five:

SELECT TOP 5 [production]=SUM(BarrelsPerDay),

[average]=AVG(BarrelsPerDay), Region, [Producers in region]=COUNT(*)

FROM OilProducers

GROUP BY region --each row is a collection of all the records for a region

ORDER BY SUM(BarrelsPerDay) DESC --order by production

…or maybe the bottom five!

SELECT * FROM

(SELECT TOP 5 [production]=SUM(BarrelsPerDay), [average]=AVG(BarrelsPerDay),

Region, [Producers in region]=COUNT(*)

FROM OilProducers

GROUP BY region--each row is a collection of all the records for a region

ORDER BY SUM(BarrelsPerDay) ASC)f --order by production

ORDER BY production DESC

Or, more tiresomely, for the top five, the others as an ‘Others’ row, and the sum total!

SELECT

[region]=CASE WHEN g.n=1 THEN t ELSE Region END,

[production]=CASE WHEN t='other'

THEN production -

(SELECT SUM(PRODUCTION)

FROM

(SELECT TOP 5 [production]=SUM(BarrelsPerDay)

FROM OilProducers

GROUP BY region

ORDER BY SUM(BarrelsPerDay) DESC)F)

ELSE production END

FROM (SELECT [n]=0,[t]='',[TheOrder]=1 --trick to duplicate the null row the

UNION ALL --'to get 'others' row

SELECT 1,'Other',2

UNION ALL SELECT 1,'Total',3)g

INNER JOIN

(SELECT TOP 6 [production]=SUM(BarrelsPerDay),--the basic data

[n]=grouping(region),

Region

FROM OilProducers

GROUP BY region WITH rollup--

ORDER BY SUM(BarrelsPerDay) DESC

ON g.n=f.n

ORDER BY g.TheOrder ASC, production DESC

And then there will be the guy who wants a list of just those regions who produce more than the average.

SELECT Region, [production]=SUM(BarrelsPerDay), [Producers in region]=COUNT(*)

FROM OilProducers

GROUP BY region-- each row is a collection of all the records for a region

HAVING SUM(BarrelsPerDay)>

(SELECT AVG(average) FROM

(SELECT [average]=AVG(BarrelsPerDay) FROM OilProducers GROUP BY region)f)

ORDER BY SUM(BarrelsPerDay) DESC --order by production

Or possibly the five countries closest to the average:

SELECT TOP 5 country, BarrelsPerDay, theAverage

FROM OilProducers

CROSS JOIN

(SELECT [theAverage]=AVG(BarrelsPerDay) FROM OilProducers)f

ORDER BY ABS(barrelsPerDay-theaverage) ASC

Or conceivably, region closest to the average

SELECT TOP 5 * FROM

(SELECT region, [sumBPD]=SUM(BarrelsPerDay)--, theAverage

FROM OilProducers GROUP BY region

CROSS JOIN

(SELECT [theAverage]=AVG(production) FROM

(SELECT [production]=SUM(BarrelsPerDay)

FROM OilProducers GROUP BY region

ORDER BY ABS(sumBPD-theaverage) ASC

Grouping and Pivot Tables

So let’s get a bit harder and do a pivot table relating regions and continents with column and row totals.

SELECT [region/continent ]=CONVERT(CHAR(25),COALESCE(region,'Total')),

'Asia'=SUM(CASE WHEN continent = 'Asia' THEN BarrelsPerDay ELSE 0 END),

'North A.'=SUM(CASE WHEN continent ='North America'

THEN BarrelsPerDay ELSE 0 END) ,

'Latin A.'=SUM(CASE WHEN continent ='Latin America and the Caribbean'

THEN BarrelsPerDay ELSE 0 END),

'Africa'=SUM(CASE WHEN continent ='Africa' THEN BarrelsPerDay ELSE 0 END),

'Russia'=SUM(CASE WHEN continent ='Russia' THEN BarrelsPerDay ELSE 0 END),

'Europe'=SUM(CASE WHEN continent ='Europe' THEN BarrelsPerDay ELSE 0 END),

'Oceania'=SUM(CASE WHEN continent ='Oceania' THEN BarrelsPerDay ELSE 0 END),

'sum'=SUM(BarrelsPerDay)

FROM OilProducers

GROUP BY region

WITH rollup

ORDER BY grouping(region),SUM(BarrelsPerDay) DESC

The ‘order by grouping’ trick ensures that the total comes in the right place on the last row!

Enumerating the Group members

This is all OK but a lot of people want, and expect, to have a list of all the constituents of their grouping in the result. They don’t like DBAs laughing and saying it isn’t possible. There are now CLR Aggregation routines that do it, but here is a method that works on SQL 2000. Not a cursor in sight! (it works a lot better in 2005 with VaRCHAR(MAX)).

DECLARE @list VARCHAR(8000)

SELECT @List=COALESCE(@list+',','') +'|'+region +'|'+ country+'|'+region +'|'

FROM OilProducers ORDER BY region

SELECT [region]=CONVERT(CHAR(26),region),

[Oil-producing countries]=LEFT(members,CHARINDEX(',|',members+',|')-1) FROM

(SELECT region, 'members'=

REPLACE(

SUBSTRING(list,CHARINDEX('|'+regions.region+'|',list),8000),

'|'+region+'|',

'')

FROM (SELECT 'list'=@List)f

CROSS JOIN (

SELECT region FROM oilproducers GROUP BY region

)regions

Results (Truncated to the right)

region Oil-producing countries

-------------------------- ----------------------------------------------------...

Australia and New Zealand Australia,New Zealand

Caribbean Virgin Islands,Cuba,Trinidad and Tobago,Aruba,Barbad...

Central America Belize,Nicaragua,Guatemala,Mexico

Eastern Africa Zambia,Madagascar

Eastern Asia Korea, North,Mongolia,Taiwan,Korea, South,China,Japan

Eastern Europe Czech Republic,Poland,Belarus,Ukraine,Hungary,Romani...

Melanesia Papua New Guinea

Middle Africa Congo,Chad,Equatorial Guinea,Gabon,Angola,Cameroon

North America Canada, United States

Northern Africa Algeria,Egypt,Libya,Tunisia,Sudan,Morocco

Northern Europe Finland,Sweden,Lithuania,Estonia,United Kingdom,Norw...

Russia Russia

South America Ecuador,Argentina,Brazil,Colombia,Venezuela,Peru,Sur...

South-central Asia Kyrgyzstan,Bangladesh,Pakistan,Tajikistan,Turkmenist...

South-eastern Asia Malaysia,Indonesia,Brunei,Vietnam,Thailand,Singapore...

Southern Africa South Africa

Southern Europe Italy,Croatia,Spain,Greece,Albania,Slovenia

Western Africa Sierra Leone,Ghana,Nigeria,Mauritania,Cote d'Ivoire

Western Asia Turkey,Bahrain,Yemen,Syria,United Arab Emirates,Saud...

Western Europe Switzerland,Belgium,Germany,Netherlands,France,Austr...

Now this technique could get quite interesting if combined with ‘ranging’ where you can impose categories onto the data of your choice (Date ranges are very common).

Ranging

By ranging, I mean imposinbg aribrary value ranges, and grouping by them you can, of course, use a helper table to do this much more elegantly.

-- Ranging can be by column

SELECT

[range]='No. of Countries',

[< 10,000 bpd]=SUM(CASE WHEN Barrelsperday<10000 THEN 1 ELSE 0 END),

[10,000 to 100,000]=SUM(CASE WHEN Barrelsperday BETWEEN 10000 AND 100000

THEN 1 ELSE 0

END),

[100,000 to 1m bpd]=SUM(CASE WHEN Barrelsperday BETWEEN 100001 AND 1000000

THEN 1 ELSE 0

END),

[1m to 10m bpd]=SUM(CASE WHEN Barrelsperday BETWEEN 1000001 AND 10000000

THEN 1 ELSE 0

END),

[more than 10m bpd]=SUM(CASE WHEN Barrelsperday > 10000000 THEN 1 ELSE 0 END)

FROM OilProducers

--or by row...

SELECT [category]=CASE WHEN Barrelsperday<10000 THEN '1/ > 10,000 bpd'

WHEN Barrelsperday BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000 '

WHEN Barrelsperday BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'

WHEN Barrelsperday BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'

WHEN Barrelsperday > 10000000 THEN '5/ more than 10m bpd' END,

[total]=SUM(BarrelsperDay),

[No. of Countries]=COUNT(*)

FROM OilProducers

GROUP BY CASE WHEN Barrelsperday<10000 THEN '1/ > 10,000 bpd'

WHEN Barrelsperday BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000'

WHEN Barrelsperday BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'

WHEN Barrelsperday BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'

WHEN Barrelsperday > 10000000 THEN '5/ more than 10m bpd' END

The code above is much more efficient than it looks, but why not make a User-defined function to do it?

CREATE FUNCTION OilOutputCategory

(

@OilOutput INT

)

RETURNS VARCHAR(30)

BEGIN

RETURN CASE WHEN @OilOutput<10000 THEN '1/ > 10,000 bpd'

WHEN @OilOutput BETWEEN 10000 AND 100000 THEN '2/ 10,000 to 100,000'

WHEN @OilOutput BETWEEN 100001 AND 1000000 THEN '3/ 100,000 to 1m bpd'

WHEN @OilOutput BETWEEN 1000001 AND 10000000 THEN '4/ 1m to 10m bpd'

WHEN @OilOutput > 10000000 THEN '5/ more than 10m bpd' END

END

--so the query becomes...

SELECT [category]=dbo.OilOutputCategory(BarrelsPerDay),

[total]=SUM(BarrelsperDay),

[producers]=COUNT(*)

FROM OilProducers

GROUP BY dbo.OilOutputCategory(BarrelsPerDay)

So let’s combine ranging and enumeration!

We can list all the countries that fall in each range category….

DECLARE @list VARCHAR( 8000)

SELECT @List=COALESCE(@list+',','')+catfudge

FROM (--the concatenation of the string didn't work until we did this!

SELECT TOP 1000 [catfudge]=

('|'+dbo.OilOutputCategory(BarrelsPerDay) +'|'+ country+'|'

+dbo.OilOutputCategory(BarrelsPerDay) +'|'),

[cat]=dbo.OilOutputCategory(BarrelsPerDay)

FROM OilProducers ORDER BY cat)f

SELECT [category]=CONVERT(CHAR(26),category),

[Oil-producing countries]=LEFT(members,CHARINDEX(',|',members+',|')-1) FROM

(SELECT category, 'members'=

REPLACE(

SUBSTRING(list,CHARINDEX('|'+ categories.category+'|',list),8000),

'|'+categories.category+'|',

'')

FROM (SELECT 'list'=@List)f

CROSS JOIN (

SELECT [category]=dbo.OilOutputCategory(BarrelsPerDay) FROM oilproducers

GROUP BY dbo.OilOutputCategory(BarrelsPerDay)

)categories

Results (Truncated to the right)

category Oil-producing countries

-------------------------- -------------------------------------------------...

1/ > 10,000 bpd Singapore,Burma,Finland,Taiwan,Ghana,Korea, South,...

2/ 10,000 to 100,000 Tunisia,Netherlands,Ukraine,Mauritania,France,Cuba...

3/ 100,000 to 1m bpd Syria,Sudan,Oman,Qatar,Romania,Italy,Malaysia,Chad...

4/ 1m to 10m bpd Venezuela,United Arab Emirates,United Kingdom,Indo...

5/ more than 10m bpd Saudi Arabia

Grouping as a utility

One can use grouping in a variety of ways that have nothing to do with reporting . Here is an example of using GROUP BY to help produce a table by splitting a delimited list. It needs the Number Helper Table (You’ll have to refer to the ‘Helper Table Workbench to find out about that).

CREATE FUNCTION [dbo].[uftSplitStringGroup]

(

@DelimitedList VARCHAR (8000),

@Delimiter VARCHAR(20)=','

)

RETURNS

@Results TABLE

(

SeqNo INT IDENTITY(1, 1),

Item VARCHAR(8000)

)

BEGIN

DECLARE @String VARCHAR(8000)

DECLARE @LenDelimiter INT

SELECT @String=@Delimiter+@DelimitedList+@Delimiter,

@LenDelimiter=LEN( @Delimiter)

INSERT INTO @results (item)

SELECT SUBSTRING(@string,

s1.number+@LenDelimiter,

MIN(s2.number)-s1.number-@LenDelimiter)

FROM (SELECT [string]= @String )f

CROSS JOIN numbers s1

INNER JOIN numbers s2 ON s1.number< s2.number

WHERE s1.number<=LEN (@string)

AND s2.number<=LEN (@string)

AND SUBSTRING(@string,s1.number,@LenDelimiter)=@Delimiter

AND SUBSTRING(@string,s2.number,@LenDelimiter)=@Delimiter

GROUP BY s1.number

RETURN

END

select * from uftSplitStringGroup('first,second,third',default)

select * from uftSplitStringGroup('<>this is something<>something else<>','<>')

select * from uftSplitStringGroup(

'Turkey,Bahrain,Yemen,Syria,United Arab Emirates,Saudi Arabia,Qatar,Kuwait',',')

Questions

Send your answers to editor@simple-talk.com. The first three correct entries will receive a much-coveted Simple-Talk goodie bag (polo shirt, USB key, bottle opener, beermat, pen).

How would you get the count of the number of rows in a table with NULLS in a particular column, using GROUP BY, but without using a COALESCE or CASE statement?
What is the GROUPING() function useful for?
Can a WHERE clause contain an aggregate function?
When would you need to use a HAVING clause?
What does the ROLLUP do? How would you use it?
Can you use UDFs (user-defined scalar-valued functions) in GROUP BY clauses?

Register for Simple Talk

SQL Server Grouping Workbench

Article tags

About the author

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

About the author

SQL Server Matrix Workbench

Database Design Workbench – Keys

NET Performance Cribsheet

SQL Server Grouping Workbench

Contents

Introduction

Removing Duplicates

Simple Reporting

Grouping and Pivot Tables

Enumerating the Group members

Ranging

Grouping as a utility

Questions

See also other Workbenches at Simple-Talk

Article tags

Recommended

Subscribe for more

About the author

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

About the author

SQL Server Matrix Workbench

Database Design Workbench – Keys

NET Performance Cribsheet