SQL Server Grouping Workbench

A gentle lesson about GROUP BY on the Nursery Slopes develops gradually into a wild ride off-piste amongst the pine-trees.

This Workbench is about using the GROUP BY clause in a SQL Statement. It is designed so it can be pasted straight into SSMS or the Query Analyser, though to run the examples, you’ll need the data file too. Just click the “CODE DOWNLOAD” link at the bottom of this article. Also included in the download bundle is the original SQL file, if you need it.

Contents

Introduction

Like all these workbench articles, the whole point is to try different things out, improve on the code, and learn in a practical way. If you’re like us, and find you can take in information only after real practice doing things for real then this article is for you.

If you think that it is too elementary, move to the end of the article and test out your knowledge by answering all the questions (Yes, the editor is giving away prizes for the best answers he receives. Insist on the glorious USB Red-Gate pendant).

Aggregations in SQL are very simple to do, but can cause a lot of confusion due to the word ‘Group’. The result you get from a grouping does not consist of the individual records but a collection or ‘aggregation’ of records.

A pack of playing cards is a group of 52 playing cards. If you were to group a pack of cards by colour, you’d have two ‘rows’ of 26 cards; if you grouped by suit, there would be four ‘rows’ with 13 in each. You could group them by their rank/court attribute, or in other ways. The point is that the result is always an aggregate, even if there is only one member in the collection.

Each row in the result of a GROUP BY represents a collection, or grouping. You can get the values for the number of records in each collection, or grouping, or the sum, average, and so on of numeric values, or the max or min of strings, but to ask for individual values makes no sense if you haven’t grouped by that value.

Sometimes, you will use an aggregate function in a SELECT Statement without using a GROUP BY clause. There is an implicit grouping of the entire result into one row, which is why you can only use aggregate functions in such a case, and will get one row returned.

GROUP BY statements are the bedrock of reporting in SQL.

The built-in aggregate functions are…

Common ones

Statistical functions

Checksum functions

(You can use sum and avg with numeric columns only – int, smallint, TinyIntBigint decimal, numeric, float, SmallMoney and money. You cannot use min and max with bit datatypes. You cannot use aggregate functions other than count(*) with text and image datatypes.)

We’ll start with some simple manoevres and then finish off with a few wheelies and handbrake-turns.

As sample data, we’ll take a list of all oil-producing countries in the world and their oil-production.

Create a new database or use your ‘practice’ development Database and then create this.

And we’ll get our data in. Download the data file, and load the data as follows:

Removing Duplicates

Before we get stuck into doing reporting, I’ll illustrate how simple it is to remove duplicates from tables using grouping.

Of course, if you are going to remove duplicates, you will really want to move them to another table for forensic analysis to find out how it happened and whether they are valid entries. Here is our Duplicates table:

Let’s maliciously create some duplicates!

You can then easily see these duplicates by:

Notice that we use a HAVING clause. This is because we are selecting only those rows that contain more than one row in the original table (we could have checked only part of the original table by using the WHERE clause. A WHERE clause cannot contain an aggregate function of course, but the HAVING clause can.

So we move them to another table to check them to make sure. We wrap this up in a transaction just in case anything goes wrong, as we don’t want to lose a record that might turn out to be legit.

Now look in the duplicate table:

And check the main table for duplicates with the SQL I’ve already given.

Simple Reporting

What about getting a summary of oil production per continent?

We are grouping by continent and get back seven records, each of which represents a collection of oil producers in the same continent.

…or by region ordered by production?

Grand totals are easy:

We used the grouping() function to order the total at the end we used WITH ROLLUP to compute the total. It can be very useful for running sums and running averages.

…but someone is bound to ask for just the top five:

…or maybe the bottom five!

Or, more tiresomely, for the top five, the others as an ‘Others’ row, and the sum total!

And then there will be the guy who wants a list of just those regions who produce more than the average.

Or possibly the five countries closest to the average:

Or conceivably, region closest to the average

Grouping and Pivot Tables

So let’s get a bit harder and do a pivot table relating regions and continents with column and row totals.

The ‘order by grouping’ trick ensures that the total comes in the right place on the last row!

Enumerating the Group members

This is all OK but a lot of people want, and expect, to have a list of all the constituents of their grouping in the result. They don’t like DBAs laughing and saying it isn’t possible. There are now CLR Aggregation routines that do it, but here is a method that works on SQL 2000. Not a cursor in sight! (it works a lot better in 2005 with VaRCHAR(MAX)).

Now this technique could get quite interesting if combined with ‘ranging’ where you can impose categories onto the data of your choice (Date ranges are very common).

Ranging

By ranging, I mean imposinbg aribrary value ranges, and grouping by them you can, of course, use a helper table to do this much more elegantly.

The code above is much more efficient than it looks, but why not make a User-defined function to do it?

So let’s combine ranging and enumeration!

We can list all the countries that fall in each range category….

Grouping as a utility

One can use grouping in a variety of ways that have nothing to do with reporting . Here is an example of using GROUP BY to help produce a table by splitting a delimited list. It needs the Number Helper Table (You’ll have to refer to the ‘Helper Table Workbench to find out about that).

Questions

Send your answers to editor@simple-talk.com. The first three correct entries will receive a much-coveted Simple-Talk goodie bag (polo shirt, USB key, bottle opener, beermat, pen).

  1. How would you get the count of the number of rows in a table with NULLS in a particular column, using GROUP BY, but without using a COALESCE or CASE statement?
  2. What is the GROUPING() function useful for?
  3. Can a WHERE clause contain an aggregate function?
  4. When would you need to use a HAVING clause?
  5. What does the ROLLUP do? How would you use it?
  6. Can you use UDFs (user-defined scalar-valued functions) in GROUP BY clauses?

See also other Workbenches at Simple-Talk

Robyn Page’s SQL Server DATE/TIME Workbench, Robyn Page Date calculation and formatting in SQL Server can be surprisingly tricky. Robyn Page’s “hands-on” workbench will lead you through the minefield.

Robyn Page’s SQL Server String Manipulation Workbench, Robyn Page String searching and manipulation in SQL Server can be error-prone and tedious…unless you’re armed with the techniques described in Robyn’s string manipulation workbench…

SQL Server Error Handling Workbench, Grant Fritchey Grant Fritchey steps into the workbench arena, with an example-fuelled examination of catching and gracefully handling errors in SQL 2000 and 2005, including worked examples of the new TRY..CATCH capabilities.

Robyn Page’s SQL Server Cursor Workbench, Robyn Page The topic of cursors is the ultimate “hot potato” in the world of SQL Server. Everyone has a view on when they should and mainly should not be used. By example and testing Robyn Page proves that, when handled with care, cursors are not necessarily a “bad thing”.

Robyn Page’s SQL Server Data Validation Workbench, Robyn Page Robyn Page provides essential techniques for ensuring the validity of the data being entered into your SQL Server tables.

Robyn Page’s Excel Workbench, Robyn Page and Phil Factor The need to produce Excel reports from SQL Server is very common. Here, Robyn Page and Phil Factor present practical techniques for creating and manipulating Excel spreadsheets from SQL Server, using linked servers and T-SQL. The pièce de résistance is a stored procedure that uses OLE Automation…

Robyn Page’s SQL Server Security Workbench, Robyn Page and Phil Factor Robyn Page and Phil Factor present practical T-SQL techniques for controlling access to sensitive information within the database, and preventing malicious SQL injection attacks.

SQL Server Helper Table Workbench, Robyn Page and Phil Factor Cursors and iterations are both renowned for slowing down Transact SQL code, but sometimes seem unavoidable. In this workbench, Robyn Page and Phil Factor demonstrate some set-based techniques for string manipulation and time interval-based reporting, which use helper tables instead.