SQL GROUP BY Basics

It sometimes pays to go back and look at what you think you already know about SQL. Joe Celko gives a quick revision of the GROUP BY and HAVING clauses in SQL that are the bedrock of any sort of analysis of data, and comes up with some nuggets that may not be entirely obvious

The GROUP BY and HAVING clauses in SQL goes all the way back to the start of the language, in its simplest form. The original idea was that you would build a working table using the SELECT ... FROM ... WHERE clauses, and then partition that result set into groups. A group is defined as a set of rows with the same values for the grouping columns. The non-grouping columns had to be group properties (usually aggregate functions), formulas or computations done on these or constants (which is technically a group property).

Each group is then reduced down to a single row. The important concept here is the original table no longer exists. However, if you are old enough to have to work with the very early versions of SQL Server, the Sybase syntax did not get rid of the original table. This led to some problems. It exposed the original row data with the aggregated data.

The original HAVING clause was then applied to each of the rows in the new grouped table. This meant that it was done after the WHERE clause that built the original table.

Later we added Windows clauses, extensions to the GROUP BY in the form of ROLLUP, CUBE and GROUPING SETS. But there is still quite a bit you can do with the original simple GROUP BYHAVING constructs. They have the advantage of completing their work in a single table scan, while the newer constructs may need temporary storage or multiple passes through the data. The originals are portable and predictable. They are now well-optimized after all these decades.

 Basic Aggregate Functions

In the original SQL, we had a limited set of simple aggregate functions. Their general format was ‘<function name> ([DISTINCT| ALL] <expression>)“. The first thing these functions do is to remove all of the NULLs from the data. The function could be applied to an empty set, if only NULLs were returned. What is the sum or average of an empty set? Following the usual conventions of SQL, empty sets return NULL, not zero. Zero is an actual value; that was a very big debate in the Middle Ages, when Europeans started seeing zero showing up in those newfangled Hindu Arabic numerals

SQL has a convention that we can put keywords inside the parameters  in some of the intrinsic system functions. The aggregate functions have the option of DISTINCT or ALL; if nothing is given, then ALL is assumed.

SUM ([DISTINCT| ALL] <numeric expression>)
= arithmetic total of a numeric expression.

AVG ([DISTINCT| ALL] <numeric expression>)
= arithmetic average or mean of a numeric expression.

 MIN ([DISTINCT| ALL] <expression>)
= depending on the expression, the smallest numeric value, the least recent temporal value, or the lowest collation value in a string.

 MAX ([DISTINCT| ALL] <expression>)
= depending on the expression, the greatest numeric value, the temporal value for the wrist in the future, or the highest collation value in a string.

 COUNT ([DISTINCT| ALL] <expression>)
= the cardinality of the set of expression values, regardless of type.

 COUNT ([DISTINCT| ALL] *)
= the cardinality of the table, taken as a whole set.

The DISTINCT keyword says remove all redundant duplicates (i.e. keep one copy) from the set of expressions being aggregated. The ALL keyword says to retain all the rows (this is like UNION versus UNION ALL and so forth in other parts of the language). In practice, nobody actually uses the ALL option; in fact, a lot of the newer SQL programmers do not even know it exists.

It may come as a surprise that <expression> as a parameter to an aggregate function does not have to be just a simple column name. However, it does have some limitations, based on scoping. This is why you cannot nest simple aggregate functions inside each other. Think about SUM (AVG (x)) and what it means. First, we would compute the average. It would be a single value (or a NULL). The sum of a single value is just that single value (or a NULL).

Also, the expression has to be of an appropriate data type. The SUM () and AVG () are clearly numeric. It would not make much sense to take an average of a result set of dates or strings, would it? However, the scale and precision of the results are implementation defined in ANSI/ISO standards. You need to know your product; Oracle, DB2, and SQL Server are not the same in the way they handle this.

The MIN () and MAX () are called the ‘extrema’ functions. They work for all data types in SQL but with some caveats. Technically, the DISTINCT and ALL options exist, but DISTINCT is insanely redundant. If you use it, I am pretty sure any optimizer will catch it. As an aside, when the MIN () or MAX () is computed, the optimizer will take the first value as the max (or min) and compare to the following values; if it finds a greater (or lesser) value, that becomes the new max (or min). This is fast and simple scan. But you will see SQL written with ROW_NUMBER() OVER(..) followed by filters for a row number equal to one to do this same task! All that does is force a useless sort on the data,

Numeric values have a well-defined order, temporal data has a well-defined order, but strings depend on their collation to determine their order. Even the same language can have several collations; German is one example. Even worse, in SQL varying length strings have to be padded out with trailing blanks before they are compared. Thus, ‘Joe’, ‘Joe ‘ and ‘Joe ‘ are all equal. But which one represents that group?

This is why we like to design encodings that use only the basic ASCII subset that ISO requires for all Unicode languages. But that is another topic.

As usual, NULLs pose their own problem. Technically a NULL is not equal anything, not even another NULL. Following that model, each NULL in the result set should form its own group. We actually debated that in ANSI X3H2 in the early days.

The decision was to have both equality (=) and grouping. When grouping is used, all NULLs are treated as a single group. Well, it is technically called an equivalence class for anybody that had a course in set theory. This was also one reason that we dropped NULLs from the result set. This convention is actually pretty handy in practice.  

Group Characteristics

It is often useful to look at groupings that have some characteristic in common without actually looking at the particulars. This is done in the HAVING clause, which deals with the group characteristics. Doing the work in the HAVING clause can save complicated joins, CASE expressions and subqueries.

The most useful tools are usually the simplest. The SIGN (), properly called the ‘signum’, the absolute value, ABS (), and simple math are probably the most useful. Here are some examples:

We know <expression> has all distinct values, otherwise the left hand side would be smaller than the right hand side count.

There are no NULLs in the results. The cardinality on the left side would count them, but the right hand side would not.

This tell us that the group has only one value. This is actually quite handy. Variations on this idea are:

We now know the <expression> is all positive, all negative or all zero values.

Tells us we have at least one zero.

 Programming Technique For Grouping

 The basic programming technique for grouping is fairly simple and a recipe to follow. First, write CTE or select statement that gets the groups. That sounds kind of obvious, but it is a little more complicated. Most of the time, the grouping attributes are in the original set.

The common example would be geographic codes, such as state codes. These are nice discrete values that make obvious groups. They are probably well understood by the people using the data.

 Sometimes the groups may have to be computed. For example, if you have information about a birthdate, you might want to partition them into age groups (0 to 10, 11 to 14, 15 to 18, 18 to 25 and finally the horrible 65+ years of age). This still gives you nice sharp categories. However, you now have to go to a join on the table with the range of each group to get your working table

However, if we use approximate numeric values for the groups got all kinds of worries about rounding errors and things on the boundaries. This is the worst possible situation.

Once you decide what your groups are, we go to step number two. This is simply do the group by. Technically we do not actually need a GROUP BY clause. If it is not given, then the entire result set is treated as if it is a single group. Originally the group by was a simple list of columns from the query. The columns in this list did not have to appear in the SELECT statement (this was a myth from the very early days of SQL).

The result of the SELECT.. FROM.. WHERE.. clauses is partitioned into disjoint subsets, which all have the same value in each column or are all NULLs. This is what set theory calls an equivalence class and there is lots of them. The mod (n) function is another example, which gives us groups defined by 0, 1, 2, .., (n -1) values.

At one point in the history of the ANSI/ISO standards, we got interested in OLAP processing and added cubes, rollups, grouping sets and a grand total. All of these can be defined using the GROUPING SETS construct, but it actually gets pretty awkward to do it that way.

 HAVING Clause

Finally, we worry about using the having clause which actually can in some ways be the simplest of all of this. It is really just a WHERE clause that applies to the grouped table. We are back to simple search conditions again!

We do not actually need the HAVING clause; if it is not there. We simply stop the point we have done the group by. But this is where we get some real power to pick out grouped character grouping characteristics.

 Example Queries

A simple example of this technique will give you an idea how it works. Here a skeleton table for the example.

Our first problem is to find the salesman who sold all three products

This will give us ‘Mary’ and “Fred’ as the salesman who sold the whole catalog. Now, tell me if we have a salesman who has sold only one product. We do not care which product, we are looking for a specialized guy.

But why do it this way? Why not use “HAVING COUNT (DISTINCT product_name) = 1” instead?  This would look like the first query. The optimizer has to do a scan to count the product names, but the table statistics in every SQL compiler I know about keeps at least the min, max, and row count for computing execution plans. Once I see the min (or max) value for that column, I know that I have found the min (or max) value for each salesman and can ignore the rest of his values.

 Try another one: Do we have salesmen who sold over $100.00 or products? The obvious way is:

We get ‘Fred’, ‘Vlad’ and  ‘John’ as the result set. But there is an alternative.

In this case, this trick is probably not worth it. The next example applies the techniques in a more complex situation. This example was actually taken from a posting on a SQL forum. Yes, it is a skeleton, so be kind. However, it can be used to demonstrate how to get things into a form where we can use some of the tricks with the GROUP BY.

What I want is the latest creation_date for a given name where the assembler_flag is zero or if there is no zero assembler_flag for the name then the latest creation_date for assembler_flag = 1.

 The results are summarized here.

Albert

‘2015-01-02’

all assembler_flags = 1

 1

Bob

‘2015-01-01’

all assembler_flags = 0

 0

Chuck

‘2015-02-02’

mixed assembler_flags

 0

 I named the column “assembler_flag” to stress that we do not use BIT flags in SQL; that was assembler language and this is a predicate based language. But did you notice that Bob and Chuck are treated the same way? We lost information about Chuck having mixed flags. We cannot handle NULL flags in this query, either. We also spent time partitioning, sorting and materializing a row number.

 The usual trick for finding if a grouping has only one value is to use “HAVING (MIN (assembler_flag) = MAX (assembler_flg))“, but SQL Server gives an error (Msg 8117, Level 16, State 1, Line 6; Operand data type bit is invalid for min operator)!  This is yet another reason not to write 1960’s assembly flag code in SQL. The BIT data type in SQL Server was originally a true BIT (0, 1), then it became an exact numeric data type (0, 1, NULL) in later releases. But it does not do aggregate functions the way you might expect.

 We still have to handle NULLs without business rules from the poster. Let’s assume that a NULL becomes zero. This gives us the first hack at an answer, with a CTE to handle the NULL problem.

This will give you Albert and Chuck (each has only one flag value), but not Bob (mixed flags). Back to the drawing board! Well, keyboard.

The result can be summarized like this:

Albert

‘2015-01-02’

all assembler_flags = 1

Bob

‘2015-01-01’

all assembler_flags = 0

Chuck

 ‘2015-02-02’

mixed assembler_flag = 0

 The important thing to notice in this example is that the CASE expression WHEN clauses have to be at the same level of aggregation as the groups. That is why I am using SUM() and COUNT() in the when clauses. Trivially, constants like ‘mixed’ are considered to be at all levels of aggregation, as are computations done on the aggregates.