The GROUP BY and HAVING clauses in SQL go all the way back to the start of the language, in its simplest form. The original idea was that you would build a working table using the SELECT ... FROM ... WHERE clauses, and then partition that result set into groups. A group is defined as a set of rows with the same values for the grouping columns. The non-grouping columns had to be group properties: aggregate functions, formulas or computations done on those aggregates, or constants (which are technically group properties).

Each group is then reduced to a single row. The important concept here is that the original table no longer exists. However, if you are old enough to have worked with the very early versions of SQL Server, the Sybase syntax did not get rid of the original table. This led to some problems, because it exposed the original row data alongside the aggregated data.

The original HAVING clause was then applied to each of the rows in the new grouped table. This meant that it was evaluated after the WHERE clause that built the original table.
Later, we added window functions and extensions to the GROUP BY in the form of ROLLUP, CUBE and GROUPING SETS. But there is still quite a bit you can do with the original simple GROUP BY and HAVING constructs. They have the advantage of completing their work in a single table scan, while the newer constructs may need temporary storage or multiple passes through the data. The originals are portable and predictable, and they are now well optimized after all these decades.
Basic Aggregate Functions
In the original SQL, we had a limited set of simple aggregate functions. Their general format was <function name> ([DISTINCT | ALL] <expression>). The first thing these functions do is remove all of the NULLs from the data, so the function can end up being applied to an empty set if the column held nothing but NULLs. What is the sum or average of an empty set? Following the usual conventions of SQL, aggregates over an empty set return NULL, not zero. Zero is an actual value; that was a very big debate in the Middle Ages, when Europeans started seeing zero showing up in those newfangled Hindu-Arabic numerals.

SQL has a convention that we can put keywords inside the parameters of some of the intrinsic system functions. The aggregate functions have the option of DISTINCT or ALL; if nothing is given, then ALL is assumed.
SUM ([DISTINCT | ALL] <numeric expression>) = arithmetic total of a numeric expression.

AVG ([DISTINCT | ALL] <numeric expression>) = arithmetic average or mean of a numeric expression.

MIN ([DISTINCT | ALL] <expression>) = depending on the expression, the smallest numeric value, the least recent temporal value, or the lowest collation value of a string.

MAX ([DISTINCT | ALL] <expression>) = depending on the expression, the greatest numeric value, the temporal value furthest in the future, or the highest collation value of a string.

COUNT ([DISTINCT | ALL] <expression>) = the cardinality of the set of expression values, regardless of type.

COUNT (*) = the cardinality of the table, taken as a whole set.
The DISTINCT keyword says to remove all redundant duplicates (i.e., keep one copy of each value) from the set of expressions being aggregated. The ALL keyword says to retain all the rows (this is like UNION versus UNION ALL and so forth in other parts of the language). In practice, nobody actually uses the ALL option; in fact, a lot of the newer SQL programmers do not even know it exists.
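A quick sketch of the differences, using a hypothetical Payments table whose customer_id column allows NULLs:

SELECT COUNT(customer_id) AS non_null_cnt,          -- ALL is the default; NULLs are dropped
       COUNT(ALL customer_id) AS same_as_above,     -- identical to the previous column
       COUNT(DISTINCT customer_id) AS distinct_cnt, -- one copy of each non-NULL value
       COUNT(*) AS row_cnt                          -- every row, NULLs and all
  FROM Payments;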
It may come as a surprise that the <expression> passed to an aggregate function does not have to be just a simple column name. However, it does have some limitations, based on scoping. This is why you cannot nest simple aggregate functions inside each other. Think about SUM (AVG (x)) and what it would mean. First, we would compute the average; it would be a single value (or a NULL). The sum of a single value is just that single value (or a NULL), so the nesting buys you nothing.
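If you really want an aggregate of aggregates, you do it in two steps with a derived table or CTE. Here is a sketch using the Sales table defined later in this article:

-- The inner query computes one average per salesman; the outer query
-- then sums those per-group averages.
SELECT SUM(T.avg_sale) AS sum_of_averages
  FROM (SELECT salesman_name, AVG(sale_amt) AS avg_sale
          FROM Sales
         GROUP BY salesman_name) AS T;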
Also, the expression has to be of an appropriate data type. SUM() and AVG() are clearly numeric; it would not make much sense to take an average of a result set of dates or strings, would it? However, the scale and precision of the results are implementation-defined in the ANSI/ISO standards. You need to know your product; Oracle, DB2, and SQL Server do not handle this the same way.
The MIN() and MAX() functions are called the 'extrema' functions. They work for all data types in SQL, but with some caveats. Technically, the DISTINCT and ALL options exist, but DISTINCT is insanely redundant; if you use it, I am pretty sure any optimizer will catch it. As an aside, when the MIN() or MAX() is computed, the engine can take the first value as the current maximum (or minimum) and compare it to the following values; if it finds a greater (or lesser) value, that becomes the new maximum (or minimum). This is a fast, simple scan. But you will see SQL written with ROW_NUMBER() OVER(..) followed by a filter for the row number equal to one to do this same task! All that does is force a useless sort on the data.
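To make that concrete (a sketch using the Sales table defined later in this article), these two queries both return the most recent sale date per salesman; the first needs only a scan and a grouped aggregate, while the second forces the partitioning and sorting needed to assign the row numbers:

-- Simple aggregate: one pass over the data.
SELECT salesman_name, MAX(sale_date) AS last_sale_date
  FROM Sales
 GROUP BY salesman_name;

-- Window-function version: same answer, extra sorting work.
SELECT salesman_name, sale_date AS last_sale_date
  FROM (SELECT salesman_name, sale_date,
               ROW_NUMBER() OVER (PARTITION BY salesman_name
                                      ORDER BY sale_date DESC) AS rn
          FROM Sales) AS T
 WHERE T.rn = 1;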
Numeric values have a well-defined order, and temporal data has a well-defined order, but strings depend on their collation to determine their order. Even the same language can have several collations; German is one example. Even worse, in SQL, varying-length strings have to be padded out with trailing blanks before they are compared. Thus, 'Joe', 'Joe ' and 'Joe  ' are all equal. But which one represents the group? This is why we like to design encodings that use only the basic ASCII subset that ISO requires for all Unicode languages. But that is another topic.
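Here is a minimal sketch of the padding problem, using a hypothetical Customers table. Under a collation with the PAD SPACE behavior, the three spellings of 'Joe' compare equal and therefore fall into a single group; which padded form is displayed for that group is implementation-dependent:

SELECT customer_name, COUNT(*) AS name_cnt
  FROM Customers
 GROUP BY customer_name;   -- 'Joe', 'Joe ' and 'Joe  ' land in one group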
As usual, NULLs pose their own problem. Technically, a NULL is not equal to anything, not even another NULL. Following that model, each NULL in the result set should form its own group. We actually debated that in ANSI X3H2 in the early days.

The decision was to have both equality (=) and grouping. When grouping is used, all NULLs are treated as a single group; technically, it is called an equivalence class, for anybody who has had a course in set theory. This was also one reason that we dropped NULLs from the result set. This convention is actually pretty handy in practice.
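A small sketch with a hypothetical Orders table: every row with a NULL ship_date lands in one group, rather than each NULL forming a group of its own:

SELECT ship_date, COUNT(*) AS order_cnt
  FROM Orders
 GROUP BY ship_date;   -- all the NULL ship_dates come back as a single row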
Group Characteristics
It is often useful to look at groupings that have some characteristic in common without actually looking at the particulars. This is done in the HAVING clause, which deals with the group characteristics. Doing the work in the HAVING clause can save complicated joins, CASE expressions and subqueries.
The most useful tools are usually the simplest: SIGN(), properly called the signum function, the absolute value ABS(), and simple arithmetic. Here are some examples:

HAVING COUNT (DISTINCT <expression>) = COUNT (<expression>)

We know <expression> has all distinct values; otherwise the left-hand side would be smaller than the right-hand side.

HAVING COUNT (*) = COUNT (<expression>);

There are no NULLs in the results. The count on the left side includes every row, NULLs and all, but the right-hand side drops the NULLs.

HAVING MIN (<expression>) = MAX (<expression>)

This tells us that the group has only one value, which is actually quite handy. Variations on this idea are:

HAVING MIN (SIGN (<expression>)) = MAX (SIGN (<expression>))

We now know the <expression> values are all positive, all negative or all zero.

HAVING MIN (ABS (<expression>)) = 0;

This tells us we have at least one zero.
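As a concrete illustration (a sketch using the Sales table defined later in this article), this query keeps only the salesmen whose sale amounts are all strictly positive. With the sample data, that is 'Fred' and 'Vlad', since the other two each have a 0.00 sale:

SELECT salesman_name
  FROM Sales
 GROUP BY salesman_name
HAVING MIN(SIGN(sale_amt)) = 1;   -- no zero (or negative) amounts in the group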
Programming Technique For Grouping
The basic programming technique for grouping is fairly simple; it is a recipe you can follow. First, write a CTE or SELECT statement that gets the groups. That sounds kind of obvious, but it is a little more complicated than it looks. Most of the time, the grouping attributes are already in the original set.
The common example would be geographic codes, such as state codes. These are nice discrete values that make obvious groups. They are probably well understood by the people using the data.
Sometimes the groups may have to be computed. For example, if you have information about a birthdate, you might want to partition people into age groups (0 to 10, 11 to 14, 15 to 18, 19 to 25, and so on up to the horrible 65+ years of age). This still gives you nice, sharp categories. However, you now have to join the table with the range of each group to get your working table, along the lines of the sketch below.
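This is a minimal sketch of the range-table approach; the Age_Groups lookup table and the Personnel table with its emp_age column are hypothetical names for illustration:

CREATE TABLE Age_Groups
(age_group_name VARCHAR(15) NOT NULL PRIMARY KEY,
 low_age INTEGER NOT NULL,
 high_age INTEGER NOT NULL);

-- Each person falls into exactly one range, so the ranges become the groups.
SELECT G.age_group_name, COUNT(*) AS headcount
  FROM Personnel AS P
       JOIN Age_Groups AS G
         ON P.emp_age BETWEEN G.low_age AND G.high_age
 GROUP BY G.age_group_name;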
However, if we use approximate numeric values for the group boundaries, we get all kinds of worries about rounding errors and values that fall right on the boundaries. This is the worst possible situation.
Once you decide what your groups are, we go to step two: do the GROUP BY. Technically, we do not actually need a GROUP BY clause; if it is not given, then the entire result set is treated as a single group. Originally, the GROUP BY was a simple list of columns from the query. The columns in this list did not have to appear in the SELECT clause (requiring that was a myth from the very early days of SQL).
The result of the SELECT .. FROM .. WHERE .. clauses is partitioned into disjoint subsets whose rows all have the same value in each grouping column, or are all NULLs. These subsets are what set theory calls equivalence classes, and there are lots of them. The MOD(n) function is another example; it gives us groups defined by the values 0, 1, 2, .., (n - 1).
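A small sketch of grouping on a computed value, with a hypothetical Orders table that has an integer order_nbr column (standard SQL spells the function MOD(); SQL Server spells it with the % operator):

SELECT MOD(order_nbr, 3) AS bucket, COUNT(*) AS orders_in_bucket
  FROM Orders
 GROUP BY MOD(order_nbr, 3);   -- three equivalence classes: 0, 1 and 2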
At one point in the history of the ANSI/ISO standards, we got interested in OLAP processing and added CUBE, ROLLUP, GROUPING SETS and a grand total. All of these can be defined using the GROUPING SETS construct, but it actually gets pretty awkward to do it that way; a sketch of the equivalence follows.
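Here is a sketch of that equivalence, using the Sales table defined in the examples below:

-- ROLLUP shorthand: salesman/product totals, per-salesman totals, grand total.
SELECT salesman_name, product_name, SUM(sale_amt) AS total_amt
  FROM Sales
 GROUP BY ROLLUP (salesman_name, product_name);

-- The same groupings spelled out with GROUPING SETS.
SELECT salesman_name, product_name, SUM(sale_amt) AS total_amt
  FROM Sales
 GROUP BY GROUPING SETS ((salesman_name, product_name),
                         (salesman_name),
                         ());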
HAVING Clause
Finally, we worry about the HAVING clause, which in some ways is the simplest part of all of this. It is really just a WHERE clause that applies to the grouped table. We are back to simple search conditions again!

We do not actually need the HAVING clause; if it is not there, we simply stop at the point where we have done the GROUP BY. But this is where we get some real power to pick out grouping characteristics.
Example Queries
A simple example of this technique will give you an idea of how it works. Here is a skeleton table for the example.
CREATE TABLE Sales
(salesman_name VARCHAR(10) NOT NULL,
 product_name VARCHAR(10) NOT NULL
   CHECK (product_name IN ('scissors', 'paper', 'stone')),
 sale_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
 sale_amt DECIMAL(6,2) NOT NULL
   CHECK (sale_amt >= 0.00),
 PRIMARY KEY (salesman_name, product_name, sale_date));

INSERT INTO Sales
VALUES ('Fred', 'scissors', '2016-01-01', 100.00),
       ('Fred', 'stone', '2016-01-01', 75.98),
       ('Fred', 'paper', '2016-01-01', 40.00),
       ('John', 'paper', '2016-01-02', 35.85),
       ('John', 'paper', '2016-01-03', 0.00),
       ('John', 'stone', '2016-01-01', 100.00),
       ('John', 'stone', '2016-01-02', 5.00),
       ('John', 'stone', '2016-01-04', 4.98),
       ('Mary', 'paper', '2016-01-11', 45.95),
       ('Mary', 'scissors', '2016-01-10', 0.00),
       ('Mary', 'stone', '2016-01-12', 20.25),
       ('Vlad', 'stone', '2016-01-11', 23.95),
       ('Vlad', 'stone', '2016-01-12', 100.00),
       ('Vlad', 'stone', '2016-01-13', 75.98),
       ('Vlad', 'stone', '2016-01-14', 16.85);
Our first problem is to find the salesmen who sold all three products:
SELECT salesman_name
  FROM Sales
 GROUP BY salesman_name
HAVING COUNT (DISTINCT product_name) = 3;
This will give us 'Mary' and 'Fred' as the salesmen who sold the whole catalog. Now, tell me if we have a salesman who has sold only one product. We do not care which product; we are looking for a specialized guy.
SELECT salesman_name, MAX(product_name) AS single_product_name
  FROM Sales
 GROUP BY salesman_name
HAVING MIN(product_name) = MAX(product_name);
But why do it this way? Why not use HAVING COUNT (DISTINCT product_name) = 1 instead? That would look like the first query. The optimizer has to do a scan to count the distinct product names, but the table statistics in every SQL compiler I know about keep at least the min, max, and row count for computing execution plans. Once I have seen the min (or max) value of that column for a salesman, I have found the min (or max) value for that salesman and can ignore the rest of his values.
Try another one: do we have salesmen who sold at least $100.00 of products? The obvious way is:
SELECT salesman_name
  FROM Sales
 GROUP BY salesman_name
HAVING SUM(sale_amt) >= 100.00;
We get ‘Fred’, ‘Vlad’ and ‘John’ as the result set. But there is an alternative.
SELECT salesman_name
  FROM Sales
 GROUP BY salesman_name
HAVING SIGN(SUM(sale_amt) - 100.00) >= 0;
In this case, this trick is probably not worth it. The next example applies the techniques in a more complex situation. This example was actually taken from a posting on a SQL forum. Yes, it is a skeleton, so be kind. However, it can be used to demonstrate how to get things into a form where we can use some of the tricks with the GROUP BY.
CREATE TABLE Foo
(generic_id CHAR(2) NOT NULL PRIMARY KEY,
 vague_name VARCHAR(20) NOT NULL);

INSERT INTO Foo
VALUES ('10', 'Albert'),
       ('11', 'Bob'),
       ('12', 'Chuck');

CREATE TABLE Bar
(generic_id CHAR(2) NOT NULL
   REFERENCES Foo (generic_id),
 creation_date DATE DEFAULT CURRENT_TIMESTAMP NOT NULL,
 assembler_flag BIT,
 PRIMARY KEY (generic_id, creation_date));

INSERT INTO Bar
VALUES ('10', '2015-01-01', 1),
       ('10', '2015-01-02', 1),
       ('11', '2015-01-01', 0),
       ('11', '2015-01-12', 1),
       ('12', '2015-01-01', 0),
       ('12', '2015-02-02', 0);
What I want is the latest creation_date for a given name where the assembler_flag is zero; if there is no zero assembler_flag for that name, then I want the latest creation_date for assembler_flag = 1.
SELECT T.vague_name, T.creation_date, T.assembler_flag
  FROM (SELECT Foo.vague_name, Bar.creation_date, Bar.assembler_flag,
               ROW_NUMBER() OVER (PARTITION BY Foo.generic_id
                                      ORDER BY Bar.assembler_flag,
                                               Bar.creation_date DESC) AS rn
          FROM Bar, Foo
         WHERE Foo.generic_id = Bar.generic_id) AS T
 WHERE T.rn = 1;
The results are summarized here.
vague_name   creation_date   assembler_flag   comment
Albert       '2015-01-02'    1                all assembler_flags = 1
Bob          '2015-01-01'    0                mixed assembler_flags
Chuck        '2015-02-02'    0                all assembler_flags = 0
I named the column "assembler_flag" to stress that we do not use BIT flags in SQL; that was assembler language, and this is a predicate-based language. But did you notice that Bob and Chuck are treated the same way? We lost the information about Bob having mixed flags. We cannot handle NULL flags in this query, either. We also spent time partitioning, sorting and materializing a row number.
The usual trick for finding out whether a grouping has only one value is to use HAVING (MIN (assembler_flag) = MAX (assembler_flag)), but SQL Server gives an error (Msg 8117, Level 16, State 1, Line 6; Operand data type bit is invalid for min operator)! This is yet another reason not to write 1960s assembly-flag code in SQL. The BIT data type in SQL Server was originally a true bit (0, 1); then it became an exact numeric data type (0, 1, NULL) in later releases. But it does not work with aggregate functions the way you might expect.
We still have to handle NULLs without business rules from the poster. Let's assume that a NULL becomes zero. This gives us the first hack at an answer, with a CTE to handle the NULL problem.
WITH Build_Groups
AS
(SELECT Foo.vague_name, Bar.creation_date,
        CAST (COALESCE (Bar.assembler_flag, 0) AS INTEGER) AS flg
   FROM Foo, Bar
  WHERE Foo.generic_id = Bar.generic_id)
SELECT vague_name, MAX(creation_date) AS recent_creation_date
  FROM Build_Groups
 GROUP BY vague_name
HAVING MIN(flg) = MAX(flg);
This will give you Albert and Chuck (each has only one flag value), but not Bob (mixed flags). Back to the drawing board! Well, keyboard.
WITH Build_Groups
AS
(SELECT Foo.vague_name, Bar.creation_date,
        CAST (COALESCE (Bar.assembler_flag, 0) AS INTEGER) AS flg
   FROM Foo, Bar
  WHERE Foo.generic_id = Bar.generic_id)
SELECT vague_name,
       CASE WHEN SUM (flg) = COUNT (flg) THEN 'ones'
            WHEN SUM (flg) = 0 THEN 'zeros'
            ELSE 'mixed' END AS flag_class
  FROM Build_Groups
 GROUP BY vague_name;
The result can be summarized like this:
vague_name   flag_class
Albert       'ones'     (all assembler_flags = 1)
Bob          'mixed'
Chuck        'zeros'    (all assembler_flags = 0)
The important thing to notice in this example is that the CASE expression's WHEN clauses have to be at the same level of aggregation as the groups. That is why I am using SUM() and COUNT() in the WHEN clauses. Trivially, constants like 'mixed' are considered to be at all levels of aggregation, as are computations done on the aggregates.