{"id":97209,"date":"2023-07-10T20:49:20","date_gmt":"2023-07-10T20:49:20","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=97209"},"modified":"2023-06-22T20:50:28","modified_gmt":"2023-06-22T20:50:28","slug":"the-group-by-clause","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/theory-and-design\/the-group-by-clause\/","title":{"rendered":"The GROUP BY Clause"},"content":{"rendered":"<p>When you\u2019re learning SQL DML, the most complicated clause is typically the <code>GROUP BY<\/code>. It\u2019s a fairly simple grouping based on values from the<code> FROM<\/code> clause in a <code>SELECT<\/code> statement. It\u2019s what a mathematician would call an <strong>equivalence<\/strong> relation. This means that each grouping, or equivalence class, has the same value for a given characteristic function and the rows are all partitioned into disjoint subsets (groups) that had the same value under that function. The results table can also have columns that give group characteristics; I will get into those in a minute.<\/p>\n<p>In the case of the usual <code>GROUP BY<\/code>, the function is equality. (More or less. I\u2019ll get to that in a bit, too, but if you know anything about relational programming you might suspect <code>NULL<\/code> to be involved). Another possible function is modulus arithmetic. Taking <code>MOD (&lt;integer expression&gt;, 2)<\/code> splits the results into odd and even groups, for example.<\/p>\n<p>Because SQL is an orthogonal language, we can actually do some fancy tricks with the <code>GROUP BY<\/code> clause. The term <strong>orthogonality<\/strong> refers to the property of a computer language which allows you to use any expression that returns a validate result anywhere in the language. Today, you take this property for granted in modern programming languages, but this was not always the case. The original <code>FORTRAN<\/code> allowed only certain expressions to be used as array indexes (it has been too many decades, but some of the ones allowed were <code>&lt;integer constant&gt;<\/code>, <code>&lt;integer variable&gt;<\/code>, and <code>&lt;integer constant&gt;*&lt;integer variable&gt;<\/code>) this was due to the operations allowed by the hardware registers of the early IBM machines upon which <code>FORTRAN<\/code> was implemented.<\/p>\n<h2>NULL Values and Grouping<\/h2>\n<p><code>NULL<\/code> values have always been problematic. One of the basic rules in SQL is that a <code>NULL<\/code> value does not equal anything including another <code>NULL<\/code>. This implies each row should either be excluded or form its own singleton group when you use equality as your characteristic function.<\/p>\n<p>We discussed this problem in the original ANSI X3H2 Database Standards Committee. There was one member\u2019s company SQL which grouped using strict equality. They ran into a problem with a customer\u2019s database involving traffic tickets. If an automobile did not have a tag, then obviously the correct data model would have been to use a <code>NULL<\/code> value. Unfortunately, in the real world, this meant every missing tag became its own group. This is not too workable in the state of California. A simple weekly report quickly became insanely long and actually hid information.<\/p>\n<p>When an automobile was missing a tag, the convention had been put in something a human being could read and they picked \u201cnone\u201d as the dummy value. Then along came somebody who got a prestige tag that read \u201cNONE\u201d to be cute. The system cheerfully dumped thousands of traffic tickets on to him as soon as his new tag got into the system. Other members had similar stories.<\/p>\n<p>This led us to the equivalence relationship which I will call grouping. It acts just like equality for non-<code>NULL<\/code> values, but it treats <code>NULL<\/code> values as if they are all equal (the <code>IS [NOT] IS DISTINCT<\/code> infixed comparison operator did not exist at the time).<\/p>\n<h2>GROUP BY Basics<\/h2>\n<p>The skeleton syntax of a simple <code>SELECT<\/code> statement with a <code>GROUP BY<\/code> clause is<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT &lt;group column expression list&gt;\r\nFROM &lt;table expression&gt;\r\n[WHERE &lt;row search condition&gt;]0\r\nGROUP BY &lt;column expression list&gt;\r\n[HAVING &lt;group search condition&gt;];<\/pre>\n<p>Here are the basic characteristics of this construct:<\/p>\n<ul>\n<li>You get one row returned for each group. Each row has only group characteristics; that means aggregate functions of some kind and columns that were used in the characteristic function.<\/li>\n<li>The <code>GROUP BY<\/code> clause can only be used with a SQL <code>SELECT<\/code> statement.<\/li>\n<li>The <code>GROUP BY<\/code> clause must be after the <code>WHERE<\/code> clause. (if the query has one. If the <code>WHERE<\/code> clause is absent, then the whole table is treated as if it\u2019s one group).<\/li>\n<li>The <code>GROUP BY<\/code> clause must be before the <code>ORDER BY<\/code> clause. (if the query has one).<\/li>\n<li>To filter the <code>GROUP BY<\/code> results, you must use the <code>HAVING<\/code> clause after the <code>GROUP BY<\/code>.<\/li>\n<li>The <code>GROUP BY<\/code> clause is often used in conjunction with aggregate functions.<\/li>\n<li>All column names listed in the <code>SELECT<\/code> clause should also appear in the <code>GROUP BY<\/code> clause, whether you have an aggregate function or not.<\/li>\n<\/ul>\n<p>Note that using a <code>GROUP BY<\/code> clause Is meaningless if there are no duplicates in the columns you are grouping by.<\/p>\n<h2>Simple Statistics<\/h2>\n<p>The original <code>GROUP BY<\/code> clause came with a set of simple descriptive statistics; the <code>COUNT<\/code>, <code>AVG<\/code>, <code>MIN<\/code>, <code>MAX<\/code>, and <code>SUM<\/code> functions. Technically, they all have the option of a <code>DISTINCT<\/code> or <code>ALL<\/code> parameter qualifier. One of the weirdness of SQL is that various constructs can have parentheses for parameters or lists, and these lists can have SQL keywords inside of the parameters. Let\u2019s look at these basic functions one at a time.<\/p>\n<p><strong><code>COUNT([ALL | DISTINCT | *] &lt;expression&gt;) <\/code><\/strong><\/p>\n<p>Returns an integer between zero whatever the count of the rows in this group is. If the expression returns <code>NULL<\/code>, it is Ignored in the count. The <code>ALL<\/code> option is redundant, and I\u2019ve never seen anybody use this in the real world. The <code>DISTINCT<\/code> option removes redundant duplicates before applying the function. The <code>*<\/code> (asterisk) as the parameter applies only to the <code>COUNT()<\/code> function for obvious reasons. It is sort of a wildcard that stands for \u201cgeneric row\u201d, without regard to the columns that make up that row.<\/p>\n<p><strong><code>SUM ([ALL | DISTINCT |  &lt;expression&gt;)  <\/code><\/strong><\/p>\n<p>This is a summation of a numeric value after the <code>NULL<\/code> values have been removed from the set. Obviously, this applies only to numeric data. If you look at SQL, you\u2019ve probably knows it has a lot of different kinds of numeric data. I strongly suggest taking a little time to see how your particular SQL product returns the result. The scale and precision may vary from product to product.<\/p>\n<p><strong><code>AVG() <\/code><\/strong><\/p>\n<p>This function is a version of the simple arithmetic mean. Obviously, this function would only apply to columns with numeric values. It sums all the non-<code>NULL<\/code> values in a set, then divides that number by the number of non-<code>NULL<\/code> values in that set to return the average value. It\u2019s also wise to be careful about <code>NULL<\/code> values; consider a situation where you have a table that models employee compensation. This compensation includes a base salary and bonuses. But only certain employees are eligible for bonuses. Unqualified employees have rows that show the <code>bonus_amt<\/code> column value as <code>NULL<\/code>. This lets us maintain the difference between an employee who is not qualified and somebody who just didn\u2019t get any bonus ($0.00) in this paycheck. The query should look like this skeleton:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT emp_id,\r\n  SUM(salary_amount + COALESCE (bonus_amount, 0.00))\r\n  AS total_compensation_amount\r\nFROM Paychecks\r\nGROUP BY emp_id;<\/pre>\n<p><strong><code>MIN() and MAX()<\/code><\/strong><\/p>\n<p>These functions are called <strong>extrema<\/strong> functions in mathematics. Since numeric, string and temporal functions have ordering to them, they can be used by these two functions. Technically, you can put in the <code>ALL<\/code> and <code>DISTINCT<\/code> in the parameter list; which is somewhat absurd.<\/p>\n<p>The reason for picking the small set of descriptive statistics was the ease of implementation. They are still quite powerful and there\u2019s a lot of cute tricks you can try with them. For example, if (<code>MAX(x) = MIN(x)<\/code>) then we know the (<code>x<\/code>) column has one and only one value in it. Likewise, (<code>COUNT(*) = COUNT(x)<\/code>) tells us that column <code>x<\/code> does not have any <code>NULL<\/code> vales. But another reason for this selection of aggregate functions is that some of these things are already being collected by cost-based optimizers. We essentially got them for free.<\/p>\n<h2>More Descriptive Stats<\/h2>\n<p>The SQL:2006 Standard added some more descriptive aggregate functions to the language. I\u2019ve actually never seen anybody use them, but they\u2019re officially there. Various SQL products have ended other functions in top of these, but let\u2019s go ahead and make a short list of some of what\u2019s officially there.<\/p>\n<p><strong><code>VAR_POP<\/code><\/strong><\/p>\n<p>This is also written as simply <code>VAR()<\/code> or <code>VARP()<\/code>. It\u2019s a statistic called the variance, which is defined as the sum of squares of the difference of <code>&lt;value expression&gt;<\/code> and the <code>&lt;independent variable expression&gt;<\/code>, divided by the number of rows.<\/p>\n<p>It is a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean, and thus from every other number in the set. We write it as \u00a0\u03c3<sup>2, <\/sup>, in mathematics. Variance tells you the degree of spread in your data set. The more spread the data, the larger the variance is in relation to the mean. You can see why this might be useful for an optimizer. We also added <code>VAR_SAMP()<\/code> To compute the sample variance.<\/p>\n<p><strong><code>STDEV_POP<\/code><\/strong><\/p>\n<p>This is the population standard deviation. The standard deviation (or \u03c3) is a measure of how dispersed the data is in relation to the mean. You can think of it as another way of getting the same information as the variance, but it can go negative.<\/p>\n<p>A standard deviation close to zero indicates that data points are close to the mean, whereas a high or low standard deviation indicates data points are respectively above or below the mean.<\/p>\n<p>We also have another function defined for a sample population, STDEV_SAMP. There are no surprises here so you can see the \u201c_SAMP\u201d postfix is consistent.<\/p>\n<p>There are other functions that were added standard deal with regression, and correlation. They are even more obscure and are never actually in the real world.<\/p>\n<p>Interestingly enough, standards did not build in the mode (most frequently occurring value in a set), and the median (middle value of a set). The mode has a problem. There can be several of them. If population has an equal number of several different values, you get a multi-modal distribution; SQL functions do not like to return non-scalar values. The median has a similar sort of problem. If the set of values has an odd number of elements in it, then the median is pretty well defined as the value which sits dead center when you order the values. However, if you have an even number of values, then it gets a little harder. You have to take the average of values in the middle of the ordering. Let\u2019s say you have a column with values <code>{1, 2, 2, 3, 3, 3}<\/code>. The values in the middle of this list are <code>{2, 3}<\/code> which averages out to <code>2.5<\/code>. But if I use an average computed with duplicate values, I get <code>13\/5 = 2.6<\/code> instead. This second weighted average is actually more accurate because it shows the slight skew toward 3.<\/p>\n<h2>OLAP Extensions ROLLUP, CUBE and GROUPING SETS<\/h2>\n<p>OLAP, or \u201cOnline Analytical Processing\u201d, became a fad in the mid-2010\u2019s. Several products devoted to this sort of statistical works hit the market and SQL decided to add extensions that would allow it to be used in place of a dedicated package. This added more descriptive statistics to SQL.<\/p>\n<p><em>Note: I would be very careful about using SQL as a statistical reporting language. We never intended it for this purpose, so it\u2019s hard to guarantee if the corrections for floating-point rounding errors and other things that need to go into a good statistical package are going to be found in your SQL product.<\/em><\/p>\n<p><code>GROUPING SETS<\/code><\/p>\n<p>The grouping sets construct Is the basis for <code>CUBE<\/code> and <code>ROLLUP<\/code>. You can think of it as shorthand for a series of <code>UNION<\/code> queries that are common in reports. But since a <code>SELECT<\/code> statement must return a table, we have to do some padding with generated <code>NULL<\/code> values to keep the same number of columns in each row. For example:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT dept_name, job_title, COUNT(*)\r\nFROM Personnel\r\nGROUP BY GROUPING SET (dept_name, job_title);<\/pre>\n<p>This will give us a count for departments as a whole and for the job titles within the departments. You can think of it as a shorthand for<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">BEGIN\r\n  SELECT dept_name, CAST(job_title AS NULL), COUNT(*)\r\n  FROM Personnel\r\n  GROUP BY dept_name\r\nUNION ALL\r\n  SELECT CAST(dept_name AS NULL), job_title, COUNT(*)\r\n  FROM Personnel\r\n  GROUP BY job_title;\r\nEND;<\/pre>\n<p>If you think about it for a minute, you\u2019ll see there\u2019s a little problem here. I don\u2019t know if the <code>NULL<\/code> values that I\u2019ve created with my <code>CAST ()<\/code> Function calls were in the original data or not. That\u2019s why we have a <code>GROUPING (&lt;grouping column name&gt;)<\/code> Function to test for it. It returns zero if the <code>NULL<\/code> was in the original data and one If it was generated, and therefore belongs to a sub group.<\/p>\n<p>For example,<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT\r\nGROUPING \r\n   CASE (dept_name) \r\n    WHEN 1 THEN \u2018Department Total\u2019.\r\n    ELSE dept_name END AS dept_name,\r\nGROUPING \r\n   CASE (job title) \r\n    WHEN 1 THEN \u2018Job Total\u2019\r\n   .ELSE job_title END AS job_title\r\n FROM Personnel\r\nGROUP BY GROUPING SET (dept_name, job_title);<\/pre>\n<p>I\u2019m a little ashamed of this example because it shows me using SQL display formatting on a result. This violates the principle of a tiered architecture.<\/p>\n<p><code>ROLLUP<\/code><\/p>\n<p>The <code>ROLLUP<\/code> subclause can be defined with the <code>GROUPING SET<\/code> construct, which is why I introduced <code>GROUPING SET<\/code> first. In reality, the <code>ROLLUP<\/code> is our good old hierarchical report in a new suit. We used to call these <strong>breakpoint<\/strong>, or <strong>control and break <\/strong>reports in the old days of sequential file processing. The report program set up a bunch of registers to keep running aggregates (usually totals). Every time you pass a control point in the input sequence of records, you dump the registers, reset them with their calculations and begin again. Because of sequential processing, lowest level in his hierarchy would print out first and then the next level of aggregation would appear and so on until you got the grand totals.<\/p>\n<p>Consider <code>GROUP BY ROLLUP (state_code, county_name, emp_id)<\/code> as shorthand for<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">GROUP BY GROUPING SETS\r\n (state_code, county_name, emp_id),\r\n(state_code, county_name),\r\n(state_code),\r\n()  -- Entire table<\/pre>\n<p>Please notice that the order of those columns in the <code>ROLLUP<\/code> clause is important. This will give you an aggregate for every employee within each county of each state, an aggregate for every county within each state, that same aggregate for each state and finally a grand aggregate for the entire data set (that is what the empty parentheses mean).<\/p>\n<p><code>CUBE<\/code><\/p>\n<p>The cube supergroup is another SQL 99 extension which is really an old friend with a new name. We used to call it \u201ccross tabs\u201d which is short for cross tabulation. In short, it creates unique groups for all possible combinations of the columns you specify. For example, if you use <code>GROUP BY CUBE on (column1, column2)<\/code> of your table, SQL returns groups for all unique values <code>(column1, column2), (NULL, column2), (column1, NULL) and (NULL, NULL)<\/code>.<\/p>\n<h2>Conclusion<\/h2>\n<p>This is just a skim over the options available in the <code>GROUP BY<\/code> clause. Anytime you have a query that works on the data with an equivalence relation, is a pretty good chance you will be able to do it using a <code>GROUP BY<\/code>.<\/p>\n<p>As a quick programming exercise, I recently saw a post on one of the SQL forums by a less-experienced programmer. He wanted to <code>UPDATE<\/code> a flag column (Yes, flags are a bad idea in SQL) from 0 to 1, if any row in the group had a one. The code got fairly elaborate because he had to destroy data as he overwrote existing rows.<\/p>\n<p>Can you write a simple piece of SQL that will give us this information, using a GROUP BY? It definitely is possible.<\/p>\n<p><strong>The GROUP BY Clause<\/strong><\/p>\n<p>When you\u2019re learning SQL DML, the most complicated clause is typically the <code>GROUP BY<\/code>. It\u2019s a fairly simple grouping based on values from the<code> FROM<\/code> clause in a <code>SELECT<\/code> statement. It\u2019s what a mathematician would call an <strong>equivalence<\/strong> relation. This means that each grouping, or equivalence class, has the same value for a given characteristic function and the rows are all partitioned into disjoint subsets (groups) that had the same value under that function. The results table can also have columns that give group characteristics; I will get into those in a minute.<\/p>\n<p>In the case of the usual <code>GROUP BY<\/code>, the function is equality. (More or less. I\u2019ll get to that in a bit, too). Another possible function is modulus arithmetic. Taking <code>MOD (&lt;integer expression&gt;, 2)<\/code> splits the results into odd and even groups, for example.<\/p>\n<p>Because SQL is an orthogonal language, we can actually do some fancy tricks with the <code>GROUP BY<\/code> clause. The term <strong>orthogonality<\/strong> refers to the property of a computer language which allows you to use any expression that returns a validate result anywhere in the language. Today, you take this property for granted in modern programming languages, but this was not always the case. The original <code>FORTRAN<\/code> allowed only certain expressions to be used as array indexes (it has been too many decades, but some of the ones allowed were <code>&lt;integer constant&gt;<\/code>, <code>&lt;integer variable&gt;<\/code>, and <code>&lt;integer constant&gt;*&lt;integer variable&gt;<\/code>) this was due to the operations allowed by the hardware registers of the early IBM machines upon which <code>FORTRAN<\/code> was implemented.<\/p>\n<h2>NULL Values and Grouping<\/h2>\n<p><code>NULL<\/code> values have always been problematic. One of the basic rules in SQL is that a <code>NULL<\/code> value does not equal anything including another <code>NULL<\/code>. This implies each row should either be excluded or form its own singleton group when you use equality as your characteristic function.<\/p>\n<p>We discussed this problem in the original ANSI X3H2 Database Standards Committee. There was one member\u2019s company SQL which grouped using strict equality. They ran into a problem with a customer\u2019s database involving traffic tickets. If an automobile did not have a tag, then obviously the correct data model would have been to use a <code>NULL<\/code> value. Unfortunately, in the real world, this meant every missing tag became its own group. This is not too workable in the state of California. A simple weekly report quickly became insanely long and actually hid information.<\/p>\n<p>When an automobile was missing a tag, the convention had been put in something a human being could read and they picked \u201cnone\u201d as the dummy value. Then along came somebody who got a prestige tag that read \u201cNONE\u201d to be cute. The system cheerfully dumped thousands of traffic tickets on to him as soon as his new tag got into the system. Other members had similar stories.<\/p>\n<p>This led us to the equivalence relationship which I will call grouping. It acts just like equality for non-<code>NULL<\/code> values, but it treats <code>NULL<\/code> values as if they are all equal (the <code>IS [NOT] IS DISTINCT<\/code> infixed comparison operator did not exist at the time).<\/p>\n<h2>GROUP BY Basics<\/h2>\n<p>The skeleton syntax of a simple <code>SELECT<\/code> statement with a <code>GROUP BY<\/code> clause is<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT &lt;group column expression list&gt;\r\nFROM &lt;table expression&gt;\r\n[WHERE &lt;row search condition&gt;]0\r\nGROUP BY &lt;column expression list&gt;\r\n[HAVING &lt;group search condition&gt;];<\/pre>\n<p>Here are the basic characteristics of this construct:<\/p>\n<ul>\n<li>You get one row returned for each group. Each row has only group characteristics; that means aggregate functions of some kind and columns that were used in the characteristic function.<\/li>\n<li>The <code>GROUP BY<\/code> clause can only be used with a SQL <code>SELECT<\/code> statement.<\/li>\n<li>The <code>GROUP BY<\/code> clause must be after the <code>WHERE<\/code> clause. (if the query has one. If the <code>WHERE<\/code> clause is absent, then the whole table is treated as if it\u2019s one group).<\/li>\n<li>The <code>GROUP BY<\/code> clause must be before the <code>ORDER BY<\/code> clause. (if the query has one).<\/li>\n<li>To filter the <code>GROUP BY<\/code> results, you must use the <code>HAVING<\/code> clause after the <code>GROUP BY<\/code>.<\/li>\n<li>The <code>GROUP BY<\/code> clause is often used in conjunction with aggregate functions.<\/li>\n<li>All column names listed in the <code>SELECT<\/code> clause should also appear in the <code>GROUP BY<\/code> clause, whether you have an aggregate function or not.<\/li>\n<\/ul>\n<p>Note that using a <code>GROUP BY<\/code> clause Is meaningless if there are no duplicates in the columns you are grouping by.<\/p>\n<h2>Simple Statistics<\/h2>\n<p>The original <code>GROUP BY<\/code> clause came with a set of simple descriptive statistics; the <code>COUNT<\/code>, <code>AVG<\/code>, <code>MIN<\/code>, <code>MAX<\/code>, and <code>SUM<\/code> functions. Technically, they all have the option of a <code>DISTINCT<\/code> or <code>ALL<\/code> parameter qualifier. One of the weirdness of SQL is that various constructs can have parentheses for parameters or lists, and these lists can have SQL keywords inside of the parameters. Let\u2019s look at these basic functions one at a time.<\/p>\n<p><code>COUNT([ALL | DISTINCT | *] &lt;expression&gt;) <\/code><\/p>\n<p>Returns an integer between zero whatever the count of the rows in this group is. If the expression returns <code>NULL<\/code>, it is Ignored in the count. The <code>ALL<\/code> option is redundant, and I\u2019ve never seen anybody use this in the real world. The <code>DISTINCT<\/code> option removes redundant duplicates before applying the function. The <code>*<\/code> (asterisk) as the parameter applies only to the <code>COUNT()<\/code> function for obvious reasons. It is sort of a wildcard that stands for \u201cgeneric row\u201d, without regard to the columns that make up that row.<\/p>\n<p><code>SUM ([ALL | DISTINCT |  &lt;expression&gt;)  <\/code><\/p>\n<p>This is a summation of a numeric value after the <code>NULL<\/code> values have been removed from the set. Obviously, this applies only to numeric data. If you look at SQL, you\u2019ve probably knows it has a lot of different kinds of numeric data. I strongly suggest taking a little time to see how your particular SQL product returns the result. The scale and precision may vary from product to product.<\/p>\n<p><code>AVG() <\/code><\/p>\n<p>This function is a version of the simple arithmetic mean. Obviously, this function would only apply to columns with numeric values. It sums all the non-<code>NULL<\/code> values in a set, then divides that number by the number of non-<code>NULL<\/code> values in that set to return the average value. It\u2019s also wise to be careful about <code>NULL<\/code> values; consider a situation where you have a table that models employee compensation. This compensation includes a base salary and bonuses. But only certain employees are eligible for bonuses. Unqualified employees have rows that show the <code>bonus_amt<\/code> column value as <code>NULL<\/code>. This lets us maintain the difference between an employee who is not qualified and somebody who just didn\u2019t get any bonus ($0.00) in this paycheck. The query should look like this skeleton:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT emp_id,\r\n  SUM(salary_amount + COALESCE (bonus_amount, 0.00))\r\n  AS total_compensation_amount\r\nFROM Paychecks\r\nGROUP BY emp_id;<\/pre>\n<p><code>MIN() and MAX()<\/code><\/p>\n<p>These functions are called <strong>extrema<\/strong> functions in mathematics. Since numeric, string and temporal functions have ordering to them, they can be used by these two functions. Technically, you can put in the <code>ALL<\/code> and <code>DISTINCT<\/code> in the parameter list; which is somewhat absurd.<\/p>\n<p>The reason for picking the small set of descriptive statistics was the ease of implementation. They are still quite powerful and there\u2019s a lot of cute tricks you can try with them. For example, if (<code>MAX(x) = MIN(x)<\/code>) then we know the (<code>x<\/code>) column has one and only one value in it. Likewise, (<code>COUNT(*) = COUNT(x)<\/code>) tells us that column <code>x<\/code> does not have any <code>NULL<\/code> vales. But another reason for this selection of aggregate functions is that some of these things are already being collected by cost-based optimizers. We essentially got them for free.<\/p>\n<h2>More Descriptive Stats<\/h2>\n<p>The SQL:2006 Standard added some more descriptive aggregate functions to the language. I\u2019ve actually never seen anybody use them, but they\u2019re officially there. Various SQL products have ended other functions in top of these, but let\u2019s go ahead and make a short list of some of what\u2019s officially there.<\/p>\n<p><code>VAR_POP<\/code><\/p>\n<p>This is also written as simply <code>VAR()<\/code> or <code>VARP()<\/code>. It\u2019s a statistic called the variance, which is defined as the sum of squares of the difference of <code>&lt;value expression&gt;<\/code> and the <code>&lt;independent variable expression&gt;<\/code>, divided by the number of rows.<\/p>\n<p>It is a statistical measurement of the spread between numbers in a data set. More specifically, variance measures how far each number in the set is from the mean, and thus from every other number in the set. We write it as <img loading=\"lazy\" decoding=\"async\" width=\"68\" height=\"91\" class=\"wp-image-97212\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/06\/a-picture-containing-font-design-description-aut-1.png\" alt=\"A picture containing font, design\n\nDescription automatically generated\" \/> \u03c3<sup>2, <\/sup>, in mathematics. Variance tells you the degree of spread in your data set. The more spread the data, the larger the variance is in relation to the mean. You can see why this might be useful for an optimizer. We also added <code>VAR_SAMP()<\/code> To compute the sample variance.<\/p>\n<p><code>STDEV_POP<\/code><\/p>\n<p>This is the population standard deviation. The standard deviation (or <img loading=\"lazy\" decoding=\"async\" width=\"43\" height=\"62\" class=\"wp-image-97213\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2023\/06\/word-image-97209-2-1.png\" \/> \u03c3) is a measure of how dispersed the data is in relation to the mean. You can think of it as another way of getting the same information as the variance, but it can go negative.<\/p>\n<p>A standard deviation close to zero indicates that data points are close to the mean, whereas a high or low standard deviation indicates data points are respectively above or below the mean.<\/p>\n<p>We also have another function defined for a sample population, STDEV_SAMP. There are no surprises here so you can see the \u201c_SAMP\u201d postfix is consistent.<\/p>\n<p>There are other functions that were added standard deal with regression, and correlation. They are even more obscure and are never actually in the real world.<\/p>\n<p>Interestingly enough, standards did not build in the mode (most frequently occurring value in a set), and the median (middle value of a set). The mode has a problem. There can be several of them. If population has an equal number of several different values, you get a multi-modal distribution; SQL functions do not like to return non-scalar values. The median has a similar sort of problem. If the set of values has an odd number of elements in it, then the median is pretty well defined as the value which sits dead center when you order the values. However, if you have an even number of values, then it gets a little harder. You have to take the average of values in the middle of the ordering. Let\u2019s say you have a column with values <code>{1, 2, 2, 3, 3, 3}<\/code>. The values in the middle of this list are <code>{2, 3}<\/code> which averages out to <code>2.5<\/code>. But if I use an average computed with duplicate values, I get <code>13\/5 = 2.6<\/code> instead. This second weighted average is actually more accurate because it shows the slight skew toward 3.<\/p>\n<h2>OLAP Extensions ROLLUP, CUBE and GROUPING SETS<\/h2>\n<p>OLAP, or \u201cOnline Analytical Processing\u201d, became a fad in the mid-2010\u2019s. Several products devoted to this sort of statistical works hit the market and SQL decided to add extensions that would allow it to be used in place of a dedicated package. This added more descriptive statistics to SQL.<\/p>\n<p><em>Note: I would be very careful about using SQL as a statistical reporting language. We never intended it for this purpose, so it\u2019s hard to guarantee if the corrections for floating-point rounding errors and other things that need to go into a good statistical package are going to be found in your SQL product.<\/em><\/p>\n<p><code>GROUPING SETS<\/code><\/p>\n<p>The grouping sets construct Is the basis for <code>CUBE<\/code> and <code>ROLLUP<\/code>. You can think of it as shorthand for a series of <code>UNION<\/code> queries that are common in reports. But since a <code>SELECT<\/code> statement must return a table, we have to do some padding with generated <code>NULL<\/code> values to keep the same number of columns in each row. For example:<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT dept_name, job_title, COUNT(*)\r\nFROM Personnel\r\nGROUP BY GROUPING SET (dept_name, job_title);<\/pre>\n<p>This will give us a count for departments as a whole and for the job titles within the departments. You can think of it as a shorthand for<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">BEGIN\r\n  SELECT dept_name, CAST(job_title AS NULL), COUNT(*)\r\n  FROM Personnel\r\n  GROUP BY dept_name\r\nUNION ALL\r\n  SELECT CAST(dept_name AS NULL), job_title, COUNT(*)\r\n  FROM Personnel\r\n  GROUP BY job_title;\r\nEND;<\/pre>\n<p>If you think about it for a minute, you\u2019ll see there\u2019s a little problem here. I don\u2019t know if the <code>NULL<\/code> values that I\u2019ve created with my <code>CAST ()<\/code> Function calls were in the original data or not. That\u2019s why we have a <code>GROUPING (&lt;grouping column name&gt;)<\/code> Function to test for it. It returns zero if the <code>NULL<\/code> was in the original data and one If it was generated, and therefore belongs to a sub group.<\/p>\n<p>For example,<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">SELECT\r\nGROUPING \r\n   CASE (dept_name) \r\n    WHEN 1 THEN \u2018Department Total\u2019.\r\n    ELSE dept_name END AS dept_name,\r\nGROUPING \r\n   CASE (job title) \r\n    WHEN 1 THEN \u2018Job Total\u2019\r\n   .ELSE job_title END AS job_title\r\n FROM Personnel\r\nGROUP BY GROUPING SET (dept_name, job_title);<\/pre>\n<p>I\u2019m a little ashamed of this example because it shows me using SQL display formatting on a result. This violates the principle of a tiered architecture.<\/p>\n<p><code>ROLLUP<\/code><\/p>\n<p>The <code>ROLLUP<\/code> subclause can be defined with the <code>GROUPING SET<\/code> construct, which is why I introduced <code>GROUPING SET<\/code> first. In reality, the <code>ROLLUP<\/code> is our good old hierarchical report in a new suit. We used to call these <strong>breakpoint<\/strong>, or <strong>control and break <\/strong>reports in the old days of sequential file processing. The report program set up a bunch of registers to keep running aggregates (usually totals). Every time you pass a control point in the input sequence of records, you dump the registers, reset them with their calculations and begin again. Because of sequential processing, lowest level in his hierarchy would print out first and then the next level of aggregation would appear and so on until you got the grand totals.<\/p>\n<p>Consider <code>GROUP BY ROLLUP (state_code, county_name, emp_id)<\/code> as shorthand for<\/p>\n<pre class=\"lang:tsql theme:ssms2012-simple-talk\">GROUP BY GROUPING SETS\r\n (state_code, county_name, emp_id),\r\n(state_code, county_name),\r\n(state_code),\r\n()  -- Entire table<\/pre>\n<p>Please notice that the order of those columns in the <code>ROLLUP<\/code> clause is important. This will give you an aggregate for every employee within each county of each state, an aggregate for every county within each state, that same aggregate for each state and finally a grand aggregate for the entire data set (that is what the empty parentheses mean).<\/p>\n<p><code>CUBE<\/code><\/p>\n<p>The cube supergroup is another SQL 99 extension which is really an old friend with a new name. We used to call it \u201ccross tabs\u201d which is short for cross tabulation. In short, it creates unique groups for all possible combinations of the columns you specify. For example, if you use <code>GROUP BY CUBE on (column1, column2)<\/code> of your table, SQL returns groups for all unique values <code>(column1, column2), (NULL, column2), (column1, NULL) and (NULL, NULL)<\/code>.<\/p>\n<h2>Conclusion<\/h2>\n<p>This is just a skim over the options available in the <code>GROUP BY<\/code> clause. Anytime you have a query that works on the data with an equivalence relation, is a pretty good chance you will be able to do it using a <code>GROUP BY<\/code>.<\/p>\n<p>As a quick programming exercise, I recently saw a post on one of the SQL forums by a less-experienced programmer. He wanted to <code>UPDATE<\/code> a flag column (Yes, flags are a bad idea in SQL) from 0 to 1, if any row in the group had a one. The code got fairly elaborate because he had to destroy data as he overwrote existing rows.<\/p>\n<p>Can you write a simple piece of SQL that will give us this information, using a GROUP BY? It definitely is possible.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When you\u2019re learning SQL DML, the most complicated clause is typically the GROUP BY. It\u2019s a fairly simple grouping based on values from the FROM clause in a SELECT statement. It\u2019s what a mathematician would call an equivalence relation. This means that each grouping, or equivalence class, has the same value for a given characteristic&#8230;&hellip;<\/p>\n","protected":false},"author":96214,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[53,143539],"tags":[],"coauthors":[6781],"class_list":["post-97209","post","type-post","status-publish","format-standard","hentry","category-featured","category-theory-and-design"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97209","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/96214"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=97209"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97209\/revisions"}],"predecessor-version":[{"id":97217,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/97209\/revisions\/97217"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=97209"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=97209"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=97209"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=97209"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}