{"id":71677,"date":"2017-07-17T11:08:43","date_gmt":"2017-07-17T11:08:43","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=71677"},"modified":"2021-09-29T16:21:11","modified_gmt":"2021-09-29T16:21:11","slug":"sql-data-aggregation-aggravation","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/t-sql-programming-sql-server\/sql-data-aggregation-aggravation\/","title":{"rendered":"SQL Data Aggregation Aggravation"},"content":{"rendered":"<p>Raw data is usually bulky, so we like to aggregate it into smaller chunks to make it easier to digest, but there are a lot of subtle problems with aggregating data. We like to aggregate data in such a way that when we put the small chunks back together, we get the original set. In mathematics, this is called a partitioning of a set. When you can do it; it\u2019s very, very, nice. But it\u2019s not always possible.<\/p>\n<h2>Defining Subgroups<\/h2>\n<p>Your first thought is that you\u2019re going to partition it by using some kind of <code>GROUP BY<\/code> clause. But back up a step. Many years ago there was a popular puzzle that originally appeared in a Martin Gardner Scientific American column called The Vanishing Leprechaun (See the <a href=\"https:\/\/www.youtube.com\/watch?v=EazHcAnYj7A\">Youtube demonstration here<\/a>). The puzzle consists of three parts, like a jigsaw puzzle, which can be rearranged one of two ways. Depending how the pieces are arranged, one of the leprechauns seems to disappear.<\/p>\n<p>The puzzle asks the false question; \u201cwhere did the leprechaun go?\u201d The problem was that the \u201cleprechaun\u201d was not well-defined. If we were to talk about using the same flour, eggs, butter and sugar to bake a dozen cookies or to make a small cake, you would not look at the cake and ask where the individual cookies are inside the cake, nor where the cake is inside the individual cookies. You would realize the two options with the original materials were aggregated differently. The cake is simply different from the cookies, and vice versa.<\/p>\n<p>As an example of a well-defined aggregation, consider the children\u2019s game Cootie (for an explanation<a href=\"https:\/\/en.wikipedia.org\/wiki\/The_Game_of_Cootie\">, see the Wikipedia page<\/a>). The object of the game is to be the first player to build a &#8220;cootie&#8221; piece by piece from various <a href=\"https:\/\/en.wikipedia.org\/wiki\/Plastic\">plastic<\/a> body parts that include a beehive-like body, a head, antennae, eyes, a coiled proboscis, and six legs. Body parts are acquired by rolling a die. The winner is the first player to completely assemble a cootie. The number of each body part required to make a valid cootie is very well-defined; if he\u2019s missing a part, he\u2019s not a cootie.<\/p>\n<p>The moral to the story is that the set of aggregations is not always the same as the original set.<\/p>\n<h2>Equivalence Classes and Relations<\/h2>\n<p>In set theory, we have equivalence classes and equivalence relations. They partition the set based on some relationship that is well-defined. The two that come to mind in SQL are equality and grouping. The difference is that simple equality (=) doesn\u2019t work with NULLs, but only known values. We also have the <code>mod (n)<\/code> function, which creates<code> (n)<\/code> equivalence classes; the simplest example is that of breaking integers into odd and even classes.<\/p>\n<p>Grouping (as in the<code> GROUP BY<\/code> clause, <code>PARTITION<\/code> and some other places in SQL) treats <code>NULL<\/code>s as falling in the same equivalence class.<\/p>\n<p>By using the <code>AVG()<\/code> and<code> SUM()<\/code> math functions with <code>GROUP BY<\/code>, you reduce a group attribute to a value. These are the ones we picked in the early days of SQL, because they were very easy and very well understood. The problem is that they depend on the numeric values that we can compute.<\/p>\n<p>We could depict the median, or a whole bunch of other measures of central tendency whose names involve a famous statistician, or Greek letter. They have a common characteristic, however; they depend on numeric values upon which we can do computations. In short, they\u2019re not very good for discrete or descriptive values. In fact, they don\u2019t make sense for such things. What is the average of the colors of shoes when you group by a city? If both yellow and red are equally represented, is the average color orange? The question is absurd. This is a classic issue of continuous versus discrete variables when you are doing statistics. Did you notice that the value returned does not have to be a member of the set from which it was created? That\u2019s important. It means that it\u2019s perfectly logical for the families in the population to have an average of 2.3 children. But in practice, it\u2019s very embarrassing if the police find you with a 0.3 child in your house.<\/p>\n<p>SQL did give you some aggregate functions that apply to discrete variables. The extrema functions (<code>MIN ()<\/code> and <code>MAX ()<\/code>) extract a single value from the class. These values have be attributes of at least one element in that set. The obvious discrete statistical function that is missing is the <code>MODE(),<\/code> or most frequently occurring value in the set. It\u2019s a terrible measure of central tendency in the set and there can be several of them in the case of ties, but it has the advantage of representing majority rule voting when it is unique.<\/p>\n<p>A major problem with the mode, and other discrete statistical functions, is that they behave differently for the whole population than in subsets of the population. The best-known example of this is Simpson\u2019s paradox (see my article: <a href=\"https:\/\/www.simple-talk.com\/opinion\/opinion-pieces\/data-is-crazier-than-you-think\/\">Data is crazier than you think<\/a>).<\/p>\n<p>Simpson\u2019s paradox has nothing to do with Homer Simpson or O. J. Simpson; it is named after the British statistician Edward Hugh Simpson who wrote about it in the 1950s. Simpson\u2019s paradox happens when a trend that appears in separate groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data. It tends to happen when you partition your data into subsets where each of the subsets is a little skewed and of odd sizes, and illustrates the dangers of errors in population sampling.<\/p>\n<h2>Gerrymandering<\/h2>\n<p>Instead of trying to break up our population into subsets that avoid Simpson\u2019s paradox, we can actually try to increase it! This is called gerrymandering. For a really nice animation on the basic principles go to <a href=\"https:\/\/www.washingtonpost.com\/video\/business\/gerrymandering-explained\/2016\/04\/21\/e447f5c2-07fe-11e6-bfed-ef65dff5970d_video.html\">Gerrymandering Explained<\/a>.<\/p>\n<p>The example is a population of 50 voters who belong to either Red or Blue party. We have voters who are 60% Blue and 40% Red. In what is called a perfect representation, each of five districts has two Reds and three Blues in it. This gives Blues absolute domination over the five districts of 10 voters each. And we have the tyranny of the majority.<\/p>\n<p>However, I can group the voters in such a way that I have six Blue districts and four Red districts. Even better, create two all Blue districts and three (6 Reds + 4 Blues) districts. Suddenly the Reds are in control!<\/p>\n<p>In the case of gerrymandering, this is done deliberately, but in the case of sampling, it might be accidental, or at least unintentional. For my generation, we remember the draft lottery in 1969 for the Vietnam war. A piece of paper with each calendar day was put it in a capsule, and the capsule was drawn from a bingo game tumbling cage. The capsules were drawn by hand and that date was assigned draft number #1, the next date was draft #2, until the last one was draft #366 (they included leap days).<\/p>\n<h2>Bad Sampling<\/h2>\n<p>These numbers determined the order in which people would be drafted, and it was projected that about 10% of the males would be sent to Vietnam immediately after bootcamp and that another 10% would go to Vietnam within their period of enlistment.<\/p>\n<p>The correlation between day of birth (1-366) and draft number was -.28. A random process would have produced a correlation around zero. Because this correlation was statistically significant, the lottery was challenged in court. The reason the drawing didn\u2019t work properly was the way the capsules were put in the tumbler and drawn from it. This was fixed later in the following draft lottery, which was based on initials.<\/p>\n<h2>Ntile Partitions<\/h2>\n<p>One obvious way to split up your data with the <code>RANK<\/code> and <code>DENSE_RANK<\/code> functions. The problem with this is that the groups they form have a uniform value for whatever you grouped on. We are really trying to find subsets of things. Yes, they should have a commonality, but not be identical. Think back to our voting districts; they really should have some variety of voters.<\/p>\n<p>The <code>NTILE()<\/code> function varies a bit from one SQL or statistical product to another, but they are all basically the same in intent. R and statistical packages tend to have more options than just SQL. The Microsoft syntax is a windowed function call:<\/p>\n<p><code>NTILE &lt;positive integer constant&gt;) OVER ([&lt;partition by clause&gt;] &lt;order by clause&gt;)<\/code><\/p>\n<p>The &lt;positive integer constant&gt; specifies the number of groups into which each partition must be divided. Without the <code>ORDER BY<\/code> clause, this makes no sense and without the <code>PARTITION BY<\/code> clause, the whole table is assumed. The row and range clause would make no sense. No surprises here.<\/p>\n<p>Basically, you\u2019re going to take the result set, order it and partition it into <code>&lt;positive integer constant&gt;<\/code> groups. To go back to the previous example, if I had a set of 50 voters, I could order them on, say, campaign contribution amounts and then divide them into the fat cats in one group (largest contributors) and rank the rest of them down to zero contributions. Obviously, we have some problems with ties. I\u2019m sure that a lot of voters on my mailing list haven\u2019t sent me anything and just want to get the free coffee mug or whatever I\u2019m using is a bribe.<\/p>\n<p>If the total number of rows is divisible by the number of groups, the rows will be evenly distributed among the groups. For example, if the total number of rows is 50, and there are five groups, each bucket will contain 10 rows.<\/p>\n<p>If the number of rows in a partition is not divisible by <code>&lt;positive integer constant&gt;<\/code>, I will have odd men left over and need a rule for signing them to some group. The rule, which is arbitrary, is that larger groups come before smaller groups in the order specified by the <code>OVER<\/code> clause. For example if the total number of rows is 53 and the number of groups is five, the first three groups will have 11 rows and the two remaining groups will have 10 rows each.<\/p>\n<p><code>NTILE()<\/code> is not deterministic. In fact, it\u2019s really pretty weird. Consider this quick example, which lacks a key, and is therefore not a valid table:<\/p>\n<pre>CREATE TABLE Foobar (x INTEGER NOT NULL);\r\n\r\nINSERT INTO Foobar\r\nVALUES (1), (2), (2), (3), (3), (4), (4), (5);\r\n\r\nSELECT x, NTILE(4) OVER (ORDER BY x) AS quad FROM Foobar;\r\nx   quad\r\n--------\r\n1   1\r\n2   1\r\n2   2\r\n3   2\r\n3   3\r\n4   3\r\n4   4\r\n5   4\r\n<\/pre>\n<p>Now every quadrille group shares a value with the one next to it in sequence! But what else should it do? The description of this function is to split the original data into &lt;positive integer constant&gt; groups. If I wanted to get rid of ties, I would have had to handle this in my query.<\/p>\n<h2>Summary<\/h2>\n<p>The message of this little thought piece is that \u201cit\u2019s really hard to put together leprechauns from leprechaun parts\u201d \u2013 which is really, kinda gruesome when you think about it. But at least I hope you\u2019ll think about your data, when you try to aggregate from a raw set of the rows.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>When we have to deal with and store a lot of data, it makes sense to aggregate it so that we store only the information we actually need.  If we get this right, this works well, but the design of the system takes care and thought because the problems can be subtle and various. Joe Celko describes some of the ways that things can go wrong and end up providing incorrect, inaccurate or misleading results.&hellip;<\/p>\n","protected":false},"author":96214,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[30,143531],"tags":[5134],"coauthors":[6781],"class_list":["post-71677","post","type-post","status-publish","format-standard","hentry","category-opinion-pieces","category-t-sql-programming-sql-server","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71677","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/96214"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=71677"}],"version-history":[{"count":4,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71677\/revisions"}],"predecessor-version":[{"id":71684,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71677\/revisions\/71684"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=71677"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=71677"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=71677"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=71677"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}