Median Workbench

Comments 12

Share to social media

The workbench requires SQL Server 2008, but is easily modified to 2005.

A Date for an Mock Feud

If you are an older SQL programmer, you will remember back in the 1990’s there were two database magazines on the newsstands, DBMS and DATABASE PROGRAMMING & DESIGN. They started as competitors, but Miller-Freeman bought DBMS from M&T and kept publishing both titles. Later the two titles are merged into INTELLIGENT ENTERPRISE from CMP Publications, which is more management oriented than the two original magazines.

I started a regular freelance column in DATABASE PROGRAMMING & DESIGN in 1990 and switched to DBMS in 1992 as an employee. Chris Date took my old slot back at DATABASE PROGRAMMING & DESIGN. There was a period where we did a version of the famous Fred Allen and Jack Benny mock feud from the classic era of American radio. Chris would run a column with some position on a topic in his column and I would counter it in my next column. It was great fun, it made people think and (most important) it sold magazines and kept me employed.

One of the topics that bounced around the SQL community in those columns was how to compute the Median with SQL. So Chris and I had to get involved in that and toss code around. Today, the Median is easy to do, but it was a good programming exercise in its day.

The ‘Simple’, or financial Median

The Median is not a built-in function in Standard SQL, like AVG(), and it has several different versions. In fact, most descriptive statistics have multiple versions, which is why SQL stayed away from them and left that job to specialized packages like SAS and SPSS. I will not even mention floating point error handling.

The simple Median is defined as the value for which there are just as many cases with a value below it as above it. If such a value exists in the data set, this value is called the statistical Median. If no such value exists in the data set, the usual method is to divide the data set into two halves of equal size such that all values in one half are lower than any value in the other half. The Median is then the average of the highest value in the lower half and the lowest value in the upper half, and is called the financial Median. In English, sort the data, cut it in the middle and look at the guys on either side of the cut.

The financial Median is the most common term used for this Median, so we will stick to it. Let us use Date’s famous Parts table, from several of his textbooks, which has a column for part_wgt in it, like this:

First sort the table by weights and find the three rows in the lower half of the table. The greatest value in the lower half is 12; the smallest value in the upper half is 14; their average, and therefore the Median, is 13. If the table had an odd number of rows, we would have looked at only one row after the sorting.

Date’s First Median

Date proposed two different solutions for the Median. His first solution was based on the fact that if you duplicate every row in a table, the Median will stay the same. The duplication will guarantee that you always work with a table that has an even number of rows. Since Chris (and myself) are strongly opposed to redundant data, this was an unusual approach for him to use.

The first version that appeared in his column was wrong and drew some mail from me and from others who had different solutions. Here is a corrected version of his first solution:

This involves the construction of a table of duplicated values, which can be expensive in terms of both time and storage space. The use of AVG(DISTINCT x) is important because leaving it out would return the simple average instead of the Median. Consider the set of weights (12, 17, 17, 14, 12, 19). The doubled table, Temp1, is then (12, 12, 12, 12, 14, 14, 17, 17, 17, 17, 19, 19). But because of the duplicated values, Temp2 becomes (14, 14, 17, 17, 17, 17), not just (14, 17). The simple average is (96 / 6.0) = 16; it should be (31/2.0) = 15.5 instead.

Celko’s First Median

I found a slight modification of Date’s solution will avoid the use of a doubled table, but it depends on a CEILING() function.

The CEILING() function is to be sure that if there is an odd number of rows in Parts, the two halves will overlap on that value.

Date’s Second Median

Now the war was on! Did you notice that we had to use a lot of self-joins in those days? That stinks. Date’s second solution was based on Celko’s Median, folded into one query:

Date mentions that this solution will return a NULL for an empty table and that it assumes there are no NULLs in the column. But you have to remember Chris Date, unlike Dr. Codd and SQL, does not use NULLs and thinks they should not be in the Relational Model. If there are NULLs, the COUNT(*) will include them and mess up the computations. You will need to get rid of them with a COUNT(part_nbr) instead. Play with some sample data and see the differences.

Murchison’s Median

Others joined in the battle. Rory Murchison of the Aetna Institute had a solution that modifies Date’s first method by concatenating the key to each value to make sure that every value is seen as a unique entity. Selecting the middle values is then a special case of finding the n-th item in the table.

This method depends on being able to have a HAVING clause without a GROUP BY clause, which is part of the ANSI standard but often missed by new programmers.

Occurrence Count Table

I came back with a working table with the values and a tally of their occurrences from the original table. Now that we have this table, we want to use it to construct a summary table that has the number of occurrences of each part_wgt and the total number of data elements before and after we add them to the working table.

Philip Vaughan of San Jose, CA proposed a simple Median technique based on a VIEW with unique weights and number of occurrences and then a VIEW of the middle set of weights.

The MiddleValues VIEW is used to get the Median by taking an average. The clever part of this code is the way that it handles empty result sets in the outermost WHERE clause that result from having only one value for all weights in the table. Empty sets sum to NULL because there is no element to map the index

Median with Characteristic Function

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the Median (SQL Forum 1993, 1994). They define a characteristic function, which they call delta, using the SIGN() function. The delta or characteristic function accepts a Boolean expression as an argument and returns one if it is TRUE and zero if it is FALSE or UNKNOWN. We can construct the delta function easily with a CASE expression.

The authors also distinguish between the statistical Median, whose value must be a member of the set, and the financial Median, whose value is the average of the middle two members of the set. A statistical Median exists when there is an odd number of items in the set. If there is an even number of items, you must decide if you want to use the highest value in the lower half (they call this the left Median) or the lowest value in the upper half (they call this the right Median).

The left statistical Median of a unique column can be found with this query, if you will assume that we have a column called bin_nbr that represents the storage location of a part in sorted order.

Changing the direction of the test in the HAVING clause will allow you to pick the right statistical Median if a central element does not exist in the set. You will also notice something else about the Median of a set of unique values: It is usually meaningless. What does the Median bin_nbr number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a Median.

The statistical Median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap; thus, the left statistical Median is found by.

Should Parts contain an even number of entries with the (COUNT(*)/ 2) entry not repeated, this query may return FLOOR(AVG(DISTINCT part_wgt)), as happens when SQL Server computes an average of integers. This can be fixed by changing the inner SELECT to:

Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items. Switching the comparison operators in the two CASE expressions will give you the right statistical Median.

Henderson’s Median

Another version of the financial Median, which uses the CASE expression in both of its forms, is:

This answer is due to Ken Henderson. The only instance in which [[2]] is correct is when Parts has an odd number of entries and the middle (COUNT(*) / 2 + 1) entry is not repeated in the data.

Celko’s Third Median

Another approach involves looking at a picture of a line of sorted values and seeing where the Median would fall. Every value in column part_wgt of the table partitions the table into three sections, the values that are less than part_wgt, equal to part_wgt or greater than part_wgt. We can get a profile of each value with a tabular subquery expression.

Deriving the query is a great exercise in SQL logic and math. Don’t do it unless you just like the exercise. The query finally becomes:

OLAP Median

The new OLAP function in the SQL-99 Standard allow you to replace COUNT() functions with row numberings. Let’s assume we have a table or VIEW named RawData that has only one numeric column. The obvious solution for data with an odd number of elements is:

You can combine both of these into one query with a CASE expression and I leave that as an exercise for the reader.

But there is another question we have not asked about an even number of rows. Assume the data set look like {1, 2, 2, 3, 3, 3} so the usual median would be AVG({2, 3}) = 2.5 as the measure of central tendency. But if we consider where the real trend is, then we would use the set-oriented weighted median with AVG({2, 2, 3, 3, 3}) = 13/5 = 2.6 as the measure of central tendency.

This leads to the final query which can handle odd and even counts and do the weighted Median:

This might be slower than you would like, since the CTE might do two sorts to get the “hi” and “lo” values. At the time of this writing, ROW_NUMBER() is new to all SQL products, so not as much work has gone into optimizing them. A smarter optimizer would map the i-th element of a sorted list of (n) items in ASC order to the (n-i+1)-th element for DESC.

Want to try your own versions?

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Joe's contributions