The problem with averages

Broadly speaking, the function of statistics is to reduce a set of data to a single value that has some meaning. The most common example of this by far is the average that we learned in school, which is more properly called the arithmetic mean. In fact, it’s so common that it is a built-in aggregate function in SQL, AVG(<expression>). The problem is that, while it is simple to compute, it has all kinds of pitfalls.

The average of a set of data values is supposed to be the “central value” of the set. Mathematically, it is defined as the sum of all the data values divided by the number of values in the set. For example, the average of 2, 3 and 4 is (2+3+4)/3 = 9/3 = 3, so 3 is the central value of this set.

But there are problems. The first question you have to ask is, does it make any sense to add up the values in your data set? What if my data set is the colors of bikinis sold during the 2021 swimsuit season? The question makes no sense because color is a discrete, nominal variable and doing math on colors is absurd. One of my favorite T-shirts illustrates this problem beautifully. It reads, “On a scale from 1 to 10, what color is your favorite letter of the alphabet?” Some scales just don’t work for some data sets. However, I might be able to ask, in a meaningful way, what the average size of the swimwear sold in that swimsuit season was. This assumes I have a scale for the sizes on which something close to continuous numeric values makes sense. I am looking for an absolute scale, an interval scale, or a ratio scale for my data.

Oops! This doesn’t quite work out either. The complete bikini consists of a top and a bottom, which might be bought as separates today. The complete suit is my “unit of work”, which I have to assemble before I can do any sort of statistics on it. I am going to assume there is a correlation between the sizes of the separates, but it is not perfect, which is why they are sold as separates. The classic example of this aggregation fallacy is when you see that the most common first name on earth is “Mohamed” and the most common last name is “Wang”, and you conclude that the most common full name on earth must be “Mohamed Wang”.
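Here is a minimal sketch of the “Mohamed Wang” fallacy in Python. The list of names is invented purely for illustration; the point is only that the most common parts, counted separately, need not combine into a whole that actually occurs.

```python
from collections import Counter

# Hypothetical (first name, last name) pairs, made up for this illustration.
people = [
    ("Mohamed", "Ali"), ("Mohamed", "Khan"), ("Mohamed", "Hassan"),
    ("Li", "Wang"), ("Wei", "Wang"), ("Fang", "Wang"),
    ("Maria", "Garcia"), ("Maria", "Garcia"),
]

# Count each part separately, then count the assembled unit.
top_first = Counter(first for first, _ in people).most_common(1)[0][0]
top_last = Counter(last for _, last in people).most_common(1)[0][0]
top_full = Counter(people).most_common(1)[0][0]

print(top_first, top_last)               # the "aggregated" name
print(top_full)                          # the most common real full name
print((top_first, top_last) in people)   # does the aggregated name even exist?
```

In this made-up sample, the name built from the two most common parts does not appear in the data at all.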

The next obvious question is, “Is there actually any tendency toward only one central value?” Your data could be random garbage, in which case the average it produces will also be random garbage. Socialist countries where everyone is either very, very rich (the ruling elite) or very, very poor (the rest of the country) will have no actual average people in the population!

Another problem is that averages are strongly influenced by outliers. For example, Elon Musk recently moved to Austin, Texas. The net worth of the “average citizen of Austin” just jumped up substantially. However, I noticed no increase in my personal net worth, and neither did anyone else, with the exception of one or two real estate agents.
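To see how hard a single outlier pulls on the arithmetic mean, here is a small sketch. All of the net worth figures are hypothetical round numbers, not real data about Austin or anyone in it. The median is printed alongside only as a point of comparison.

```python
from statistics import mean, median

# Hypothetical net worths in dollars for five residents (invented values).
residents = [40_000, 85_000, 120_000, 250_000, 600_000]

# One hypothetical newcomer whose net worth dwarfs everyone else's.
newcomer = 200_000_000_000
with_newcomer = residents + [newcomer]

mean_before = mean(residents)
mean_after = mean(with_newcomer)

print(f"mean before:   {mean_before:,.0f}")
print(f"mean after:    {mean_after:,.0f}")
print(f"median before: {median(residents):,.0f}")
print(f"median after:  {median(with_newcomer):,.0f}")
```

Nobody in the original list got any richer, yet the mean moves from the hundreds of thousands into the tens of billions.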

It could get even worse. A severe distribution of data could be a multi-hump camel, where the data is clustered in distinct, separate groups. Is the average college athlete a large varsity football player, a golfer, or a member of the girls’ swim team? Where did you take your sample?

This leads us to a thing called Simpson’s Paradox, which has nothing to do with Homer Simpson. This is when a data set aggregates to a whole with one trend, but the components that went into it show the opposite trend. Consider the US median wage. From 2000 to 2013, the median US wage rose about 1%, adjusted for inflation. However, over the same period, the median wages for high school dropouts, high school graduates with no college education, people with some college education, and people with Bachelor’s or higher degrees all decreased.

In other words, in every educational subgroup the median wage was lower in 2013 than it was in 2000. How can both things be true? The workforce changed over those 13 years: there are now many more college graduates (who get higher-paying jobs) than there were in 2000, but wages for college graduates collectively have fallen at a much slower rate (down 1.2%) than for those of lower educational attainment whose wages have fallen precipitously, down 7.9% for high school dropouts. The growth in the proportion of college graduates swamps the wage decline for specific groups.
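A toy version of this effect can be sketched in a few lines. The shares and wages below are hypothetical, not the actual US figures, and the sketch uses a share-weighted average rather than a median to keep the arithmetic simple. The mechanism is the same: every group declines, but the mix shifts toward the higher-paid group.

```python
# Hypothetical data: group -> (percent of workforce, average wage in dollars).
year_2000 = {"no degree": (70, 30_000), "degree": (30, 60_000)}
year_2013 = {"no degree": (50, 28_000), "degree": (50, 59_000)}

def overall(groups):
    """Workforce-wide average: each group's wage weighted by its percent share."""
    return sum(pct * wage for pct, wage in groups.values()) / 100

avg_2000 = overall(year_2000)
avg_2013 = overall(year_2013)

# Every subgroup's wage fell between the two years...
for group in year_2000:
    print(group, year_2000[group][1], "->", year_2013[group][1])

# ...yet the overall average rose, because more workers are in the higher-paid group.
print("overall", avg_2000, "->", avg_2013)
```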

Geometric Mean

The geometric mean is technically defined as “the n-th root of the product of n numbers”. It is used when working with percentages and growth rates, which are derived from values, while the standard arithmetic mean works with the values themselves. This calculation takes the effects of compounding into account.

The geometric mean signifies the central tendency of a set of numbers by taking the root of the product of their values. Basically, we multiply the n values together and take the n-th root of the resulting number, where n is the total number of values. For example, for a set of two numbers such as 8 and 1, the geometric mean is equal to √(8×1) = √8 = 2√2, or about 2.83.

In general, given the set of observations { x₁, x₂, ..., xₙ}, the formula to calculate the geometric mean is:

Geometric Mean(G) = ⁿ√(x₁ · x₂ · ... · xₙ) = (x₁ · x₂ · ... · xₙ)^(1/n)

This calculation isn’t as formidable as it looks if you remember logarithms.
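The logarithm trick is that the log of a product is the sum of the logs, so the geometric mean is the exponential of the arithmetic mean of the logarithms. The following is a minimal sketch; the function name is my own, and Python 3.8 and later also ships a ready-made statistics.geometric_mean.

```python
import math

def geometric_mean(values):
    """Geometric mean via logarithms: exp of the arithmetic mean of the logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty set of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# The example from the text: the geometric mean of 8 and 1 is 2 * sqrt(2).
print(geometric_mean([8, 1]))
print(2 * math.sqrt(2))
```

Summing logarithms also avoids the overflow you can get from multiplying a long list of large numbers together before taking the root.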

Application of Geometric Mean

The geometric mean has advantages over the arithmetic mean for particular applications. Some stock indexes, such as the Value Line geometric index, are computed with it. In finance, it is used to calculate the average annual return on an investment portfolio, also known as the compound annual growth rate (CAGR), and in biology it appears in studies of cell division and bacterial growth. In general, look for something with a growth rate.
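A quick sketch shows why growth rates call for the geometric mean. The two yearly returns are hypothetical: a 50% gain followed by a 50% loss. The arithmetic mean of the growth factors suggests nothing changed, while the geometric mean reproduces the actual ending balance.

```python
import math

# Hypothetical yearly returns of +50% and -50%, written as growth factors.
factors = [1.5, 0.5]

arithmetic = sum(factors) / len(factors)                # 1.0, "no change"
geometric = math.prod(factors) ** (1 / len(factors))    # square root of 0.75

start = 1000.0
end = start * math.prod(factors)                        # what actually happened

print("arithmetic mean factor:", arithmetic)
print("geometric mean factor: ", geometric)
print("actual ending balance: ", end)
print("compounding the geometric mean:", start * geometric ** len(factors))
```

The geometric mean factor minus one is the compound annual growth rate for this made-up portfolio, and it is negative even though the arithmetic mean return is zero.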

Properties of the Geometric Mean:

1) For a data set of positive values, the geometric mean is always less than or equal to the arithmetic mean; the two are equal only when all the values are the same.

2) If each value in the data set is substituted by the geometric mean, then the product of the values remains unchanged.

3) The geometric mean of the ratios of corresponding observations in two series is equal to the ratio of their geometric means.

4) The geometric mean of the products of corresponding observations in two series is equal to the product of their geometric means.

5) The geometric mean is less influenced by outliers than the arithmetic mean.

6) You can use an online Geometric Mean Calculator or use the GEOMEAN function in Excel.
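The first four properties are easy to spot-check numerically. This sketch uses two short made-up series, chosen so the results are round numbers, and Python’s statistics.geometric_mean (available from Python 3.8). It is a check on one example, not a proof.

```python
import math
from statistics import geometric_mean, mean

# Two small invented series of positive values.
xs = [2.0, 8.0, 4.0]
ys = [1.0, 2.0, 4.0]

g_x = geometric_mean(xs)   # cube root of 64, about 4
g_y = geometric_mean(ys)   # cube root of 8, about 2

# Property 1: the geometric mean does not exceed the arithmetic mean.
print(g_x <= mean(xs))

# Property 2: replacing every value with the geometric mean keeps the product.
print(math.isclose(g_x ** len(xs), math.prod(xs)))

# Property 3: the geometric mean of the ratios equals the ratio of the means.
ratios = [x / y for x, y in zip(xs, ys)]
print(math.isclose(geometric_mean(ratios), g_x / g_y))

# Property 4: the geometric mean of the products equals the product of the means.
products = [x * y for x, y in zip(xs, ys)]
print(math.isclose(geometric_mean(products), g_x * g_y))
```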

Harmonic Mean

This is another one of what are called the Pythagorean means, along with the arithmetic and geometric means. They get this name because they were originally defined geometrically, by constructions. The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the observations, and it is suited to averaging ratios and rates. The most common examples of such ratios are speed (distance over time) and work rate (work over time).

What is the Definition of the Term Harmonic Mean?

The harmonic mean gives less weight to the larger values and more weight to the smaller values. It is often used to average ratios or rates because it damps the effect of unusually large values. It is, however, pulled strongly toward unusually small ones.
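The textbook case is average speed over a round trip. In this sketch the distance and speeds are hypothetical: the same 60 km is driven out at 30 km/h and back at 60 km/h. The true average speed is total distance over total time, and it matches the harmonic mean of the two speeds, not the arithmetic mean.

```python
from statistics import harmonic_mean, mean

# Hypothetical round trip: the same distance at two different speeds.
distance_km = 60
speeds_kmh = [30, 60]

# Time spent on each leg, then total distance divided by total time.
total_time_h = sum(distance_km / s for s in speeds_kmh)
true_average = distance_km * len(speeds_kmh) / total_time_h

print("true average speed:", true_average)
print("harmonic mean:     ", harmonic_mean(speeds_kmh))
print("arithmetic mean:   ", mean(speeds_kmh))
```

The slower leg takes twice as long, so it deserves twice the weight, which is exactly what the harmonic mean gives it.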

Formula of Harmonic Mean

Since the harmonic mean is the reciprocal of the average of the reciprocals, the formula is simple, but it comes with some caveats:

Given the data set {x1, x2, x3,…, xn}

Harmonic Mean(H) = n / ((1/x1)+(1/x2)+(1/x3)+…+(1/xn))

Given data: {8, 9, 6, 11, 10, 5}

Harmonic mean = 6/((⅛)+(1/9)+(⅙)+(1/11)+(1/10)+(⅕))

= 6/0.7937

= 7.560 (to 3 places)
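The worked example above can be checked with exact fractions, which avoids any rounding in the intermediate sum. This is just a verification sketch; the variable names are my own.

```python
from fractions import Fraction

data = [8, 9, 6, 11, 10, 5]

# Sum of the reciprocals, kept as an exact fraction.
reciprocal_sum = sum(Fraction(1, x) for x in data)

# n divided by the sum of the reciprocals.
h = len(data) / reciprocal_sum

print(reciprocal_sum, "=", round(float(reciprocal_sum), 4))
print(h, "=", round(float(h), 3))
```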

The harmonic mean gets its name from the Pythagorean theory of music, which relates the lengths of the strings on a stringed instrument to the chords they produce. It is also used in computing Fibonacci Sequences. By the way, there is also a function in Excel called HARMEAN that you can use.

Properties of Harmonic Mean

1) If all the observations are equal to a constant, say c, then the harmonic mean of the observations is also c.

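2) The formula can be evaluated for a series that contains negative values, as long as the reciprocals do not sum to zero, but the result is not a meaningful average. In practice, the harmonic mean is used only with positive values.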

3) If any of the values in a series is zero, then its harmonic mean cannot be determined, because the reciprocal of zero doesn’t exist. If all the values in the data set are positive, then the relationship among the three means is (Arithmetic mean ≥ Geometric mean ≥ Harmonic mean), with equality only when all the values are the same.

The harmonic mean is the least affected by fluctuations in sampling. But we have to have a complete sampling of data elements, the terms should all be positive, and none of them can be zero.
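As a rough check on the ordering of the three Pythagorean means, this sketch runs Python’s statistics functions over the data set from the worked example, and then over a made-up set of identical values to show the case where all three coincide.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

# The data set from the worked harmonic mean example.
data = [8, 9, 6, 11, 10, 5]

am = mean(data)
gm = geometric_mean(data)
hm = harmonic_mean(data)
print(f"arithmetic {am:.3f} >= geometric {gm:.3f} >= harmonic {hm:.3f}")

# When every observation is the same constant, all three means equal it.
constant = [7, 7, 7, 7]
same = [mean(constant), geometric_mean(constant), harmonic_mean(constant)]
print(all(math.isclose(m, 7) for m in same))
```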

Conclusion

To avoid problems like Elon Musk moving to your city and throwing off the average income, statisticians use other methods like the Harmonic Mean and Geometric Mean to more accurately summarize a set of data.

 

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his dry wit. He is also interested in Science Fiction.