Correlation is not Causation
Database people are often not data mining people. Oh, we can do a few simple descriptive statistics because they are built into SQL products these days; and we, at least, can produce an average with
AVG() when called upon. But the real purpose of data is to become information. We want to have some idea at a higher level as to what the data means. The ultimate goal is prediction of the future by understanding the present. We want to know cause and effect.
People forget, or don’t know, that correlation is not causation. A necessary cause is one which must be present for an effect to happen — a car has to have gas to run. A sufficient cause will bring about the effect by itself — dropping a hammer on your foot will make you scream in pain, but so will having your hard drive crash. A contributory cause is one which helps the effect along, but would not be necessary or sufficient by itself to create the effect. There are also coincidences, where one thing happens at the same time as another, but without a causal relationship.
‘A correlation between two
measurements, say X and Y,
is basically a formula that
allows you to predict one
measurement given the
other, plus or minus some
A correlation between two measurements, say X and Y, is basically a formula that allows you to predict one measurement given the other, plus or minus some error range. For example, if I lock a cannon at a certain angle and keep shooting it with the same amount of gunpowder, I could expect to place the cannon ball, within, say, a two meter radius of the target most of the time. Once in awhile, the cannonball will be dead on target and other times it could be several meters away, but most of the time I would be within my target area. But how well do the angle and the weight of the gunpowder predict hitting the target?
We like linear expressions and polynomials because the math is easier, but life is not always that kind. You always start with a simple linear correlation; if you are lucky, it works and if not then you have a start. The strength of the prediction is called the coefficient of correlation. There are several kinds, but Pearson’s r is the most common one, and is denoted by the variable r where (-1 <= r <= 1). A coefficient of correlation of ‘negative one’ is absolute negative correlation; when X happens, then Y never happens. Positive one is “cause and effect” or perfect correlation; when X happens, then Y also happens. Zero coefficient of correlation means that X and Y happen independently of each other.
In the real world, you expect stronger correlations when you deal with physics, machinery and systems without a lot of variables. But when you deal with people, if you get r = 0.7 or better then you need to look for reasons. You are probably dealing with a restricted population and cannot generalize.
Why variables associate
There are five ways two variables can be related to each other.
- The truth could be that X causes Y. You can estimate the chirp rate of a cricket from the temperature in Fahrenheit using the formula (chirps = 3.777 * degrees – 137.22), with r = 0.9919 accuracy. However, nobody believes that crickets cause temperature changes.
- The truth could be that Y causes X, case two. This is where your get a logical fallacy and superstitions. American supply planes in the Pacific theater in World War II would drop cargo via parachute on primitive islands where the US and Japan had military bases during and after World War II. The Islanders went from the Stone Age to the 20-th century overnight; medicine, canned food, canvas tents, steel tools, and sometimes weapons are very impressive. And even more so when they drop from the sky. Local religions grew up around imitating what they had seen the soldiers do. They carved wooden headphones to talk to the planes, built control towers, learned to wave landing signal flags and lit signal fires on runways. Every now and then at first signal fires worked by tricking an off-course cargo plane to make a drop. This behavior is called “sympathetic magic” and appears in many cultures. You mimic the effects to get the cause. The war ended and the airbases were abandoned, so the “Manna from Heaven” stopped. But even today, the John Frum cult is still active on the island of Tanna, Vanuatu.
- The third case is that X and Y interact with each other. Supply and demand curves are an example where as one goes up, the other goes down (negative feedback in computer terms). A more horrible example is drug addiction where the user requires larger and larger doses to get the desired effect (positive feedback in computer terms), as opposed to a habituation where the usage hits an upper level and stays there.
- The fourth case is that any relationship is pure chance. Any two trends in the same direction will have some correlation, so it should not surprise you that once in awhile, two will match very closely. The Latin name for this fallacy is “Post hoc ergo propter hoc”; it means “after this, therefore because of this”, and it is best seen in a place with gambling. The dice do not know that you are wearing your lucky socks. But if all you remember is you won when you wore the lucky socks and/or lost when you did not, then you create a superstition for yourself.
- The final case is where the two variables are effects of another variable which is outside the study. The most common unseen variables are changes in a common environment. For example, severe hay fever attacks go up when corn prices go down. They share a common environmental element — good weather. Good weather means a bigger corn crop and hence lower prices, but it also means more ragweed and pollen and hence more hay fever attacks. Likewise, spouses, who live pretty much the same lifestyle, will tend to have the same medical problems from a common shared environment and set of habits.
If you assume the wrong one of these five, you can create superstition. Even experts get it wrong and cause waves of panic in popular culture. I was looking at some old magazine columns of mine from ten years ago. The more things change, the more they stay the same. The “Fear Du Jour” back then was “second hand” cigarette smoke and cell phones giving you brain cancer; today we have Global Warming (which replaced the previous “Pop Dread” of a new Ice Age) and a second Great Depression. I remember a study done on popular fears which used the term “Pop Dread” and gave a list of cycles that Western media culture goes through; the author classified them by financial, ecological, cultural, medical and so forth. Each had a regular period of so many years when it would dominate the popular press. There would always be experts who had been warning us about the new danger and a ton of statistics to prove they were right.
The 1994 March issue of DISCOVERY magazine had commentary columns entitled “Counting on Dyscalculia” by John Allen Paulos, the man who coined the word “innumeracy” in his 1990 best seller of the same title. His particular topic was health statistics since those create a lot of “pop dread” when they get played up in the media.
One of his examples in the article was a widely covered lawsuit in which a man alleged a causal connection between his wife’s frequent use of a cellular phone and her subsequent brain cancer. But brain cancer is a rare disease that strikes approximately 7 out of 100,000 people per year. Given the large population of the United States, this is still about 17,500 new cases per year and that number has held pretty steady for years.
There are over 10 million cellular phone users in the United States. If there were a causal relationship, then there would be an increase in cases as cellular phone usage increased. In fact, if we found that there were less than 7000 cases among cellular phone users we could use the same argument to “prove” that cellular phones prevent brain cancer.
In 2005, a Danish study looked at the cell phone usage of brain tumor patients and concluded that the two were unlikely linked. More recent studies in France and Norway concluded the same thing. London’s Institute of Cancer Research study showed that there is no risk of developing brain cancer from cell phone usage.
The study investigated 966 brain cancer patients, and showed that there was no risk of gliomas, a serious type of brain cancer, when using a cell phone over time. The nice part of such a study is that you have all the data in the billing records – first usage, years of ownership, the cumulative number of calls and the hours of use were all there. They also looked at the type of cell phone since earlier cell phones used higher power analog signals compared to modern digital phones. It did not matter; there was no increased risk of developing brain cancer.
The original version of this was that living near power lines was a cause of cancer. That hysteria started with one flawed epidemiogical study in 1979. Hundreds of follow up studies found no correlation at all. But a whole industry was built on re-fitting houses to shield them from EMF (Electro Magnetic Fields) and selling expensive detectors. Myself, I just add another layer of aluminum foil to my hat I use to keep the Martians from reading my brain waves. It is a nice simple engineering solution that does not involve government intervention in my life.
Logic does not stop anecdotal stories or exceptional cases from over-ridding math. The current line is that Senator Edward (“Ted”) Kennedy’s lethal brain cancer could have been prevented if he had not used his cell phone so much. Celebrities carry a lot of weight in the popular press, so if you love the Kennedy family, you will fight brain tumors by opposing cell phones and sending me money. Senator Arlen Specter was diagnosed and treated for a brain cancer in 1993 wants Senate hearings on the matter in September. This will give us a run of people with brain cancers, those who have lost loved ones to the disease (actually there are many different kinds, but don’t let that ruin a good story) and a host of “cell phone danger” experts on television.
Not exactly logic, but if you scream it loudly and dance around, you can gain political power and funding. So the question becomes how do we cheat without outright lying? The confidence level is related to the coefficient of correlation, but it is expressed as a percentage. It says that x% of the time the relationship you are seeing would not happen by chance. It also gives a plus or minus percentage of error. You have heard that used in surveys that are quoted on television and radio. The bigger the sample (assuming it is a true random sample), the smaller that plus or minus interval.
The Confidence Level
A 99% confidence level is the standard for serious scientific studies. A 95% confidence level is the one used in industry, surveying and such things. A 90% confidence level doubles the chance of errors and lets you slip it a lot of data to meet your agenda. Below that, you are using garbage stats for exploring or fraud.
If you would like to play around with the concept, go to http://www.raosoft.com/samplesize.html. This company has made survey software for many years and their website has an on-line sample size calculator. You provide the confidence level, population size and some other simple values. Push the button and get the minimum sample size you need back.
Calculating Pearson’s r
So, after arming you with several cautions, here is a simple correlation in SQL. There are many ways of calculating correlation which will not give the same coefficient for the same data, but are all valid measures of the strength of association, and you need to choose the right one depending on the nature of the data. One of the most useful ones is Pearson’s Product Moment Coefficient (Pearson’s r), or the linear correlation coefficient. It measures the strength of the linear association between two variables. It is only possible to draw clear inferences from a correlation based on sample data if the data distribution has bivariate normality and is of a reasonable size
The formula translates into SQL in a straightforward manner.
CREATE TABLE Samples
(sample_name CHAR(3) NOT NULL PRIMARY KEY, x FLOAT NOT NULL, y FLOAT NOT NULL);
INSERT INTO Samples
VALUES ('a', 1.0, 2.0),
('b', 2.0, 5.0),
('c', 3.0, 6.0);
-- r = 0.9608
DECLARE @x_avg FLOAT, @y_avg FLOAT;
SELECT @x_avg = AVG(X), @y_avg = AVG(y) FROM Samples;
SELECT SUM((x - @x_avg)*(y - @y_avg))
/ SQRT(SUM(POWER((x - @x_avg), 2)) * SUM(POWER((y - @y_avg), 2)))
AS pearson_r FROM Samples;
But you can use an alternative method that gets round needing to know the averages before you do the aggregation. In the second version, the aggregation is only done once
FROM (SELECT SUM(x) AS SumX,
SUM(Y) AS sumY,
SUM(X*X) AS SumX2,
SUM(Y*Y) AS sumY2,
SUM(X*Y) AS sumXY,
COUNT(*) AS N
Calculating Spearman’s Rank Order Correlation
Spearman’s Rank Order Correlation coefficient is devised to measure the correlation between ordinal rankings, and is much used for checking how well the judges of any subjective assessment such as the quality of ballroom dancing or the breeding of dogs actually agree. Imagine we have two judges ranking the quality of flower arrangements. Each judge ranks the flower arrangements in order. You can then calculate the correlation as the measure of how well the judges agree
CREATE TABLE FlowershowRankings
(arranger_name VARCHAR(10) NOT NULL PRIMARY KEY,
first_judge_ranking INTEGER NOT NULL,
second_judge_ranking INTEGER NOT NULL);
INSERT INTO FlowershowRankings
VALUES ('Bob', 1, 2),
('Andrea', 3, 1),
('Katie', 2, 3),
('Richard', 4, 4),
('Tony', 5 , 6),
('Chris',6 , 5),
('Alice', 7, 7);
--The Spearman Correlation Coefficient is:
SELECT 1 - (6* SUM(POWER(first_judge_ranking - second_judge_ranking, 2)))
/ CAST((COUNT(*) * (POWER(COUNT(*), 2)-1)) AS FLOAT) AS spearman_rho