Statistics in SQL: Student’s t-test

Many undergraduates have misunderstood the name 'Students' in the t-test to imply that it was designed as a simple test suitable for students. In fact it was William Sealy Gosset, an Englishman publishing under the pseudonym Student, who developed the t-test and t distribution in 1908, as a way of making confident predictions from small sample sizes of normally-distributed variables. As Gosset's employer was Guinness, the brewer, Phil Factor takes a sober view of calculating it in SQL.

The series so far:

  1. Statistics in SQL: Pearson’s Correlation
  2. Statistics in SQL: Kendall’s Tau Rank Correlation
  3. Statistics in SQL: Simple Linear Regressions
  4. Statistics in SQL: The Kruskal–Wallis Test
  5. Statistics in SQL: The Mann–Whitney U Test
  6. Statistics in SQL: Student's T Test

William Sealy Gosset was one of the bright Oxbridge graduates hired by Guinness Brewery in Dublin to come up with a scientific basis for improving the quality of beer.

The problem was that brewers did things that were intended to improve the product and to check the quality of the ingredients but it was all rather uncertain. It was difficult to be certain that changes were definitely having the intended effect and it was a problem to be certain that, for example, limited measurements of moisture content in the barley were representative of the entire batch.

Guinness wanted William Gosset to find a more scientific way of driving up the quality of their product and the efficiency of production. Gosset was a statistician and he was convinced that it was possible to test small samples of a normally distributed variable (e.g. moisture content in barley) and come to a result that could be applied to the whole population (e.g. the entire shipment). One could, in effect, judge an improvement in the product without drinking all the beer, by assessing samples of the beer in the laboratory. At the time, there was no simple parametric test to do it.

He soon joined Karl  Pearson’s team on sabbatical and subsequently developed the “t-statistic”. He was initially uncertain of the t distribution and so used measurements of the heights and left middle finger lengths of criminals in a local prison to work out the t distribution. He was then able to derive the mathematical theory of it and published his work under the pseudonym of “Student”.

The t-statistic applies to a range of tests such as the confidence intervals for the difference between two population means and linear regression analyses all of which are based on the assumption that the variables being measured are distributed normally in the population. (if we cannot assume that, we must use the non-parametric Mann-Whitney test) It includes “Student’s” t-test for assessing the statistical  significance of the difference between two sample means. It basically tells you how confident you can be that any given range would contain the true mean.

there are several variations of the T Test including

  • One Sample Mode, where you are testing against a hypothesized mean
  • Independent Samples Mode, where there are two independent groups for equal mean
  • Paired Samples Mode, testing two interdependent groups for equal mean
  • Test of hypothesis about the coefficient of correlation

We’ll be doing the second of these, the where we test the hypothesis of the equality of means where the variances are equal. How do we know that the variances are equal? Normally, you’d check by using a Fisher’s F-test to verify the homoskedasticity (homogeneity of variances). or use Levene’s test: we’ll end up doing the latter but not in SQL!

…which, when we execute it gives…

Now we can simply check these results by running the same test in PSPP, the open source equivalent of SPSS and that gives us …

Now, how did I manage to run that query in GNU PSPP? (PSPP is the free software version of SPSS). Hmm. That will have to wait for another blog as it is a bit off-topic!