#### The series so far:

- Statistics in SQL: Pearson’s Correlation
- Statistics in SQL: Kendall’s Tau Rank Correlation
- Statistics in SQL: Simple Linear Regressions
- Statistics in SQL: The Kruskal–Wallis Test
- Statistics in SQL: The Mann–Whitney U Test
- Statistics in SQL: Student's T Test

William Sealy Gosset was one of the bright Oxbridge graduates hired by Guinness Brewery in Dublin to come up with a scientific basis for improving the quality of beer.

The problem was that brewers made changes intended to improve the product and checked the quality of the ingredients, but it was all rather uncertain. It was difficult to be sure that changes were definitely having the intended effect, or that, for example, limited measurements of moisture content in the barley were representative of the entire batch.

Guinness wanted William Gosset to find a more scientific way of driving up the quality of their product and the efficiency of production. Gosset was a statistician and he was convinced that it was possible to test small samples of a normally distributed variable (e.g. moisture content in barley) and come to a result that could be applied to the whole population (e.g. the entire shipment). One could, in effect, judge an improvement in the product without drinking all the beer, by assessing samples of the beer in the laboratory. At the time, there was no simple parametric test to do it.

He soon joined Karl Pearson’s team on sabbatical and subsequently developed the “t-statistic”. He was initially uncertain of the t distribution, and so used measurements of the heights and left middle finger lengths of criminals in a local prison to work out its shape empirically. He was then able to derive the mathematical theory behind it, and published his work under the pseudonym of “Student”.

The t-statistic applies to a range of tests, such as confidence intervals for the difference between two population means and linear regression analyses, all of which are based on the assumption that the variables being measured are normally distributed in the population (if we cannot assume that, we must use the non-parametric Mann–Whitney test). It includes “Student’s” t-test for assessing the statistical significance of the difference between two sample means. In essence, it tells you how confident you can be that any given range contains the true mean.

There are several variations of the t-test, including:

- One-sample mode, where you test a sample mean against a hypothesized mean
- Independent-samples mode, where you test two independent groups for equality of means
- Paired-samples mode, where you test two dependent (paired) groups for equality of means
- A test of hypotheses about the coefficient of correlation
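To make the first of these concrete, here is a minimal one-sample sketch in Python; the ten measurements and the hypothesized mean of 25 are invented purely for illustration:

```python
from math import sqrt
from statistics import mean, stdev

# One-sample mode: does this sample plausibly come from a population
# with the hypothesized mean? (Data and target are made up.)
sample = [27, 30, 15, 17, 21, 24, 26, 20, 21, 36]
hypothesized_mean = 25

# t = (sample mean - hypothesized mean) / standard error of the mean
t = (mean(sample) - hypothesized_mean) / (stdev(sample) / sqrt(len(sample)))
df = len(sample) - 1
print(round(t, 2), df)  # compare |t| against a t table at df degrees of freedom
```

Here |t| is well below any conventional critical value for nine degrees of freedom, so this invented sample gives no reason to doubt the hypothesized mean.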

We’ll be doing the second of these, where we test the hypothesis of the equality of means on the assumption that the variances are equal. How do we know that the variances are equal? Normally, you’d check by using Fisher’s F-test to verify homoscedasticity (homogeneity of variances), or use Levene’s test: we’ll end up doing the latter, but not in SQL!
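As a rough sketch of that F-test check (in Python rather than SQL), the F statistic is simply the ratio of the larger sample variance to the smaller, compared against an F table with n₁−1 and n₂−1 degrees of freedom; here it is run on the same two samples that the SQL below uses:

```python
from statistics import variance

# The same experimental and control samples as in the SQL script below.
experimental = [27, 30, 15, 17, 21, 24, 26, 20, 21, 36,
                21, 31, 23, 32, 30, 20, 22, 32, 19, 25]
control = [2, 23, 32, 37, 14, 12, 15, 23, 31, 17,
           8, 13, 22, 14, 22, 12, 14, 22, 26, 16]

# Fisher's F statistic: the larger sample variance over the smaller.
v1, v2 = variance(experimental), variance(control)
f = max(v1, v2) / min(v1, v2)
print(round(f, 2))  # compare against a critical F value for (19, 19) df
```

This is only a plausibility check, not a substitute for the proper test: Fisher’s F-test is itself sensitive to non-normality, which is one reason Levene’s test is often preferred.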

```sql
DECLARE @t NUMERIC(18,2) --the value of t
DECLARE @df INT --the degrees of freedom
DECLARE @Scores TABLE (sample INT, variable INT) --the test scores
INSERT INTO @scores (sample, variable)
  VALUES
    (1,27),(1,30),(1,15),(1,17),(1,21),(1,24),(1,26),(1,20),(1,21),(1,36),
    (1,21),(1,31),(1,23),(1,32),(1,30),(1,20),(1,22),(1,32),(1,19),(1,25),
    (2,2),(2,23),(2,32),(2,37),(2,14),(2,12),(2,15),(2,23),(2,31),(2,17),
    (2,8),(2,13),(2,22),(2,14),(2,22),(2,12),(2,14),(2,22),(2,26),(2,16)
/* We have two independent samples. Sample 1 is the experimental group,
   2 is the control. They could be the moisture content of barley or the
   thickness of the shell of shellfish. It might even be the diameter of
   tomatoes. To make things clearer, we'll put all the calculations into
   variables: standard deviation, average and number in sample. */
DECLARE @st1 NUMERIC(18,2), @x1 NUMERIC(18,2), @N1 NUMERIC(18,2),
        @st2 NUMERIC(18,2), @x2 NUMERIC(18,2), @N2 NUMERIC(18,2);
/* so we get the standard deviation, average and count for each sample */
--the experimental
SELECT @st1 = StDev(s.variable),
       @x1 = Avg(Convert(NUMERIC(18, 2), variable)),
       @N1 = Count(*)
  FROM @Scores AS s
  WHERE s.sample = 1;
--the control
SELECT @st2 = StDev(s.variable),
       @x2 = Avg(Convert(NUMERIC(18, 2), variable)),
       @N2 = Count(*)
  FROM @Scores AS s
  WHERE s.sample = 2;
--and we do the calculation
SELECT @t = (@x1 - @x2) / Sqrt((Square(@st1) / @N1) + (Square(@st2) / @N2)),
       @df = @N1 + @N2 - 2
SELECT 'The t value was ' + Convert(VARCHAR(20), @t)
     + ' with ' + Convert(VARCHAR(20), @df) + ' degrees of freedom'
/* Now we just have to look up the critical t values for the
   degrees of freedom of the test. */
;WITH CriticalTValues (Df, p05, p025, p01, p005)
AS (SELECT Df, p05, p025, p01, p005
      FROM (VALUES
        (1,12.71,25.45,63.66,127.32), (2,4.30,6.20,9.92,14.09),
        (3,3.18,4.17,5.84,7.45), (4,2.78,3.50,4.60,5.60),
        (5,2.57,3.16,4.03,4.77), (6,2.45,2.97,3.71,4.32),
        (7,2.36,2.84,3.50,4.03), (8,2.31,2.75,3.36,3.83),
        (9,2.26,2.68,3.25,3.69), (10,2.23,2.63,3.17,3.58),
        (11,2.20,2.59,3.11,3.50), (12,2.18,2.56,3.05,3.43),
        (13,2.16,2.53,3.01,3.37), (14,2.14,2.51,2.98,3.33),
        (15,2.13,2.49,2.95,3.29), (16,2.12,2.47,2.92,3.25),
        (17,2.11,2.46,2.90,3.22), (18,2.10,2.44,2.88,3.20),
        (19,2.09,2.43,2.86,3.17), (20,2.09,2.42,2.84,3.15),
        (21,2.08,2.41,2.83,3.14), (22,2.07,2.41,2.82,3.12),
        (23,2.07,2.40,2.81,3.10), (24,2.06,2.39,2.80,3.09),
        (25,2.06,2.38,2.79,3.08), (26,2.06,2.38,2.78,3.07),
        (27,2.05,2.37,2.77,3.06), (28,2.05,2.37,2.76,3.05),
        (29,2.04,2.36,2.76,3.04), (30,2.04,2.36,2.75,3.03),
        (40,2.02,2.33,2.70,2.97), (60,2.00,2.30,2.66,2.92),
        (120,1.98,2.27,2.62,2.86), (10000,1.96,2.24,2.58,2.81)
      ) CriticalValues (Df, p05, p025, p01, p005))
SELECT 'The t value was'
     + CASE WHEN @t > CriticalTValues.p005 THEN ' highly significant at the 0.005 level'
            WHEN @t > CriticalTValues.p01 THEN ' significant at the 0.01 level'
            WHEN @t > CriticalTValues.p025 THEN ' significant at the 0.025 level'
            WHEN @t > CriticalTValues.p05 THEN ' just significant at the 0.05 level'
            ELSE 'n''t significant'
       END
  FROM CriticalTValues
  --and get the lowest version with sufficient degrees of freedom.
  WHERE df = (SELECT Min(df) FROM CriticalTValues ctet WHERE ctet.df >= @df)
```

…which, when we execute it gives…

```
(40 row(s) affected)

The t value was 2.54 with 38 degrees of freedom

(1 row(s) affected)

The t value was significant at the 0.025 level

(1 row(s) affected)
```
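If you want an independent check on that figure before reaching for PSPP, the same arithmetic takes only a few lines of Python; this is just a sketch mirroring the SQL above, not a replacement for it:

```python
from math import sqrt
from statistics import mean, stdev

experimental = [27, 30, 15, 17, 21, 24, 26, 20, 21, 36,
                21, 31, 23, 32, 30, 20, 22, 32, 19, 25]
control = [2, 23, 32, 37, 14, 12, 15, 23, 31, 17,
           8, 13, 22, 14, 22, 12, 14, 22, 26, 16]

n1, n2 = len(experimental), len(control)
# Same formula as the SQL: difference of the means over the combined
# standard error of the two samples.
t = (mean(experimental) - mean(control)) / sqrt(
    stdev(experimental) ** 2 / n1 + stdev(control) ** 2 / n2)
df = n1 + n2 - 2
print(round(t, 2), df)  # 2.54 38, agreeing with the SQL output
```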

Now we can simply check these results by running the same test in PSPP, the open-source equivalent of SPSS, which gives us…

```
Group Statistics
#===#=====#==#=====#==============#=========#
#   |group| N| Mean|Std. Deviation|S.E. Mean#
#===#=====#==#=====#==============#=========#
#scr|1    |20|24.60|          5.71|     1.28#
#   |2    |20|18.75|          8.56|     1.91#
#===#=====#==#=====#==============#=========#

Independent Samples Test

                        Levene's Test        t-test for Equality of Means
                        for Equality
                        of Variances
                                                   Sig.(2-   Mean    Std.Error  95% CI of the
                        F      Sig.   t     df     tailed)   Diff.   Diff.      Difference
                                                                                Lower  Upper
scr  Equal variances    2.98   .093   2.54  38.00   .015     5.85    2.30       1.19   10.51
     assumed
     Equal variances                  2.54  33.11   .016     5.85    2.30       1.17   10.53
     not assumed
```
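PSPP’s 95% confidence interval for the difference (1.19 to 10.51) can also be reproduced approximately from the same summary statistics. The sketch below hardcodes 2.02 as the two-tailed 5% critical value (the df-40 row of the lookup table above, the nearest tabulated row to 38 degrees of freedom), so the bounds come out a hundredth or so adrift of PSPP’s, which uses the exact value:

```python
from math import sqrt
from statistics import mean, stdev

experimental = [27, 30, 15, 17, 21, 24, 26, 20, 21, 36,
                21, 31, 23, 32, 30, 20, 22, 32, 19, 25]
control = [2, 23, 32, 37, 14, 12, 15, 23, 31, 17,
           8, 13, 22, 14, 22, 12, 14, 22, 26, 16]

diff = mean(experimental) - mean(control)            # 5.85
se = sqrt(stdev(experimental) ** 2 / len(experimental)
          + stdev(control) ** 2 / len(control))      # about 2.30
t_crit = 2.02  # two-tailed 5% critical value, df-40 row of the table above
lower = diff - t_crit * se
upper = diff + t_crit * se
print(round(lower, 2), round(upper, 2))  # about 1.2 and 10.5
```

Since the interval excludes zero, it tells the same story as the significance verdict: the difference between the two group means is unlikely to be chance.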

Now, how did I manage to run that test in GNU PSPP? Hmm. That will have to wait for another blog, as it is a bit off-topic!