{"id":71959,"date":"2017-08-03T11:29:58","date_gmt":"2017-08-03T11:29:58","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=71959"},"modified":"2021-09-29T16:21:09","modified_gmt":"2021-09-29T16:21:09","slug":"statistics-sql-students-t-test","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/other\/statistics-sql-students-t-test\/","title":{"rendered":"Statistics in SQL: Student&#8217;s t-test"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-pearsons-correlation\/\">Statistics in SQL: Pearson\u2019s Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kendalls-tau-rank-correlation\/\">Statistics in SQL: Kendall\u2019s Tau Rank Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-simple-linear-regressions\/\">Statistics in SQL: Simple Linear Regressions<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kruskal-wallis-test\/\">Statistics in SQL: The Kruskal\u2013Wallis Test<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-mann-whitney-u-test\/\">Statistics in SQL:  The Mann\u2013Whitney U Test<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/statistics-sql-students-t-test\/\">Statistics in SQL:  Student's T Test<\/a><\/li>\n<\/ol>\n\n<p>William Sealy Gosset was one of the bright Oxbridge graduates hired by Guinness Brewery in Dublin to come up with a scientific basis for improving the quality of beer.<\/p>\n<p>The problem was that brewers did things that were intended to improve the product and to check the quality of the ingredients but it was all rather uncertain. It was difficult to be certain that changes were definitely having the intended effect and it was a problem to be certain that, for example, limited measurements of moisture content in the barley were representative of the entire batch.<\/p>\n<p>Guinness wanted William Gosset to find a more scientific way of driving up the quality of their product and the efficiency of production. Gosset was a statistician and he was convinced that it was possible to test small samples of a normally distributed variable (e.g. moisture content in barley) and come to a result that could be applied to the whole population (e.g. the entire shipment). One could, in effect, judge an improvement in the product without drinking all the beer, by assessing samples of the beer in the laboratory. At the time, there was no simple parametric test to do it.<\/p>\n<p>He soon joined Karl \u00a0Pearson&#8217;s team on sabbatical and subsequently developed the &#8220;t-statistic&#8221;. He was initially uncertain of the t distribution and so used measurements of the heights and left middle finger lengths of criminals in a local prison to work out the t distribution. He was then able to derive the mathematical theory of it and published his work under the pseudonym of &#8220;Student&#8221;.<\/p>\n<p>The t-statistic applies to a range of tests such as the confidence intervals for the difference between two population means and linear regression analyses all of which are based on the assumption that the variables being measured are distributed normally in the population. (if we cannot assume that, we must use the non-parametric Mann-Whitney test) It includes &#8220;Student&#8217;s&#8221; t-test for assessing the statistical \u00a0significance of the difference between two sample means. It basically tells you how confident you can be that any given range would contain the true mean.<\/p>\n<p>there are several variations of the T Test including<\/p>\n<ul>\n<li>One Sample Mode, where you are testing against a hypothesized mean<\/li>\n<li>Independent Samples Mode, where there are two independent groups for equal mean<\/li>\n<li>Paired Samples Mode, testing two interdependent groups for equal mean<\/li>\n<li>Test of hypothesis about the coefficient of correlation<\/li>\n<\/ul>\n<p>We\u2019ll be doing the second of these, the where we test the hypothesis of the equality of means where the variances are equal. How do we know that the variances are equal? Normally, you&#8217;d check by using a Fisher\u2019s F-test to verify the homoskedasticity (homogeneity of variances). or use Levene&#8217;s test: we\u2019ll end up doing the latter but not in SQL!<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">DECLARE @t NUMERIC(18,2) --the value of t\r\nDECLARE @df INT --the degrees of freedom\r\nDECLARE @Scores TABLE (sample int, variable int) --the test scores\r\nINSERT INTO @scores (sample, variable)\r\nVALUES\r\n(1,27),(1,30),(1,15),(1,17),(1,21),(1,24),(1,26),(1,20),(1,21),(1,36),\r\n(1,21),(1,31),(1,23),(1,32),(1,30),(1,20),(1,22),(1,32),(1,19),(1,25),\r\n(2,2),(2,23),(2,32),(2,37),(2,14),(2,12),(2,15),(2,23),(2,31),(2,17),\r\n(2,8),(2,13),(2,22),(2,14),(2,22),(2,12),(2,14),(2,22),(2,26),(2,16)\r\n\/*\r\nWe have two independent samples. sample 1 is the experimental 2 is the control.\r\nThey could be the moisture content of barley or the thickness of the shell\r\nof shellfish. It might even be the diameter of tomatoes. \r\n\r\n\r\nTo make things clearer, we'll put all the calculations into variables\r\n         Standard deviation Average            number in sample*\/\r\nDECLARE @st1 NUMERIC(18,2), @x1 NUMERIC(18,2), @N1 NUMERIC(18,2),\r\n        @st2 NUMERIC(18,2), @x2 NUMERIC(18,2), @N2 NUMERIC(18,2);\r\n\/* so we get the standard deviation, avarage and count for each sample *\/\r\n--the experimental\r\nSELECT @st1 = StDev(s.variable), @x1 = Avg(Convert(NUMERIC(18, 2), variable)), @N1 = Count(*)\r\n  FROM @Scores AS s\r\n  WHERE s.sample = 1;\r\n--the control\r\nSELECT @st2 = StDev(s.variable), @x2 = Avg(Convert(NUMERIC(18, 2), variable)), @N2 = Count(*)\r\n  FROM @Scores AS s\r\n  WHERE s.sample = 2;\r\n--and we do the calculation\r\nSELECT @t=(@x1 - @x2) \/ Sqrt((Square(@st1) \/ @N1) + (Square(@st2) \/ @N2)),\r\n       @df=@N1+@N2-2\r\n\r\nSELECT 'The t value was '+ Convert(VARCHAR(20),@t)+ ' with '+ Convert(VARCHAR(20),@df)+' degrees of freedom'\r\n\/*\r\nNow we just have to look up the critical T values for the degrees of freedom of the test.\r\n*\/ \r\n;WITH CriticalTValues (Df,p05,p025,p01,p005)\r\nAS\r\n(SELECT Df,p05,p025,p01,p005 from ( VALUES \r\n(1,12.71,25.45,63.66,127.32),\r\n(2,4.30,6.20,9.92,14.09),\r\n(3,3.18,4.17,5.84,7.45),\r\n(4,2.78,3.50,4.60,5.60),\r\n(5,2.57,3.16,4.03,4.77),\r\n(6,2.45,2.97,3.71,4.32),\r\n(7,2.36,2.84,3.50,4.03),\r\n(8,2.31,2.75,3.36,3.83),\r\n(9,2.26,2.68,3.25,3.69),\r\n(10,2.23,2.63,3.17,3.58),\r\n(11,2.20,2.59,3.11,3.50),\r\n(12,2.18,2.56,3.05,3.43),\r\n(13,2.16,2.53,3.01,3.37),\r\n(14,2.14,2.51,2.98,3.33),\r\n(15,2.13,2.49,2.95,3.29),\r\n(16,2.12,2.47,2.92,3.25),\r\n(17,2.11,2.46,2.90,3.22),\r\n(18,2.10,2.44,2.88,3.20),\r\n(19,2.09,2.43,2.86,3.17),\r\n(20,2.09,2.42,2.84,3.15),\r\n(21,2.08,2.41,2.83,3.14),\r\n(22,2.07,2.41,2.82,3.12),\r\n(23,2.07,2.40,2.81,3.10),\r\n(24,2.06,2.39,2.80,3.09),\r\n(25,2.06,2.38,2.79,3.08),\r\n(26,2.06,2.38,2.78,3.07),\r\n(27,2.05,2.37,2.77,3.06),\r\n(28,2.05,2.37,2.76,3.05),\r\n(29,2.04,2.36,2.76,3.04),\r\n(30,2.04,2.36,2.75,3.03),\r\n(40,2.02,2.33,2.70,2.97),\r\n(60,2.00,2.30,2.66,2.92),\r\n(120,1.98,2.27,2.62,2.86),\r\n(10000,1.96,2.24,2.58,2.81))CriticalValues(Df,p05,p025,p01,p005))\r\nSELECT 'The t value was'+\r\n\tCASE WHEN @t&gt;CriticalTValues.p005 THEN ' highly significant at the 0.005 level'\r\n \t WHEN @t&gt;CriticalTValues.p01 THEN ' significant at the 0.01 level'\r\n \tWHEN @t&gt;CriticalTValues.p025 THEN ' significant at the 0.025 level'\r\n \t WHEN @t&gt;CriticalTValues.p05 THEN ' just significant at the 0.05 level'\r\n\t ELSE  'n''t significant' end\r\n FROM CriticalTValues --and get the lowest version with sufficient degrees of freedom.\r\nWHERE df=(SELECT Min(df) FROM  CriticalTValues ctet WHERE ctet.df&gt;=@df)\r\n<\/pre>\n<p>&#8230;which, when we execute it gives&#8230;<\/p>\n<pre>(40 row(s) affected)\r\n\r\nThe t value was 2.54 with 38 degrees of freedom\r\n\r\n(1 row(s) affected)\r\n\r\n\r\nThe t value was significant at the 0.025 level\r\n\r\n(1 row(s) affected)\r\n<\/pre>\n<p>Now we can simply check these results by running the same test in PSPP, the open source equivalent of SPSS and that gives us &#8230;<\/p>\n<pre class=\"\">Group Statistics\r\n#=========#==#=====#==============#=========#\r\n#    group| N| Mean|Std. Deviation|S.E. Mean#\r\n#===#=====#==#=====#==============#=========#\r\n#scr|1    |20|24.60|          5.71|     1.28#\r\n#   |2    |20|18.75|          8.56|     1.91#\r\n#===#=====#==#=====#==============#=========#\r\n\r\nIndependent Samples Test\r\n#=============#=========#=====================================================#\r\n#             # Levene's|             t-test for Equality of Means            #\r\n#             # Test for|                                                     #\r\n#             # Equality|                                                     #\r\n#             #    of   |                                                     #\r\n#             #Variances|                                                     #\r\n#             #---------+-----------------------------------------------------#\r\n\r\n#             #----+----+----+-----+--------+----------+----------+-----------#\r\n#             #    |    |    |     |        |          |          |    95%    #\r\n#             #    |    |    |     |        |          |          | Confidence#\r\n#             #    |    |    |     |        |          |          |Interval of#\r\n#             #    |    |    |     |        |          |          |    the    #\r\n#             #    |    |    |     |        |          |          | Difference#\r\n#             #    |    |    |     |        |          |          +-----+-----#\r\n#             #  F |Sig.|  t |  df |Sig. (2-|   Mean   |Std. Error|Lower|Upper#\r\n#             #    |    |    |     | tailed)|Difference|Difference|     |     #\r\n#===#=========#====#====#====#=====#========#==========#==========#=====#=====#\r\n#scr|Equal    #2.98|.093|2.54|38.00|    .015|      5.85|      2.30| 1.19|10.51#\r\n#   |variances#    |    |    |     |        |          |          |     |     #\r\n#   |assumed  #    |    |    |     |        |          |          |     |     #\r\n#   |Equal    #    |    |2.54|33.11|    .016|      5.85|      2.30| 1.17|10.53#\r\n#   |variances#    |    |    |     |        |          |          |     |     #\r\n#   |not      #    |    |    |     |        |          |          |     |     #\r\n#   |assumed  #    |    |    |     |        |          |          |     |     #\r\n#===#=========#====#====#====#=====#========#==========#==========#=====#=====#\r\n<\/pre>\n<p>Now, how did I manage to run that query in GNU PSPP? (PSPP is the free software version of SPSS). Hmm. That will have to wait for another blog as it is a bit off-topic!<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Many undergraduates have misunderstood the name &#8216;Students&#8217; in the t-test to imply that it was designed as a simple test suitable for students. In fact it was William Sealy Gosset, an Englishman publishing under the pseudonym Student, who developed the t-test and t distribution in 1908, as a way of making  confident predictions from small sample sizes  of  normally-distributed variables.  As Gosset&#8217;s employer was Guinness, the brewer, Phil Factor takes a sober view of calculating it in SQL.&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2,143531],"tags":[5134],"coauthors":[6813],"class_list":["post-71959","post","type-post","status-publish","format-standard","hentry","category-other","category-t-sql-programming-sql-server","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71959","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=71959"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71959\/revisions"}],"predecessor-version":[{"id":74369,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/71959\/revisions\/74369"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=71959"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=71959"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=71959"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=71959"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}