{"id":70583,"date":"2017-04-13T13:38:50","date_gmt":"2017-04-13T13:38:50","guid":{"rendered":"https:\/\/www.simple-talk.com\/?p=70583"},"modified":"2021-09-29T16:21:15","modified_gmt":"2021-09-29T16:21:15","slug":"statistics-sql-simple-linear-regressions","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/statistics-sql-simple-linear-regressions\/","title":{"rendered":"Statistics in SQL: Simple Linear Regressions"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-pearsons-correlation\/\">Statistics in SQL: Pearson\u2019s Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kendalls-tau-rank-correlation\/\">Statistics in SQL: Kendall\u2019s Tau Rank Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-simple-linear-regressions\/\">Statistics in SQL: Simple Linear Regressions<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kruskal-wallis-test\/\">Statistics in SQL: The Kruskal\u2013Wallis Test<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-mann-whitney-u-test\/\">Statistics in SQL:  The Mann\u2013Whitney U Test<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/statistics-sql-students-t-test\/\">Statistics in SQL:  Student's T Test<\/a><\/li>\n<\/ol>\n\n<p>Let\u2019s imagine that we have two variables, X and Y which we then plot using scatter graph.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" class=\"wp-image-70584\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/c-users-andrew-clarke-skydrive-documents-scatterp.gif\" alt=\"C:\\Users\\andrew.clarke\\SkyDrive\\Documents\\scatterPlot.gif\" \/><\/p>\n<p>It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.<\/p>\n<p>Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SET ARITHABORT ON;\r\n\r\nDECLARE @OurData TABLE\r\n    (\r\n    x NUMERIC(18,6) NOT NULL,\r\n    y NUMERIC(18,6) NOT NULL\r\n    );\r\n  INSERT INTO @OurData\r\n    (x, y)\r\n  SELECT \r\n   x,y\r\n   FROM (VALUES\r\n  (1,32),(1,23),(3,50),(11,37),(-2,39),(10,44),(27,32),(25,16),(20,23),\r\n  (4,5),(30,41),(28,2),(31,52),(29,12),(50,40),(43,18),(10,65),(44,26),\r\n  (35,15),(24,37),(52,66),(59,46),(64,95),(79,36),(24,66),(69,58),(88,56),\r\n  (61,21),(100,60),(62,54),(10,14),(22,40),(52,97),(81,26),(37,58),(93,71),\r\n  (64,82),(24,33),(112,49),(64,90),(53,90),(132,61),(104,35),(60,52),\r\n  (29,50),(85,116),(95,104),(131,37),(139,38),(8,124)\r\n  )f(x,y)\r\n  SELECT \r\n    ((Sy * Sxx) - (Sx * Sxy))\r\n    \/ ((N * (Sxx)) - (Sx * Sx)) AS a,\r\n    ((N * Sxy) - (Sx * Sy))\r\n    \/ ((N * Sxx) - (Sx * Sx)) AS b,\r\n    ((N * Sxy) - (Sx * Sy))\r\n    \/ SQRT(\r\n        (((N * Sxx) - (Sx * Sx))\r\n         * ((N * Syy - (Sy * Sy))))) AS r\r\n    FROM\r\n      (\r\n      SELECT SUM([@OurData].x) AS Sx, SUM([@OurData].y) AS Sy,\r\n        SUM([@OurData].x * [@OurData].x) AS Sxx,\r\n        SUM([@OurData].x * [@OurData].y) AS Sxy,\r\n        SUM([@OurData].y * [@OurData].y) AS Syy,\r\n        COUNT(*) AS N\r\n        FROM @OurData\r\n      ) sums;<\/pre>\n<p>It gives the result\u2026<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"279\" height=\"39\" class=\"wp-image-70585\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/word-image-24.png\" \/><\/p>\n<p>Alpha was calculated by this expression, \u2026<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"197\" height=\"55\" class=\"wp-image-70586\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/word-image-25.png\" \/><\/p>\n<p>\u2026 Beta was calculated thus.<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"105\" height=\"52\" class=\"wp-image-70587\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/word-image-26.png\" \/><\/p>\n<p>\u2026 and Rho by this<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"239\" height=\"67\" class=\"wp-image-70588\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/word-image-27.png\" \/><\/p>\n<p>The Pearson\u2019s Product Moment correlation can be used to calculate the probability of this being a significant correlation via the Fisher Transformation.<\/p>\n<p>The slope (beta) and intercept (alpha) can be used to calculate any point on the trendline.<\/p>\n<p>We can, for example calculate the value of Y when X is 100 by the equation <br \/>\n y\u2019 = a + bx<\/p>\n<p>Is, in our example<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">SELECT 38.656537+(0.202330*100) AS y<\/pre>\n<p>We can calculate the trendline X,Y points by slightly altering the SQL<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">SELECT alpha, beta, rho,\r\n    Minx AS xLineMin, alpha + (beta * Minx) AS yLineMin, \r\n  Maxx AS xLineMax,  alpha + (beta * Maxx) AS yLineMax\r\n  FROM\r\n      (\r\n      SELECT --Sx,Sy,Sxx,Sxy,Syy,N,\r\n        Maxx, Minx, Maxy, Miny,\r\n        ((Sy * Sxx) - (Sx * Sxy))\r\n        \/ ((N * (Sxx)) - (Sx * Sx)) AS alpha,\r\n        ((N * Sxy) - (Sx * Sy))\r\n        \/ ((N * Sxx) - (Sx * Sx)) AS beta,\r\n        ((N * Sxy) - (Sx * Sy))\r\n        \/ SQRT((((N * Sxx) - (Sx * Sx))\r\n                * ((N * Syy - (Sy * Sy)))\r\n               )\r\n              ) AS rho\r\n        FROM\r\n          (\r\n          SELECT SUM([@OurData].x) AS Sx, SUM([@OurData].y) AS Sy,\r\n            SUM([@OurData].x * [@OurData].x) AS Sxx,\r\n            SUM([@OurData].x * [@OurData].y) AS Sxy,\r\n            SUM([@OurData].y * [@OurData].y) AS Syy, COUNT(*) AS N,\r\n            MAX([@OurData].x) AS Maxx, MIN([@OurData].x) AS Minx,\r\n            MAX([@OurData].y) AS Maxy, MIN([@OurData].y) AS Miny\r\n            FROM @OurData\r\n          ) sums\r\n      ) AlphaBetaRho;<\/pre>\n<p>We can then plot a trend line<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"800\" height=\"600\" class=\"wp-image-70589\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2017\/04\/c-users-andrew-clarke-skydrive-documents-scatterp-1.gif\" alt=\"C:\\Users\\andrew.clarke\\SkyDrive\\Documents\\scatterPlot.gif\" \/><\/p>\n<p>We are, of course, assuming a linear relationship and parametric data. With Pearson\u2019s Rho and the number of X-Y pairs, we can go on to estimate the fisher transformation (t= 1.912) and probability of 0.061825.<\/p>\n<p>The PowerShell to do the Gnuplot graphs is here<\/p>\n<pre class=\"theme:powershell-ise lang:ps decode:true  \">Set-Alias GnuPlot 'C:\\Program Files\\gnuplot\\bin\\gnuplot.exe' -Scope Script\r\n$gpscriptsPath='PathToMyScripts'\r\n$ScriptName='currentScript.gp'\r\n$PlotPath='PathToMyGraphsAndPlots\r\n$PlotFilename=\"$plotPath\\scatterPlot.gif\"\r\n#here is the gnuplot script. note that a dollars sign in the script\r\n#has to be escaped because we want to add PowerShell variables\r\n$TheScript=@\"\r\nset termoption enhanced\r\nset terminal gif font 'Buxton Sketch-bold' size 800,600   \r\nset output '$PlotFilename'\r\nsave_encoding = GPVAL_ENCODING\r\nset encoding utf8\r\nset title 'Two Variables' font ',20'\r\nset xlabel 'X' font ',14'\r\nset ylabel 'Y' font ',14'\r\n\r\n`$grid &lt;&lt; EOD\r\n1\t32\r\n1\t23\r\n3\t50\r\n11\t37\r\n-2\t39\r\n10\t44\r\n27\t32\r\n25\t16\r\n20\t23\r\n4\t5\r\n30\t41\r\n28\t2\r\n31\t52\r\n29\t12\r\n50\t40\r\n43\t18\r\n10\t65\r\n44\t26\r\n35\t15\r\n24\t37\r\n52\t66\r\n59\t46\r\n64\t95\r\n79\t36\r\n24\t66\r\n69\t58\r\n88\t56\r\n61\t21\r\n100\t60\r\n62\t54\r\n10\t14\r\n22\t40\r\n52\t97\r\n81\t26\r\n37\t58\r\n93\t71\r\n64\t82\r\n24\t33\r\n112\t49\r\n64\t90\r\n53\t90\r\n132\t61\r\n104\t35\r\n60\t52\r\n29\t50\r\n85\t116\r\n95\t104\r\n131\t37\r\n139\t38\r\n8\t124\r\nEOD\r\n`$line &lt;&lt; EOD\r\n-2   38.2\r\n139  66 \r\n EOD\r\n\r\n\r\nplot '`$grid' using 1:2 pt 7 ps 1.5 lc rgb \"blue\" title '', \\\r\n     '`$line' using 1:2 with line lw 2 lc rgb \"red\" title ''\r\n\"@\r\n#we are using UTF8- usually a good idea with these applications.\r\n[System.IO.File]::WriteAllLines(\"$gpscriptsPath\\$ScriptName\",$TheScript, (New-Object System.Text.UTF8Encoding $False))\r\n#now we simply execute it and admire the gif file containing the plot\r\ngnuplot \"$gpscriptsPath\\$ScriptName\" \r\n\r\n[System.IO.File]::WriteAllLines(\"$gpscriptsPath\\$ScriptName\",$TheScript, (New-Object System.Text.UTF8Encoding $False))  \r\n<\/pre>\n<p><!--\n\nand the subsequent..\n\n\n\n\n\n<ul>\n\t\n\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-simple-linear-regressions\/\">Statistics in SQL: Simple Linear Regressions<\/a><\/li>\n\n\n<\/ul>\n\n\n\n\n\nmore--><\/p>\n","protected":false},"excerpt":{"rendered":"<p>Although linear regressions can get complicated, most jobs involving the plotting of a trendline are easy.  Simple Linear Regression is handy for the SQL Programmer in making a prediction of a linear trend and giving a figure for the level probability for the prediction, and what is more, they are easy to do with the aggregation that is built into SQL.&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,2,143531],"tags":[43637],"coauthors":[6813],"class_list":["post-70583","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-blogs","category-t-sql-programming-sql-server","tag-statisticsinsql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70583","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=70583"}],"version-history":[{"count":8,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70583\/revisions"}],"predecessor-version":[{"id":71972,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70583\/revisions\/71972"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=70583"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=70583"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=70583"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=70583"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}