The series so far:
- Statistics in SQL: Pearson’s Correlation
- Statistics in SQL: Kendall’s Tau Rank Correlation
- Statistics in SQL: Simple Linear Regressions
- Statistics in SQL: The Kruskal–Wallis Test
- Statistics in SQL: The Mann–Whitney U Test
- Statistics in SQL: Student's T Test
Let’s imagine that we have two variables, X and Y which we then plot using scatter graph.
It looks a bit like someone has fired a shotgun at a wall but is there a relationship between the two variables? If so, what is it? There seems to be a weak positive linear relationship between the two variables here so we can be fairly confident of plotting a trendline.
Here is the data, and we will proceed to calculate the slope and intercept. We will also calculate the correlation.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 |
SET ARITHABORT ON; DECLARE @OurData TABLE ( x NUMERIC(18,6) NOT NULL, y NUMERIC(18,6) NOT NULL ); INSERT INTO @OurData (x, y) SELECT x,y FROM (VALUES (1,32),(1,23),(3,50),(11,37),(-2,39),(10,44),(27,32),(25,16),(20,23), (4,5),(30,41),(28,2),(31,52),(29,12),(50,40),(43,18),(10,65),(44,26), (35,15),(24,37),(52,66),(59,46),(64,95),(79,36),(24,66),(69,58),(88,56), (61,21),(100,60),(62,54),(10,14),(22,40),(52,97),(81,26),(37,58),(93,71), (64,82),(24,33),(112,49),(64,90),(53,90),(132,61),(104,35),(60,52), (29,50),(85,116),(95,104),(131,37),(139,38),(8,124) )f(x,y) SELECT ((Sy * Sxx) - (Sx * Sxy)) / ((N * (Sxx)) - (Sx * Sx)) AS a, ((N * Sxy) - (Sx * Sy)) / ((N * Sxx) - (Sx * Sx)) AS b, ((N * Sxy) - (Sx * Sy)) / SQRT( (((N * Sxx) - (Sx * Sx)) * ((N * Syy - (Sy * Sy))))) AS r FROM ( SELECT SUM([@OurData].x) AS Sx, SUM([@OurData].y) AS Sy, SUM([@OurData].x * [@OurData].x) AS Sxx, SUM([@OurData].x * [@OurData].y) AS Sxy, SUM([@OurData].y * [@OurData].y) AS Syy, COUNT(*) AS N FROM @OurData ) sums; |
It gives the result…
Alpha was calculated by this expression, …
… Beta was calculated thus.
… and Rho by this
The Pearson’s Product Moment correlation can be used to calculate the probability of this being a significant correlation via the Fisher Transformation.
The slope (beta) and intercept (alpha) can be used to calculate any point on the trendline.
We can, for example calculate the value of Y when X is 100 by the equation
y’ = a + bx
Is, in our example
1 |
SELECT 38.656537+(0.202330*100) AS y |
We can calculate the trendline X,Y points by slightly altering the SQL
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 |
SELECT alpha, beta, rho, Minx AS xLineMin, alpha + (beta * Minx) AS yLineMin, Maxx AS xLineMax, alpha + (beta * Maxx) AS yLineMax FROM ( SELECT --Sx,Sy,Sxx,Sxy,Syy,N, Maxx, Minx, Maxy, Miny, ((Sy * Sxx) - (Sx * Sxy)) / ((N * (Sxx)) - (Sx * Sx)) AS alpha, ((N * Sxy) - (Sx * Sy)) / ((N * Sxx) - (Sx * Sx)) AS beta, ((N * Sxy) - (Sx * Sy)) / SQRT((((N * Sxx) - (Sx * Sx)) * ((N * Syy - (Sy * Sy))) ) ) AS rho FROM ( SELECT SUM([@OurData].x) AS Sx, SUM([@OurData].y) AS Sy, SUM([@OurData].x * [@OurData].x) AS Sxx, SUM([@OurData].x * [@OurData].y) AS Sxy, SUM([@OurData].y * [@OurData].y) AS Syy, COUNT(*) AS N, MAX([@OurData].x) AS Maxx, MIN([@OurData].x) AS Minx, MAX([@OurData].y) AS Maxy, MIN([@OurData].y) AS Miny FROM @OurData ) sums ) AlphaBetaRho; |
We can then plot a trend line
We are, of course, assuming a linear relationship and parametric data. With Pearson’s Rho and the number of X-Y pairs, we can go on to estimate the fisher transformation (t= 1.912) and probability of 0.061825.
The PowerShell to do the Gnuplot graphs is here
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 |
Set-Alias GnuPlot 'C:\Program Files\gnuplot\bin\gnuplot.exe' -Scope Script $gpscriptsPath='PathToMyScripts' $ScriptName='currentScript.gp' $PlotPath='PathToMyGraphsAndPlots $PlotFilename="$plotPath\scatterPlot.gif" #here is the gnuplot script. note that a dollars sign in the script #has to be escaped because we want to add PowerShell variables $TheScript=@" set termoption enhanced set terminal gif font 'Buxton Sketch-bold' size 800,600 set output '$PlotFilename' save_encoding = GPVAL_ENCODING set encoding utf8 set title 'Two Variables' font ',20' set xlabel 'X' font ',14' set ylabel 'Y' font ',14' `$grid << EOD 1 32 1 23 3 50 11 37 -2 39 10 44 27 32 25 16 20 23 4 5 30 41 28 2 31 52 29 12 50 40 43 18 10 65 44 26 35 15 24 37 52 66 59 46 64 95 79 36 24 66 69 58 88 56 61 21 100 60 62 54 10 14 22 40 52 97 81 26 37 58 93 71 64 82 24 33 112 49 64 90 53 90 132 61 104 35 60 52 29 50 85 116 95 104 131 37 139 38 8 124 EOD `$line << EOD -2 38.2 139 66 EOD plot '`$grid' using 1:2 pt 7 ps 1.5 lc rgb "blue" title '', \ '`$line' using 1:2 with line lw 2 lc rgb "red" title '' "@ #we are using UTF8- usually a good idea with these applications. [System.IO.File]::WriteAllLines("$gpscriptsPath\$ScriptName",$TheScript, (New-Object System.Text.UTF8Encoding $False)) #now we simply execute it and admire the gif file containing the plot gnuplot "$gpscriptsPath\$ScriptName" [System.IO.File]::WriteAllLines("$gpscriptsPath\$ScriptName",$TheScript, (New-Object System.Text.UTF8Encoding $False)) |
Load comments