{"id":70558,"date":"2017-04-10T10:50:25","date_gmt":"2017-04-10T10:50:25","guid":{"rendered":"https:\/\/www.simple-talk.com\/?p=70558"},"modified":"2021-09-29T16:21:16","modified_gmt":"2021-09-29T16:21:16","slug":"statistics-sql-kendalls-tau-rank-correlation","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/statistics-sql-kendalls-tau-rank-correlation\/","title":{"rendered":"Statistics in SQL: Kendall\u2019s Tau Rank Correlation"},"content":{"rendered":"<h4>The series so far:<\/h4>\n<ol>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-pearsons-correlation\/\">Statistics in SQL: Pearson\u2019s Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kendalls-tau-rank-correlation\/\">Statistics in SQL: Kendall\u2019s Tau Rank Correlation<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-simple-linear-regressions\/\">Statistics in SQL: Simple Linear Regressions<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-kruskal-wallis-test\/\">Statistics in SQL: The Kruskal\u2013Wallis Test<\/a><\/li>\n<li><a href=\"https:\/\/www.simple-talk.com\/blogs\/statistics-sql-mann-whitney-u-test\/\">Statistics in SQL:  The Mann\u2013Whitney U Test<\/a><\/li>\n<li><a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/statistics-sql-students-t-test\/\">Statistics in SQL:  Student's T Test<\/a><\/li>\n<\/ol>\n\n<p>Kendall\u2019s Tau rank correlation is a handy way of determining how correlated two variables are, and whether this is more than chance. If you just want a measure of the correlation then you don\u2019t have to assume very much about the distribution of the variables. Kendall\u2019s Tau is popular with calculating correlations with non-parametric data. Spearman\u2019s Rho is possibly more popular for the purpose, but Kendall\u2019s tau has a distribution with better statistical properties (the sample estimate is close to a population variance) so confidence levels are more reliable, but in general, Kendall\u2019s tau and Spearman\u2019s rank correlation coefficient are very similar. The obvious difference between them is that, for the standard method of calculation, \u00a0Spearman&#8217;s Rank correlation required ranked data as input, whereas the algorithm to calculate Kendall&#8217;s Tau does this for you. \u00a0Kendall&#8217;s Tau consumes any non-parametric data with equal relish.<\/p>\n<p>Kendall\u2019s Tau is easy to calculate on paper, and makes intuitive sense. It deals with the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs of rankings. All observations are paired with each of the others, A concordant pair is one whose members of one observation are both larger than their respective members of the other paired observation, whereas discordant pairs have numbers that differ in opposite directions. Kendall\u2019s Tau-b takes tied rankings into account.<\/p>\n<p>So, let\u2019s imagine that we ask two observers to rank ten glasses of Loire valley wine in order. (These figures aren\u2019t real since I prefer Vouvray and Gros Plant du Pays Nantais)<\/p>\n<table>\n<tbody>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p><strong>Wine<\/strong><\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p><strong>Phil<\/strong><\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p><strong>Tony<\/strong><\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Menetou-Salon<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>1<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>3<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Muscadet-S\u00e8vre et Maine<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>2<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>4<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Quincy<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>3<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>2<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Reuilly<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>4<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>5<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Muscadet-Coteaux de la Loire<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>5<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>11<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Chinon<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>6<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>6<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Muscadet-C\u00f4tes de Grand Lieu<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>7<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>12<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Vouvray<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>8<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>9<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Tourraine<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>9<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>8<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Anjou-Saumur<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>10<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>10<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Sancerre<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>11<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>1<\/p>\n<\/td>\n<\/tr>\n<tr>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>Pouilly-Fum\u00e9<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>12<\/p>\n<\/td>\n<td style=\"padding: 0 5px 0 5px;\">\n<p>7<\/p>\n<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<p>We\u2019ve sorted the rankings in terms of my choices, so we can compare the two. In SQL, we do this anyway.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true\">DECLARE @OurData TABLE\r\n    (\r\n    row INT IDENTITY PRIMARY KEY,\r\n    x int NOT NULL,\r\n    y int NOT NULL\r\n    );\r\n  INSERT INTO @OurData\r\n    (x, y)\r\n  SELECT \r\n   x,y\r\n   FROM (VALUES\r\n  (1,3), --Menetou-Salon \r\n  (2,4), --Muscadet-S\u00e8vre et Maine\r\n  (3,2), --Quincy \r\n  (4,5), --Reuilly \r\n  (5,11), --Muscadet-Coteaux de la Loire\r\n  (6,6), --Chinon\r\n  (7,12), --Muscadet-C\u00f4tes de Grand Lieu\r\n  (8,9), --Vouvray\r\n  (9,8), --Tourraine\r\n  (10,10), --Anjou-Saumur\r\n  (11,1), --Sancerre\r\n  (12,7) --Pouilly-Fum\u00e9\r\n  )f(x,y)\r\n  ORDER BY x\r\n\r\n  SELECT --Kendall's rank correlation sample estimate \u03c4\r\n  \t   CONVERT(NUMERIC(8,2),(SUM(CASE WHEN (i.x &lt; j.x AND i.y &lt; j.y) OR (i.x &gt; j.x AND i.y &gt; j.y) THEN 1 ELSE 0 END)) -- concordant\r\n  \t   - SUM(CASE WHEN (i.x &lt; j.x AND i.y &gt; j.y) OR (i.x &gt; j.x AND i.y &lt; j.y) THEN 1 ELSE 0 END)) -- discordant\r\n  \t   \/COUNT(*) AS Tau\r\n   FROM  @OurData i CROSS JOIN @OurData j\r\n  WHERE i.row&lt;&gt;j.row\r\n  \/*\r\n  # which we can confirm in R. \r\n  x &lt;- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12) \r\n  y &lt;- c(3, 4, 2, 5, 11, 6, 12, 9, 8, 10, 1, 7) \r\n  cor.test(x, y, method = \"kendall\", alternative = \"two.sided\") \r\n  Kendall's rank correlation sample estimate \u03c4= : 0.242424 \r\n  alternate hypothesis: true \u03c4 \u22600 \r\n  T -statistic: 41.0 \r\n  p-value: 0.310810 \r\n  *\/<\/pre>\n<p>\u2026and that gives us<\/p>\n<pre>Tau\r\n  0.2424242424242\r\n  (1 row(s) affected)<\/pre>\n<p>Notice that we are not concerned with the individual wines, only with the extent to which the two judges agree in their subjective rankings of the wines.<\/p>\n<p>Using the same data, we can also calculate Spearman&#8217;s Rho.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">SELECT 1 - (6 * SUM(POWER(x - y, 2)))\r\n\/ CONVERT(NUMERIC(8, 2), (COUNT(*) * (POWER(COUNT(*), 2) - 1))) AS spearman_rho\r\nFROM @OurData;<\/pre>\n<p>which gives<\/p>\n<pre>spearman_rho\r\n0.307692308\r\n\r\n(1 row(s) affected)<\/pre>\n<p>You can convince yourself that the calculation for Kendall&#8217;s Rho doesn&#8217;t \u00a0need ranked data. You can pretend that, instead of ranking their wines in order, Tony and I scored each wine on a zany scale between 1 and 100. Here is this un-ranked data.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true \">SELECT \r\nx,y\r\nFROM (VALUES\r\n(5,20), --Menetou-Salon \r\n(13,28), --Muscadet-S\u00e8vre et Maine\r\n(23,12), --Quincy \r\n(28,33), --Reuilly \r\n(38,75), --Muscadet-Coteaux de la Loire\r\n(41,43), --Chinon\r\n(50,82), --Muscadet-C\u00f4tes de Grand Lieu\r\n(56,61), --Vouvray\r\n(61,56), --Tourraine\r\n(71,71), --Anjou-Saumur\r\n(80,6), --Sancerre\r\n(81,50) --Pouilly-Fum\u00e9\r\n)f(x,y)\r\nORDER BY x<\/pre>\n<p>You&#8217;ll see that, with this un-ranked data, Kendall&#8217;s Tau gives precisely the same result.<\/p>\n<p>Now what if we couldn\u2019t really distinguish some of the wines and gave them tied rankings? What if I put\u00a0Menetou-Salon and Muscadet-S\u00e8vre et Maine first in my ranking? Well, Spearman&#8217;s rho wouldn&#8217;t work. We now have to use a variety of Kendall&#8217;s tau called Tau-B. Kendall\u2019s Tau is pretty resilient to ties, but Tau-B is better where there are ties.<\/p>\n<pre class=\"theme:ssms2012-simple-talk lang:tsql decode:true  \">DECLARE @OurData TABLE\r\n    (\r\n    row INT IDENTITY PRIMARY KEY,\r\n    x INT NOT NULL,\r\n    y INT NOT NULL\r\n    );\r\n  INSERT INTO @OurData\r\n    (x, y)\r\n    SELECT f.x, f.y\r\n    FROM\r\n      (VALUES\r\n         (1, 3), --Menetou-Salon \r\n         (1, 4), --Muscadet-S\u00e8vre et Maine\r\n         (3, 2), --Quincy \r\n         (3, 4), --Reuilly \r\n         (3, 10), --Muscadet-Coteaux de la Loire\r\n         (6, 6), --Chinon\r\n         (7, 10), --Muscadet-C\u00f4tes de Grand Lieu\r\n         (7, 9), --Vouvray\r\n         (8, 8), --Tourraine\r\n         (10, 10), --Anjou-Saumur\r\n         (11, 1), --Sancerre\r\n         (1, 7) --Pouilly-Fum\u00e9\r\n      ) f(x, y)\r\n    ORDER BY f.x;\r\n  DECLARE @n1 NUMERIC(18, 6), @n2 NUMERIC(18, 6), @n0 NUMERIC(18, 6),\r\n    @nc NUMERIC(18, 6), @nd NUMERIC(18, 6);\r\n  SELECT @n1 = COALESCE(SUM(ties.t * (ties.t - 1)) \/ 2, 0)\r\n    FROM\r\n      (SELECT COUNT(*) AS t\r\n         FROM @OurData\r\n         GROUP BY [@OurData].x\r\n         HAVING COUNT(*) &gt; 1\r\n      ) ties;\r\n  SELECT @n2 = COALESCE(SUM(ties.t * (ties.t - 1)) \/ 2, 0)\r\n    FROM\r\n      (SELECT COUNT(*) AS t\r\n         FROM @OurData\r\n         GROUP BY [@OurData].y\r\n         HAVING COUNT(*) &gt; 1\r\n      ) ties;\r\n  SELECT @n0 = COUNT(*),\r\n  \t   @nc = SUM(CASE WHEN (i.x &lt; j.x AND i.y &lt; j.y) OR (i.x &gt; j.x AND i.y &gt; j.y) THEN 1 ELSE 0 END), -- concordant\r\n  \t   @nd =  SUM(CASE WHEN (i.x &lt; j.x AND i.y &gt; j.y) OR (i.x &gt; j.x AND i.y &lt; j.y) THEN 1 ELSE 0 END) -- discordant\r\n    FROM @OurData i\r\n      CROSS JOIN @OurData j\r\n    WHERE i.row &lt;&gt; j.row;\r\n  SELECT CONVERT(NUMERIC(8, 2), (@nc - @nd) \/ SQRT((@n0 - @n1) * (@n0 - @n2))) AS 'Tau (\u03c4)';<\/pre>\n<p>\u2026 Which gives\u2026<\/p>\n<pre>Tau (\u03c4)\r\n  0.21<\/pre>\n<p>So there we have it. If your data really can\u2019t be called parametric, then Kendall\u2019s Tau is pretty useful. Your data doesn\u2019t have to be rank orders, the statistic will consume all manner of data. Data scientists like it because it irons out those pesky outliers that chance throws at you. I\u2019m wary of using it without understanding the data properly, because it is easy to draw unwarranted conclusions from it, but otherwise it is a useful inference tool.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Statistical calculations in SQL are often perfectly easy to do. SQL was designed to be a natural fit for calculating correlation, regression and variance on large quantities of data. It just isn&#8217;t always immediately obvious how.   In the second of a series of articles, Phil factor shows how calculating a non-parametric correlation via Kendall&#8217;s Tau or Spearman&#8217;s Rho can be stress-free.&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143528,2,143531],"tags":[43637],"coauthors":[6813],"class_list":["post-70558","post","type-post","status-publish","format-standard","hentry","category-bi-sql-server","category-blogs","category-t-sql-programming-sql-server","tag-statisticsinsql"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70558","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=70558"}],"version-history":[{"count":12,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70558\/revisions"}],"predecessor-version":[{"id":71971,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/70558\/revisions\/71971"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=70558"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=70558"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=70558"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=70558"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}