{"id":80289,"date":"2018-08-08T12:00:52","date_gmt":"2018-08-08T12:00:52","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=80289"},"modified":"2019-01-22T16:51:08","modified_gmt":"2019-01-22T16:51:08","slug":"finding-duplicated-data-in-a-case-insensitive-column","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/finding-duplicated-data-in-a-case-insensitive-column\/","title":{"rendered":"Finding duplicated data in a case insensitive column"},"content":{"rendered":"<p>The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table&#8217;s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let&#8217;s just go ahead and say it. &#8220;This was a sucky implementation.&#8221; But as is common, in my typical role as a data architect in the data warehousing team, I get to learn all sorts of interesting techniques for finding and dealing with &#8220;data&#8221; that has been used in &#8220;interesting&#8221; ways.<\/p>\n<p>What is kind of interesting is actually figuring out what that duplicated data was. The case that I was dealing with wasn&#8217;t the kind of useful packed surrogate value where you might use a base 62 number, with a-z, A-Z and 0-9 as characters: 1, 2, &#8230;, 9, 0, a, b, c, &#8230;, x, y, z, A, B, and so on. In that sequence, 1A1 is a different value than 1a1, and is greater. It is a neat technique, and one that I have been threatening to develop using a SEQUENCE object, since it lets you pack a lot of sequential data into a small number of bytes. No, this wasn&#8217;t a useful case like that. In this case, one value was lower case and another had leading capitals, so perhaps &#8220;active customer&#8221; and &#8220;Active Customer&#8221;. 
Yeah, seriously, they meant different things.<\/p>\n<p><em>Note:\u00a0The query I will use will help to find the case variations of a value that you have in your data (like &#8220;United States&#8221; and &#8220;UNITED STATES&#8221;, for example). Hence this is not a completely esoteric exercise.<\/em><\/p>\n<p>To find the data, I figured I would use a case sensitive collation, but it isn&#8217;t as straightforward as grouping on the data using a collation. For a sample set of data, I will use the following:<\/p>\n<pre>DROP TABLE IF EXISTS #Color;\r\n --Note, because of the COLLATE argument, you must alias the column \r\nSELECT ColorID, ColorName COLLATE Latin1_General_CI_AS AS ColorName--Make sure column is case insensitive, no matter what your platform\r\nINTO #Color\r\nFROM WideWorldImporters.Warehouse.Colors;\r\n\r\n--Add lowercase copies of the first 3 colors; ORDER BY makes TOP deterministic\r\nINSERT INTO #Color(ColorID, ColorName)\r\nSELECT TOP 3 ColorID + 36, LOWER(ColorName) AS ColorName\r\nFROM #Color\r\nORDER BY ColorID;<\/pre>\n<p>Now, because the column is case insensitive, a normal user can&#8217;t really tell the difference, but in this contrived set, we will have more than one row with the same color value (in my real example, there were thousands).<\/p>\n<pre>SELECT ColorName, COUNT(*)\r\nFROM #Color\r\nGROUP BY ColorName\r\nHAVING COUNT(*) &gt; 1;<\/pre>\n<p>This returns:<\/p>\n<p><code>ColorName <\/code><br \/>\n <code>-------------------- -----------<\/code><br \/>\n <code>Azure\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2<\/code><br \/>\n <code>beige\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2<\/code><br \/>\n <code>Black\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0 2<\/code><\/p>\n<p>Now, whether &#8220;beige&#8221; and &#8220;Beige&#8221; are, bizarrely enough, different things or just misformatted data, this is not what we want back.\u00a0We want to treat them as different, so the obvious first attempt is to group using a case sensitive collation (you need the COLLATE argument in the SELECT clause if you have it in the GROUP BY 
clause):<\/p>\n<pre>SELECT ColorName COLLATE Latin1_General_CS_AS, COUNT(*)\r\nFROM #Color\r\nGROUP BY ColorName COLLATE Latin1_General_CS_AS\r\nHAVING COUNT(*) &gt; 1;<\/pre>\n<p>This doesn&#8217;t return anything, because with a case sensitive grouping every variation lands in its own group of one row. Hmm. The trick is in the HAVING clause. Keep the GROUP BY case insensitive, and instead apply COUNT(DISTINCT &#8230;) to the column with the case sensitive collation. That gives us the items we want, but still just one row per value:<\/p>\n<pre>SELECT ColorName, COUNT(*)\r\nFROM #Color\r\nGROUP BY ColorName \r\nHAVING COUNT(DISTINCT ColorName COLLATE Latin1_General_CS_AS) &gt; 1;<\/pre>\n<p>The output is the same as we had previously, with a count of 2 for each color (though this time the values shown are all the lowercase items). However, the ColorName column returned by the SELECT clause is now case insensitive, so we can use it in a subquery to get all of the case insensitive duplicates:<\/p>\n<pre>SELECT *\r\nFROM #Color\r\nWHERE ColorName IN ( SELECT ColorName\r\n                     FROM #Color\r\n                     GROUP BY ColorName \r\n                     HAVING COUNT(DISTINCT ColorName COLLATE Latin1_General_CS_AS) &gt; 1\r\n                   );<\/pre>\n<p>This returns the 6 rows with the duplicates we are hunting for:<\/p>\n<p><code>ColorID\u00a0 \u00a0 \u00a0ColorName<\/code><br \/>\n <code>----------- --------------------<\/code><br \/>\n <code>1\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Azure<\/code><br \/>\n <code>2\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Beige<\/code><br \/>\n <code>3\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 \u00a0Black<\/code><br \/>\n <code>37\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 azure<\/code><br \/>\n <code>38\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 beige<\/code><br \/>\n <code>39\u00a0 \u00a0 \u00a0 \u00a0 \u00a0 black<\/code><\/p>\n<p>With this set, you could then group on the values using the case sensitive collation to see how many times each variation is used, such as:<\/p>\n<pre>SELECT ColorName COLLATE Latin1_General_CS_AS AS ColorName, COUNT(*) AS NumberOfUses\r\nFROM #Color\r\nWHERE ColorName IN ( SELECT ColorName\r\n                     
FROM #Color\r\n                     GROUP BY ColorName \r\n                     HAVING COUNT(DISTINCT ColorName COLLATE Latin1_General_CS_AS) &gt; 1\r\n                   )\r\nGROUP BY ColorName COLLATE Latin1_General_CS_AS;<\/pre>\n<p>Easy enough, but hopefully if you get stuck doing this some day, this will help you get going.\n <\/p>\n","protected":false},"excerpt":{"rendered":"<p>The other day, I had a problem with some data that I never dreamed I would ever see. In a case insensitive database, in a table&#8217;s column that was case insensitive, the customer was using the data as case sensitive. Firstly, let&#8217;s just go ahead and say it. &#8220;This was a sucky implementation.&#8221; But as&#8230;&hellip;<\/p>\n","protected":false},"author":56085,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[5134],"coauthors":[19684],"class_list":["post-80289","post","type-post","status-publish","format-standard","hentry","category-blogs","tag-sql-prompt"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80289","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/56085"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=80289"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80289\/revisions"}],"predecessor-version":[{"id":80352,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/80289\/revisions\/80352"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/med
ia?parent=80289"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=80289"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=80289"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=80289"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}