{"id":83759,"date":"2019-04-05T02:03:21","date_gmt":"2019-04-05T02:03:21","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=83759"},"modified":"2019-04-05T02:03:21","modified_gmt":"2019-04-05T02:03:21","slug":"finding-a-character-in-every-character-string-by-elimination","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/other\/finding-a-character-in-every-character-string-by-elimination\/","title":{"rendered":"Finding a character in every character string by elimination"},"content":{"rendered":"<p>A few weeks ago, a coworker had a scenario where he was trying to import a large amount of data into some format (I think it was JSON, but I think it was a special limitation&#8230; that isn&#8217;t really important to the solution), and it didn&#8217;t allow certain characters. Narrowing down what the bad character was, was proving to be a hassle.<\/p>\n<p>This reminded me of a solution that I had used back when I first had gotten into the whole numbers table thing. Using a number&#8217;s table, you can break down every value in a column into one character per row, and then do some analysis. In our case, I want to find a bad character, by eliminating known acceptable characters.<\/p>\n<p>For an example, I am going to take a fresh version of the WideWorldImporters database, and look at their Application.People table. The Fullname column contains characters that are not in the US centric characters, and can cause some, poorly written, pieces of software to fail.<\/p>\n<p>To start with, we will create a number table, with basically contains 1 row for every number from 0 up. We will load it up to 999999, not that our strings will be that long, but this is a typical good size for most uses of a number table, and you can increase that pretty simply by adjusting the query.<\/p>\n<pre class=\"lang:none theme:none\">USE WideWorldImporters;\r\nGO\r\nCREATE SCHEMA Tools;\r\nGO\r\nCREATE TABLE Tools.Number\r\n(\r\n    I   int CONSTRAINT PKTools_Number PRIMARY KEY\r\n);\r\nGO\r\n;WITH digits (I) AS (--set up a set of numbers from 0-9\r\n        SELECT I\r\n        FROM   (VALUES (0),(1),(2),(3),(4),(5),\r\n                       (6),(7),(8),(9)) AS digits (I))\r\n--builds a set of data from from 0 to 999999\r\n,Integers (I) AS (\r\n        SELECT D1.I + (10*D2.I) + (100*D3.I) + \r\n               (1000*D4.I) + (10000*D5.I) + (100000*D6.I)\r\n        FROM digits AS D1 CROSS JOIN digits AS D2 \r\n             CROSS JOIN digits AS D3\r\n             CROSS JOIN digits AS D4 \r\n             CROSS JOIN digits AS D5\r\n             CROSS JOIN digits AS D6 )\r\n--insert into table\r\nINSERT INTO Tools.Number(I)\r\nSELECT I\r\nFROM   Integers;<\/pre>\n<p>Check out the data exists:<\/p>\n<pre class=\"lang:none theme:none\">SELECT *\r\nFROM   Tools.Number\r\nORDER  BY I;<\/pre>\n<p>If the output isn&#8217;t 1000000 rows, starting at 0 and ending at 999999, then something is incorrect in your configuration. Otherwise, let&#8217;s move on to the using this data.<\/p>\n<p>Now, the idea is that we will join the Application.People table to the Numbers table for a number of rows. We will do this for all of the numbers that are from 1 to the length of the name. Then use that value to get the substring of the value for that 1 character. I also include the Unicode value in the output to allow for some case sensitive operations, since UNICODE(&#8216;a&#8217;) &lt;&gt; UNICODE(&#8216;A&#8217;).<\/p>\n<pre class=\"lang:none theme:none \">SELECT People.FullName, Number.I AS position,\r\n       SUBSTRING(People.FullName,Number.I,1) AS [char],\r\n       UNICODE(SUBSTRING(People.FullName, Number.I,1)) \r\n                                                AS [Unicode],\r\n       CASE WHEN i = LEN(People.FullName) \r\n            THEN 1 ELSE 0 END AS last_character_flag\r\nFROM   Application.People AS People\r\n         JOIN Tools.Number\r\n               ON Number.I BETWEEN 1 AND LEN(People.FullName)\r\nORDER  BY FullName;<\/pre>\n<p>The output will have the full name repeated in the first column, then have the character position in the second, along with the character and Unicode output of the value (and an indicator if it is the last character of the string.) Now we can simply use a NOT LIKE expression against our single character, eliminating from contention the numbers (0-9) and letters (a-z), space, -, and a single quote (my database is case sensitive, adjust if you are dealing with case sensitive character sets):<\/p>\n<pre class=\"lang:none theme:none\">SELECT People.FullName, Number.I AS position,\r\n       SUBSTRING(People.FullName,Number.I,1) AS [char],\r\n       UNICODE(SUBSTRING(People.FullName, Number.I,1)) \r\n                                                AS [Unicode],\r\n \t  CASE WHEN i = LEN(People.FullName) \r\n                     THEN 1 ELSE 0 END AS last_character_flag\r\nFROM   Application.People AS People\r\n         JOIN Tools.Number\r\n               ON Number.I BETWEEN 1 AND LEN(People.FullName)\r\nWHERE  SUBSTRING(People.FullName,Number.I,1) \r\n        NOT LIKE '[- a-z0-9'']' --LIKE expr on the single char\r\nORDER  BY FullName;<\/pre>\n<p>What you will see is a set of data that has the characters that are not our like (which may actully mean the fullname value is duplicated) This may look like the following (This being output from <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/azure-data-studio\/download?view=sql-server-2017\">Azure Data Studio<\/a>), depending of course on your client, and if this looks ok in your regional dialect, but unicode 129 appears to be a control character: <a href=\"https:\/\/www.compart.com\/en\/unicode\/U+0081\">https:\/\/www.compart.com\/en\/unicode\/U+0081<\/a>):<\/p>\n<p><img loading=\"lazy\" decoding=\"async\" width=\"644\" height=\"520\" class=\"wp-image-83760\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2019\/04\/word-image.png\" \/><\/p>\n<p>I believe these characters are there just to make it fun to use the WideWorldImporters database, but of course, this is a good thing when you are testing\/demonstrating software.<\/p>\n<p>Now, once you have this base set of data, with the characters now each on their own row to work with, you can use this query in a CTE and look at some &#8220;interesting&#8221; details about the characters in a set.<\/p>\n<p>For example, if you want to know which characters (ignoring case, in a case sensitive db) are used the most (and least) in a column (for some reason):<\/p>\n<pre class=\"lang:none theme:none\">WITH CharacterRows AS (\r\nSELECT People.FullName, Number.I AS position,\r\n       SUBSTRING(People.FullName,Number.I,1) AS [char],\r\n       UNICODE(SUBSTRING(People.FullName, Number.I,1)) \r\n                                           AS [Unicode],\r\n  CASE WHEN i = LEN(People.FullName) THEN 1 ELSE 0 END \r\n                                 AS last_character_flag\r\nFROM   Application.People AS People\r\n         JOIN Tools.Number\r\n               ON Number.I BETWEEN 1 AND LEN(People.FullName)\r\n)\r\nSELECT char, COUNT(*)\r\nFROM CharacterRows\r\nGROUP BY char\r\nORDER BY COUNT(*) DESC;<\/pre>\n<p>Or, if you want to see it case sensitive, you can use the Unicode value of the character:<\/p>\n<pre class=\"lang:none theme:none\">WITH CharacterRows AS (\r\nSELECT People.FullName, Number.I AS position,\r\n       SUBSTRING(People.FullName,Number.I,1) AS [char],\r\n       UNICODE(SUBSTRING(People.FullName, Number.I,1)) AS [Unicode],\r\n  CASE WHEN i = LEN(People.FullName) THEN 1 ELSE 0 END \r\n                                 AS last_character_flag\r\nFROM   Application.People AS People\r\n         JOIN Tools.Number\r\n               ON Number.I BETWEEN 1 AND LEN(People.FullName)\r\n)\r\nSELECT Unicode, MAX(char), COUNT(*)\r\nFROM CharacterRows\r\nGROUP BY Unicode\r\nORDER BY COUNT(*) DESC;<\/pre>\n<p>What is the most used character in the English language? The Internet tells me that it is e, but in the People.FullName column, it is the lowercase letter a, by a large margin.<\/p>\n<p>As for performance, I have use the version of this query to search through rows with millions of rows, with an average of 20 or so characters per value with very good performance. Aggregating the number of characters that were used in the strings was, on my Surface 4 laptop, notably laggy, taking around 10 seconds as opposed to sub second response to return the 13 rows using the characters that were not in the typical set of 26 letters, 10 numbers, &#8211; and space. Your mileage certainly may vary, so test, test, test.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>A few weeks ago, a coworker had a scenario where he was trying to import a large amount of data into some format (I think it was JSON, but I think it was a special limitation&#8230; that isn&#8217;t really important to the solution), and it didn&#8217;t allow certain characters. Narrowing down what the bad character&#8230;&hellip;<\/p>\n","protected":false},"author":56085,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[19684],"class_list":["post-83759","post","type-post","status-publish","format-standard","hentry","category-other"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/83759","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/56085"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=83759"}],"version-history":[{"count":2,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/83759\/revisions"}],"predecessor-version":[{"id":83762,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/83759\/revisions\/83762"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=83759"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=83759"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=83759"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=83759"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}