{"id":7671,"date":"2014-12-20T11:04:10","date_gmt":"2014-12-20T11:04:10","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/string-comparisons-in-sql-the-longest-common-substring\/"},"modified":"2017-10-30T16:05:01","modified_gmt":"2017-10-30T16:05:01","slug":"string-comparisons-in-sql-the-longest-common-substring","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/other\/string-comparisons-in-sql-the-longest-common-substring\/","title":{"rendered":"String Comparisons in SQL: The Longest Common Substring."},"content":{"rendered":"<div id=\"pretty\">\n<p>I&#8217;ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I&#8217;ve written one. I hope someone else finds it useful.<\/p>\n<p>SQL isn&#8217;t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (<a href=\"http:\/\/en.wikipedia.org\/wiki\/Inverted_index\">inverted indexes<\/a>), <a href=\"http:\/\/en.wikipedia.org\/wiki\/Suffix_tree\">suffix trees <\/a>\u00a0or <a href=\"http:\/\/en.wikipedia.org\/wiki\/Trie\">tries<\/a>\u00a0 so as to allow it to do exact comparisons it is very quick, but this isn&#8217;t usually possible because data changes so quickly. You can use the brute-force search through text using wildcards, but this isn&#8217;t usually appropriate when dealing with large results. \u00a0Of course, you can use add-in components such as <a href=\"https:\/\/docs.microsoft.com\/en-us\/sql\/relational-databases\/search\/full-text-search\">full-text search <\/a>\u00a0to do the job for you in SQL Server<\/p>\n<h4>What have we got that is useful for comparing\u00a0 text?<\/h4>\n<ul>\n<li>Is the first string contained in the second string?\n<pre class=\"theme:ssms2012 lang:tsql\">Select\u00a0 case when Patindex ('%first%','second') &gt;0 then 'true' else 'False' end \r\n-- useful with wildcard characters\r\n-- or \r\nSelect\u00a0 case when Charindex ('first','Second')&gt;0 then 'true' else 'False' end \r\n<\/pre>\n<\/li>\n<li>Are the strings identical?\n<pre class=\"theme:ssms2012 lang:tsql\">Select\u00a0 case when 'first'='Second' then 'true' else 'False' end\r\n<\/pre>\n<\/li>\n<li>Is the first string at the start of the second string? Charindex (&#8216;first&#8217;,&#8217;Second&#8217;)=1\n<pre class=\"theme:ssms2012 lang:tsql\">Select\u00a0 case when Charindex ('first','Second')=1 then 'true' else 'False' end \r\n<\/pre>\n<\/li>\n<li>Is the first string at the end of the second string?\n<pre class=\"theme:ssms2012 lang:tsql\">Select\u00a0 case when Patindex ('%first','second') &gt;0 then 'true' else 'False' end \r\n<\/pre>\n<\/li>\n<\/ul>\n<h4>What have we got that is just-about useless?<\/h4>\n<ul>\n<li>Soundex: \u00a0Difference() and Soundex()<\/li>\n<\/ul>\n<h4>What is missing? Here are the main ones.<\/h4>\n<ul>\n<li><b>Built-in Regex Searches<\/b> &#8211; CLR routines are fine, but it would be nice to have standardised built-in functions for doing this just as long as nobody expects them to be SARGable.<\/li>\n<li><b> <a href=\"http:\/\/en.wikipedia.org\/wiki\/Longest_common_subsequence_problem\">Longest common subsequence<\/a><\/b> &#8211; What is the longest sequence of characters in common between the two strings<\/li>\n<li><b> <a href=\"http:\/\/en.wikipedia.org\/wiki\/Longest_common_substring_problem\">Longest Common Substring<\/a><\/b> -what is the longest string in common between the two strings: The equivalent of the inner join<\/li>\n<li><b><a href=\"http:\/\/en.wikipedia.org\/wiki\/Levenshtein_distance\">Levenshtein distance<\/a><\/b> -how similar are the two strings? (how many edits are needed to get from one string to the other).<\/li>\n<\/ul>\n<p>There are several others, such as &#8216;Hamming Difference&#8217;, &#8216;<a href=\"http:\/\/en.wikipedia.org\/wiki\/Most_frequent_k_characters\">MostFreqKSDF<\/a>&#8216; or the Jaccard Index, for judging the similarity of text. Many of them have esoteric uses in genetics or pattern recognition, whereas we&#8217;re much more interested in chores such as data cleansing and input validation.<\/p>\n<p>Plenty of people would argue that we have no real reason to need anything complicated in terms of string comparison in SQL, particularly as it would be hard to scale it up to the size\u00a0 of\u00a0 relations we deal string comparison. There is some truth in this. Even the built-in string functions can be death to a query if used to filter results.<\/p>\n<p>In this blog, I&#8217;ll tackle the \u00a0<a href=\"http:\/\/en.wikipedia.org\/wiki\/Longest_common_substring_problem\">Longest Common Substring<\/a>, \u00a0which is a common, and useful, function that tells you the longest common substring between two strings.<\/p>\n<p>If you, for example, were to compare &#8216;And the Dish ran away with the Spoon&#8217; with &#8216;away&#8217;, you&#8217;d get &#8216;away&#8217; as being the string in common. Likewise, comparing &#8216;465932859472109683472&#8217; with &#8216;697834859472135348&#8217; would give you &#8216;8594721&#8217;<\/p>\n<p>With any SQL function, you have slow performance, but this does not entirely preclude their use.\u00a0 String comparisons tend to be slow by the nature of sequence being so important. In the case of string comparison, you can very quickly get bogged down unless you &#8216;think relationally&#8217;.\u00a0 My first attempt at this routine, copying the standard &#8216;dynamic&#8217; algorithm, comparing the first 1120 character paragraph of &#8216;Moby Dick&#8217; with itself, took 28 seconds.\u00a0 A quick change to a more SQL-based way of doing it shortened the comparison to two seconds.\u00a0 Even this is extravagant by C# standards but string comparisons never come cheap.<\/p>\n<p>The routine I ended up with isn&#8217;t portable beyond SQL Server or Sybase, since it uses &#8216;quirky update&#8217;. However, if anyone can come up with a more portable method that is as fast, I&#8217;d be delighted. I ended up deciding that not only would you want the string that matched, but where it was in both strings and how long it was. Come to think of it, if you are doing a DIFF between strings, then you&#8217;d want to know all the substrings, not just the longest one, and so I devised a table function that would give you that if required<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">--This will need a NUMBERS table, stocked with numbers. If you haven't got one\r\n--this will create it automatically \r\nIF NOT EXISTS (SELECT 1 FROM information_Schema.Tables\r\n\u00a0 WHERE table_name='Numbers')\r\n\u00a0 BEGIN \r\n\u00a0\u00a0\u00a0 CREATE TABLE [dbo].[Numbers]\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 (\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 [number] [int],\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED ([number] ASC)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 ON [PRIMARY]\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 )\r\n\u00a0\u00a0\u00a0 ON\u00a0 [PRIMARY] \r\n\u00a0 END\r\nIF NOT EXISTS(SELECT 1 FROM numbers WHERE number=99999)\r\nBEGIN\r\nTRUNCATE TABLE numbers\r\n\u00a0\u00a0\u00a0 ;WITH Digits(i) AS \r\n\u00a0\u00a0\u00a0\u00a0\u00a0 (SELECT i\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (0)) AS X(i))\r\n\u00a0\u00a0\u00a0\u00a0 INSERT INTO numbers(number)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 SELECT (D6.i*1000000 +D5.i*100000 + D4.i*10000 + D3.i * 1000 + D2.i * 100 \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 + D1.i * 10 + D0.i + 1) AS seq\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 FROM Digits AS D0, Digits AS D1, Digits AS D2, Digits AS D3, \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 Digits AS D4, Digits AS D5, Digits AS D6\r\n\u00a0\u00a0\u00a0 END\u00a0 \r\n\r\nIF OBJECT_ID (N'LongestCommonSubstring') IS NOT NULL\r\n\u00a0\u00a0 DROP FUNCTION LongestCommonSubstring\r\nGO\r\n\r\nCREATE FUNCTION LongestCommonSubstring\r\n\/**\r\nsummary:\u00a0\u00a0 &gt;\r\n\u00a0The longest common subSubstring (LCS) tells you the longest common substring between two strings.\r\n\t\u00a0If you, for example, were to compare 'And the Dish ran away with the Spoon' with 'away', you'd \r\n\t\u00a0get 'away' as being the string in common. Likewise, comparing '465932859472109683472' with \r\n\t\u00a0'697834859472135348' would give you\u00a0 '8594721'. This returns a one-row table that gives you the \r\n\t\u00a0length and location of the string as well as the string itself. It can easily be modified to give\r\n\t\u00a0 you all the substrings (whatever your criteria for the smallest substring. E.g. two characters? \r\n\u00a0\r\nAuthor: Phil Factor\r\nRevision: 1.0\r\ndate: 05 Dec 2014\r\nexample:\r\ncode: |\r\n\u00a0\u00a0\u00a0\u00a0 Select * from dbo.LongestCommonSubstring ('1234', '1224533324')\r\n\u00a0\u00a0\u00a0\u00a0 Select * from dbo.LongestCommonSubstring ('thisisatest', 'testing123testing')\r\n\u00a0\u00a0\u00a0\u00a0 Select * from dbo.LongestCommonSubstring ( 'findthishere', 'where is this?') \r\n\u00a0\u00a0\u00a0\u00a0 Select * from dbo.LongestCommonSubstring ( null, 'xab') \r\n\u00a0\u00a0\u00a0\u00a0 Select * from dbo.LongestCommonSubstring ( 'not beginning-middle-ending',\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 'beginning-diddle-dum-ending')\r\nreturns:\u00a0\u00a0 &gt;\r\n\u00a0 the longest common subString as a string\r\n**\/\u00a0\u00a0\u00a0 \r\n(\r\n@firstString VARCHAR(MAX),\r\n@SecondString VARCHAR(MAX)\r\n)\r\nRETURNS @hit TABLE --returns a single row table \r\n--(it is easy to change to return a string but I wanted the location of the match)\r\n(\r\nMatchLength INT,--the length of the match. Not necessarily the length of input \r\nFirstCharInMatch INT,--first character of match in first string\r\nFirstCharInString INT,--first character of match in second string\r\nCommonString VARCHAR(8000) --the part of the FirstString successfully matched\r\n)\r\n\r\nAS BEGIN\r\nDECLARE @Order INT, @TheGroup INT, @Sequential INT\r\n--this table is used to do a quirky update to enable a grouping only on sequential characters\r\nDECLARE\u00a0 @Scratch TABLE (TheRightOrder INT IDENTITY PRIMARY KEY,TheGroup smallint, Sequential INT,\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 FirstOrder smallint, SecondOrder smallint, ch CHAR(1))\r\n--first we reduce the amount of data to those characters in the first string that have a match \r\n--in the second, and where they were.\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \r\nINSERT INTO @Scratch ( TheGroup , firstorder,\u00a0 secondorder, ch)\r\n\u00a0\u00a0 SELECT Thefirst.number-TheSecond.number AS TheGroup,Thefirst.number, TheSecond.number, TheSecond.ch \r\n\u00a0\u00a0 FROM --divide up the first string into a table of characters\/sequence\r\n\u00a0\u00a0\u00a0 (SELECT number, SUBSTRING(@FirstString,number,1) AS ch\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 FROM numbers WHERE number &lt;= LEN(@FirstString)) TheFirst \r\n\u00a0\u00a0 INNER JOIN --divide up the second string into a table of characters\/sequence\r\n\u00a0\u00a0\u00a0 (SELECT number, SUBSTRING(@SecondString,number,1) AS ch\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 FROM numbers WHERE number &lt;= LEN(@SecondString))\u00a0 TheSecond\r\n\u00a0\u00a0 ON Thefirst.ch= Thesecond.ch --do all valid matches\r\n\u00a0\u00a0 ORDER BY Thefirst.number-TheSecond.number, TheSecond.number\r\n--now @scratch has all matches in the correct order for checking unbroken sequence\u00a0\u00a0 \r\nSELECT @Order=-1, @TheGroup=-1, @Sequential=0 --initialise everything\r\nUPDATE @Scratch --now check by incrementing a value every time a sequence is broken\r\n\u00a0 SET @Sequential=Sequential = \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 CASE --if it is not a sequence from the one before increment the variable\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 WHEN secondorder=@order+1 AND TheGroup=@TheGroup \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 THEN @Sequential ELSE @Sequential+1 END,\r\n\u00a0\u00a0 @Order=secondorder, \r\n\u00a0\u00a0 @TheGroup=TheGroup\r\n--now we just aggregate it, and choose the first longest match. Easy \u00a0 \r\nINSERT INTO @hit (MatchLength,FirstCharInMatch, FirstCharInString,CommonString)\r\nSELECT\u00a0 TOP 1 ---just the first. You may want more so feel free to change\r\n\u00a0\u00a0\u00a0 COUNT(*) AS MatchLength, \r\n\u00a0\u00a0\u00a0 MIN(firstorder) FirstCharInMatch,\r\n\u00a0\u00a0\u00a0 MIN(secondorder) AS FirstCharInString,\r\n\u00a0\u00a0\u00a0 SUBSTRING(@SecondString, \r\n\u00a0\u00a0\u00a0 MIN(secondorder), \r\n\u00a0\u00a0\u00a0 COUNT(*)) AS CommonString\r\n\u00a0 FROM @scratch GROUP BY TheGroup,Sequential \r\n\u00a0 ORDER BY COUNT(*) DESC, MIN(firstOrder) ASC, MIN(SecondOrder) ASC\r\nRETURN \r\nEND--and we do a test run\r\n\r\ngo\r\n\r\n--do an outer apply to check the obvious flaws and raise an error \r\n--if any erros appear.\r\nIF EXISTS (\r\n\u00a0 SELECT firstString, secondString,correct, LCS.*\r\n\u00a0 FROM (VALUES \r\n\u00a0\u00a0 ('Call me Ishmael. Some years ago...','Something','Some' ),\r\n\u00a0\u00a0 ('unrestfulness','having little or no money in my purse, and nothing particular to interest me on shore','rest' ),\r\n\u00a0\u00a0 ('1234563457','3456','3456' ),\r\n\u00a0\u00a0 ('','',NULL ),\r\n\u00a0\u00a0 (NULL,'',NULL ),\r\n\u00a0\u00a0 ('I find myself involuntarily pausing before coffin warehouses','Jailhouse rock','house'),\r\n\u00a0\u00a0 (',.-=dfgd%','-=','-='),\r\n\u00a0\u00a0\u00a0\u00a0\u00a0 ('protest is useless','I need to test this routine. Tests are valuable','test' )\r\n\u00a0\u00a0 ) \r\n\u00a0\u00a0 \r\n\u00a0\u00a0 AS X(FirstString,secondString, Correct)\r\n\u00a0 OUTER APPLY dbo.LongestCommonSubstring(firstString, secondString) AS LCS\r\n\u00a0 WHERE COALESCE(correct,'null')&lt;&gt;COALESCE(LCS.CommonString,'null')\r\n\u00a0 )\r\n\u00a0\u00a0\u00a0 RAISERROR ('the LongestCommonSubstring routine has broken',16,1)\u00a0\u00a0 \r\n<\/pre>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>I&#8217;ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I&#8217;ve written one. I hope someone else finds it useful. SQL isn&#8217;t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (inverted indexes), suffix trees \u00a0or&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[6813],"class_list":["post-7671","post","type-post","status-publish","format-standard","hentry","category-other"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7671","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=7671"}],"version-history":[{"count":9,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7671\/revisions"}],"predecessor-version":[{"id":75225,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7671\/revisions\/75225"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=7671"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=7671"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=7671"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=7671"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}