{"id":7699,"date":"2015-01-09T15:02:59","date_gmt":"2015-01-09T15:02:59","guid":{"rendered":"https:\/\/test.simple-talk.com\/uncategorized\/string-comparisons-in-sql-the-longest-common-subsequence\/"},"modified":"2017-10-30T16:03:21","modified_gmt":"2017-10-30T16:03:21","slug":"string-comparisons-in-sql-the-longest-common-subsequence","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/blogs\/string-comparisons-in-sql-the-longest-common-subsequence\/","title":{"rendered":"String Comparisons in SQL: The Longest Common Subsequence"},"content":{"rendered":"<div id=\"pretty\">\n<p>Relational databases aren&#8217;t really designed to deal easily with arbitrary sequence, though this is improving with the window functions. Strings and text are sequences. Lists are often sequenced. \u00a0If you hear people describe an entity such as an invoice in terms of its ordinal sequence &#8216;the first invoice&#8217; or\u00a0 &#8216;the fourth invoice&#8217;, then you know that it is.<\/p>\n<p>Let&#8217;s take a typical &#8216;sequence&#8217; problem that comes up in science, and occasionally in commerce. We&#8217;ll use strings for our example, but they could be words in text, tree-ring measurements, genes, or even crime statistics.\u00a0<\/p>\n<p>We have two strings, and we want to determine the longest sequence of characters contained the first string that are present in the same order in a second string.\u00a0 Unlike the &#8216;Longest Common Substring&#8217; problem, we are not specifying that they must be adjacent in either string.\u00a0<\/p>\n<p>This looks easy, one would think. Not so fast; it is a minefield. If you take away\u00a0 the requirement that the characters need to be adjacent, then the number of alternative solutions become much greater.<\/p>\n<p class=\"MsoNormal\">Take the classic sample strings &#8216;thisisatest&#8217; and &#8216;testing123testing&#8217;. If you just choose the first available matches in common you get the wrong solution. Tisi, \u00a0you might get &#8216;test&#8217;\u00a0 but there is a longer common substring &#8216;tsitest&#8217;. You can just do the brute-force approach of finding all common subsequences and choosing the best, but believe me it doesn&#8217;t scale! The best approach I know of is the dynamic programming solution. (see also the <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hirschberg's_algorithm\">Hirschberg algorithm<\/a> and the &#8216;greedy algorithm&#8217;) \u00a0This \u00a0has been described <a href=\"http:\/\/en.wikipedia.org\/wiki\/Longest_common_subsequence_problem\">in great length elsewhere<\/a>, even on <a href=\"https:\/\/www.youtube.com\/watch?v=P-mMvhfJhu8\">Youtube<\/a>. There are a <a href=\"http:\/\/rosettacode.org\/wiki\/Longest_common_subsequence\">number of solutions for different languages<\/a>, and even in SQL, but I thought I&#8217;d do my own, based on a number table.<\/p>\n<p>We&#8217;ll leave out the number table build, since I&#8217;ve already posted that in a previous blog<\/p>\n<pre class=\"theme:ssms2012 lang:tsql\">IF OBJECT_ID (N'LongestCommonSubsequence') IS NOT NULL\r\n\u00a0\u00a0 DROP FUNCTION LongestCommonSubsequence\r\nGO\r\n\r\nCreate FUNCTION LongestCommonSubsequence\r\n\/**\r\nsummary:\u00a0\u00a0 &gt;\r\n\u00a0The longest common subsequence (LCS) problem is the problem of finding the\r\n\u00a0longest subsequence common to all sequences in two sequences. It differs\r\n\u00a0from problems of finding common substrings: unlike substrings, subsequences\r\n\u00a0are not required to occupy consecutive positions within the original\r\n\u00a0sequences. For example, the sequences \"1234\" and \"1224533324\" have an LCS\r\n\u00a0of \"1234\":\r\nAuthor: Phil Factor\r\nRevision: 1.0\r\ndate: 05 Dec 2014\r\nexample:\r\n\u00a0code: |\r\n\u00a0\u00a0\u00a0\u00a0 Select dbo.LongestCommonSubsequence ('1234', '1224533324')\r\n\u00a0\u00a0\u00a0\u00a0 Select dbo.LongestCommonSubsequence ('thisisatest', 'testing123testing')\r\n\u00a0\u00a0\u00a0\u00a0 Select dbo.LongestCommonSubsequence ( 'XMJYAUZ', 'MZJAWXU') \r\n\u00a0\u00a0\u00a0\u00a0 Select dbo.LongestCommonSubsequence ( 'beginning-middle-ending',\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 'beginning-diddle-dum-ending')\r\nreturns:\u00a0\u00a0 &gt;\r\n\u00a0 the longest common subsequence as a string\r\n**\/\u00a0\u00a0\u00a0 \r\n(\r\n\u00a0@firstString Varchar(max),\r\n\u00a0@SecondString Varchar(max)\r\n)\r\nRETURNS varchar(max)\r\nas begin\r\n\r\nDeclare @Array Varchar(MAX)\r\nDeclare @ArrayMax int\r\nDeclare @west char(1)\r\nDeclare @Lines Varchar(max)\r\nDeclare @ii int, @jj int, @iiMax int, @jjMax int, @index int\r\n\r\nSelect @iiMax=len(@FirstString), --the length of the first string\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 @jjMax=len(@SecondString), --the length of the second string\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 @index=@jjMax+1 --where the first new item (matrix has an extra row &amp; column)\r\nSelect @ArrayMax=(@iiMax)*(@jjMax)--total length of the array iterated \r\nSelect @Array=replicate(char(0), @jjMax+1) --add in the first nought-filled row\r\n\r\nSelect \r\n\u00a0 @Index=@Index+case when (number-1) % @jjMax = 0 then 2 else 1 end, --current index\r\n\u00a0 @west=case when (number-1) % @jjMax = 0 then Char(0) else substring(@Array,@index-1,1) end,\r\n\u00a0 --as we add an extra column in, this is a dangerous operation otherwise\r\n\u00a0 @Array=@Array --we add a character to the string array, Char(0)--char\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0+ case when (number-1) % @jjMax = 0 then Char(0) else '' end -- begin new row always zero\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0+ case --check first to see if there was a match. If so, take north west number incremented\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 when substring(@firstString,((Number-1)\/@jjMax)+1,1)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 =substring(@SecondString, ((number-1)% @jjMax)+1 ,1)\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 \u00a0\u00a0\u00a0\u00a0 --remember to set the appropriate collation for this comparison!\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 then\u00a0 Char(Ascii(substring(@Array,@index-@jjmax-2,1))+1)--increment the NW value \r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 when Ascii(substring(@Array,@index-@jjmax-1,1))&gt;ascii(@west) --compare west to north\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 then substring(@Array,@index-@jjmax-1,1) --and take the larger\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 else @west-- number to west if larger\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 end\r\nfrom numbers where number&lt;=@ArrayMax\r\n\r\n--Now all we need to do is to backtrack through the matrix to find the best solution\r\nDeclare @commonString Varchar(max), @X_Y int\r\nSelect @CommonString ='' --this contains the string\r\nSelect @ii=@iimax+1,@jj=@jjmax+1\r\nwhile (@ii&gt;1 and @jj&gt;1) \r\n\u00a0 begin\r\n\u00a0 Select @X_Y = ((@ii-1)*(@jjMax+1))+@jj --the current matrix location\r\n\u00a0 if (substring(@firststring,@ii-1,1) = substring(@Secondstring,@jj-1,1))\r\n\u00a0\u00a0\u00a0\u00a0\u00a0\u00a0 BEGIN --if there is a match, add the character (we'll reverse it later)\r\n\u00a0\u00a0\u00a0 Select @CommonString=@CommonString+substring(@firststring,@ii-1,1)\r\n\u00a0\u00a0\u00a0 select @jj=@jj-1, @ii=@ii-1 --go north-west\r\n\u00a0\u00a0\u00a0 end\r\n\u00a0 else if ascii(Substring(@Array,@X_Y,1)) = ascii(Substring(@Array, @X_Y-@jjMax-1, 1))\r\n\u00a0\u00a0\u00a0 select @ii=@ii-1\r\n\u00a0 ELSE if ascii(Substring(@Array,@X_Y,1)) = ascii(Substring(@Array, @X_Y-1, 1))\r\n\u00a0\u00a0\u00a0 select @jj=@jj-1\r\n\u00a0 else break\u00a0 \r\n\u00a0 if @@error&gt;0 break\u00a0\u00a0 \r\n\u00a0 end\r\n\u00a0return\u00a0 Reverse(@CommonString)\r\n\u00a0end\r\ngo\r\n\r\nDeclare @timing datetime Select @Timing=GetDate()\r\n\r\nif dbo.LongestCommonSubsequence ('1234', '1224533324')&lt;&gt;'1234' \r\n\u00a0\u00a0 raiserror('test 1 failed',16,1)\r\nif dbo.LongestCommonSubsequence ('thisisatest', 'testing123testing')&lt;&gt;'tsitest' \r\n\u00a0\u00a0 raiserror('test 2 failed',16,1)\r\nif dbo.LongestCommonSubsequence ('XMJYAUZ', 'MZJAWXU')&lt;&gt;'MJAU'\r\n\u00a0\u00a0 raiserror('test 3 failed',16,1)\r\nif dbo.LongestCommonSubsequence ('yab', 'xabyrbyab')&lt;&gt;'yab'\r\n\u00a0\u00a0 raiserror('test 4 failed',16,1)\r\nif dbo.LongestCommonSubsequence ('beginning-middle-ending','beginning-diddle-dum-ending')\r\n\u00a0\u00a0 &lt;&gt;'beginning-iddle-ending' raiserror('test 5 failed',16,1)\r\n\r\nselect datediff(millisecond,@timing,GetDate()) as milliseconds\r\n\r\n<\/pre>\n<p>There are some very ingenious solutions in Oracle and some good purely iterative SQL Solutions. I&#8217;m certain that there is a good solution using window functions but I thought I ought to leave something for the readers of this blog!<\/p>\n<\/div>\n","protected":false},"excerpt":{"rendered":"<p>Relational databases aren&#8217;t really designed to deal easily with arbitrary sequence, though this is improving with the window functions. Strings and text are sequences. Lists are often sequenced. \u00a0If you hear people describe an entity such as an invoice in terms of its ordinal sequence &#8216;the first invoice&#8217; or\u00a0 &#8216;the fourth invoice&#8217;, then you know&#8230;&hellip;<\/p>\n","protected":false},"author":154613,"featured_media":0,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[2],"tags":[],"coauthors":[6813],"class_list":["post-7699","post","type-post","status-publish","format-standard","hentry","category-blogs"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7699","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/154613"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=7699"}],"version-history":[{"count":10,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7699\/revisions"}],"predecessor-version":[{"id":75224,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/7699\/revisions\/75224"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=7699"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=7699"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=7699"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=7699"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}