String Comparisons in SQL: The Longest Common Subsequence

Relational databases aren’t really designed to deal easily with arbitrary sequence, though this is improving with the window functions. Strings and text are sequences. Lists are often sequenced.  If you hear people describe an entity such as an invoice in terms of its ordinal sequence ‘the first invoice’ or  ‘the fourth invoice’, then you know that it is.

Let’s take a typical ‘sequence’ problem that comes up in science, and occasionally in commerce. We’ll use strings for our example, but they could be words in text, tree-ring measurements, genes, or even crime statistics. 

We have two strings, and we want to determine the longest sequence of characters contained the first string that are present in the same order in a second string.  Unlike the ‘Longest Common Substring’ problem, we are not specifying that they must be adjacent in either string. 

This looks easy, one would think. Not so fast; it is a minefield. If you take away  the requirement that the characters need to be adjacent, then the number of alternative solutions become much greater.

Take the classic sample strings ‘thisisatest’ and ‘testing123testing’. If you just choose the first available matches in common you get the wrong solution. Tisi,  you might get ‘test’  but there is a longer common substring ‘tsitest’. You can just do the brute-force approach of finding all common subsequences and choosing the best, but believe me it doesn’t scale! The best approach I know of is the dynamic programming solution. (see also the Hirschberg algorithm and the ‘greedy algorithm’)  This  has been described in great length elsewhere, even on Youtube. There are a number of solutions for different languages, and even in SQL, but I thought I’d do my own, based on a number table.

We’ll leave out the number table build, since I’ve already posted that in a previous blog

There are some very ingenious solutions in Oracle and some good purely iterative SQL Solutions. I’m certain that there is a good solution using window functions but I thought I ought to leave something for the readers of this blog!