String Comparisons in SQL: The Longest Common Substring.

I’ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I’ve written one. I hope someone else finds it useful.

SQL isn’t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (inverted indexes), suffix trees  or tries  so as to allow it to do exact comparisons it is very quick, but this isn’t usually possible because data changes so quickly. You can use the brute-force search through text using wildcards, but this isn’t usually appropriate when dealing with large results.  Of course, you can use add-in components such as full-text search  to do the job for you in SQL Server

What have we got that is useful for comparing  text?

  • Is the first string contained in the second string?

  • Are the strings identical?

  • Is the first string at the start of the second string? Charindex (‘first’,’Second’)=1

  • Is the first string at the end of the second string?

What have we got that is just-about useless?

  • Soundex:  Difference() and Soundex()

What is missing? Here are the main ones.

  • Built-in Regex Searches – CLR routines are fine, but it would be nice to have standardised built-in functions for doing this just as long as nobody expects them to be SARGable.
  • Longest common subsequence – What is the longest sequence of characters in common between the two strings
  • Longest Common Substring -what is the longest string in common between the two strings: The equivalent of the inner join
  • Levenshtein distance -how similar are the two strings? (how many edits are needed to get from one string to the other).

There are several others, such as ‘Hamming Difference’, ‘MostFreqKSDF‘ or the Jaccard Index, for judging the similarity of text. Many of them have esoteric uses in genetics or pattern recognition, whereas we’re much more interested in chores such as data cleansing and input validation.

Plenty of people would argue that we have no real reason to need anything complicated in terms of string comparison in SQL, particularly as it would be hard to scale it up to the size  of  relations we deal string comparison. There is some truth in this. Even the built-in string functions can be death to a query if used to filter results.

In this blog, I’ll tackle the  Longest Common Substring,  which is a common, and useful, function that tells you the longest common substring between two strings.

If you, for example, were to compare ‘And the Dish ran away with the Spoon’ with ‘away’, you’d get ‘away’ as being the string in common. Likewise, comparing ‘465932859472109683472’ with ‘697834859472135348’ would give you ‘8594721’

With any SQL function, you have slow performance, but this does not entirely preclude their use.  String comparisons tend to be slow by the nature of sequence being so important. In the case of string comparison, you can very quickly get bogged down unless you ‘think relationally’.  My first attempt at this routine, copying the standard ‘dynamic’ algorithm, comparing the first 1120 character paragraph of ‘Moby Dick’ with itself, took 28 seconds.  A quick change to a more SQL-based way of doing it shortened the comparison to two seconds.  Even this is extravagant by C# standards but string comparisons never come cheap.

The routine I ended up with isn’t portable beyond SQL Server or Sybase, since it uses ‘quirky update’. However, if anyone can come up with a more portable method that is as fast, I’d be delighted. I ended up deciding that not only would you want the string that matched, but where it was in both strings and how long it was. Come to think of it, if you are doing a DIFF between strings, then you’d want to know all the substrings, not just the longest one, and so I devised a table function that would give you that if required