String Comparisons in SQL: Edit Distance and the Levenshtein algorithm

Sometimes you need to know how similar words are, rather than whether they are identical. To get a general measure of similarity is tricky, impossible probably, because similarity is so strongly determined by culture. The Soundex algorithm can come up with some matches but insists that, for example, ‘voluptuousness’ and ‘velvet’ are similar. Family genealogists in Britain would never find a search algorithm that helped them to realise that some branches of the Theobald family spell their surname ‘Tibble’, though Soundex knows they are quite close. To gauge this sort of similarity, you need to know the context of language, pronunciation, culture and semantics. However, we can fall back on more scientific measures based on comparing sequences, though they will never be perfect.

‘Edit distance’ is the obvious way of measuring similarity of sequences of values, and strings are just sequences of characters. How many edits are needed to convert the first string into the second? Edits can be insertions, deletions, replacements or transpositions. Algorithms to do this ‘string-to-string correction problem‘ for any sequences have been current since the Sixties, and have become increasingly refined for a range of sciences such as genome research, Forensics, Dendrochronology, and for predictive text.

The Dynamic approach to finding an edit distance is best done with a matrix. Solutions for this exist for all manner of languages. SQL is handicapped by having no support for matrices. An equivalent relation can be used instead, but it is overkill, and slow. We don’t need all the wonderful things a relation gives us. The fastest dynamic solutions in SQL tend to use Unicode strings as cod matrices since these can be cajoled into use as integer matrixes, but they are painful to use and debug.

Fortunately, Third-party CLR functions exist for calculating Damerau-Levenshtein Distance in SQL Server. Levenshtein Distance, developed by Vladimir Levenshtein in 1965, is the algorithm we learn in college for measuring edit-difference. It doesn’t deal perfectly with transpositions because it doesn’t even attempt to detect them: it records one transposition as two edits: an insertion and a deletion. The Damerau-Levenshtein algorithm for Edit Distance solves this.

Here is a simple implementation of the Levenshtein algorithm, using the full matrix. In this version, I’ve restricted the length of string just so as to get a good performance. I doubled performance by removing the NVARCHAR(MAX)s. I is, hopefully, useful just to understand the algorithm, because once all the clever short-cuts are used, it can all get rather opaque. 

Here is the same basic code with the additions to turn it into the Damerau-Levenshtein algorithm. 

The best SQL solution I know of for the Levenshtein algorithm is the one attributed pseudonymously to ‘Arnold Fribble’ (possibly a reference to Arnold Rimmer of Red Dwarf, and his ‘friend’ Mr Flibble.) which is a SQL version of the improved Levenshtein algorithm that dispenses with the full matrix and just uses two vectors instead. It is featured here. There are improved versions around, such as the one here. They take around two thirds of the time to produce their result compared with the full-matrix version.

Can this be done in a relational way? Yes, but without support for matrixes, it is a bit painful, with the full-matrix version taking ten times as long as the string-based version. There is plenty of room for improvement though.

What to use? If I need the performance, and I’m allowed CLR, I use the CLR approach. Otherwise, I still go for the ‘Arnold Fribble’ approach to comparing strings  It scales reasonably well and is reasonably fast, but I can’t help thinking that with window functions and  number table, it would be possible to get an ‘edit distance’ figure  efficiently from a routine that is genuinely relational.