I’ve always wanted a SQL function that tells me the longest substring shared between two strings. As a present to myself, I’ve written one. I hope someone else finds it useful.
SQL isn’t particularly good at searching for strings within text. If you prepare things properly by creating inversion tables (inverted indexes), suffix trees or tries so as to allow it to do exact comparisons it is very quick, but this isn’t usually possible because data changes so quickly. You can use the brute-force search through text using wildcards, but this isn’t usually appropriate when dealing with large results. Of course, you can use add-in components such as full-text search to do the job for you in SQL Server
What have we got that is useful for comparing text?
- Is the first string contained in the second string?
1234Select case when Patindex ('%first%','second') >0 then 'true' else 'False' end-- useful with wildcard characters-- orSelect case when Charindex ('first','Second')>0 then 'true' else 'False' end
- Are the strings identical?
1Select case when 'first'='Second' then 'true' else 'False' end
- Is the first string at the start of the second string? Charindex (‘first’,’Second’)=1
1Select case when Charindex ('first','Second')=1 then 'true' else 'False' end
- Is the first string at the end of the second string?
1Select case when Patindex ('%first','second') >0 then 'true' else 'False' end
What have we got that is just-about useless?
- Soundex: Difference() and Soundex()
What is missing? Here are the main ones.
- Built-in Regex Searches – CLR routines are fine, but it would be nice to have standardised built-in functions for doing this just as long as nobody expects them to be SARGable.
- Longest common subsequence – What is the longest sequence of characters in common between the two strings
- Longest Common Substring -what is the longest string in common between the two strings: The equivalent of the inner join
- Levenshtein distance -how similar are the two strings? (how many edits are needed to get from one string to the other).
There are several others, such as ‘Hamming Difference’, ‘MostFreqKSDF‘ or the Jaccard Index, for judging the similarity of text. Many of them have esoteric uses in genetics or pattern recognition, whereas we’re much more interested in chores such as data cleansing and input validation.
Plenty of people would argue that we have no real reason to need anything complicated in terms of string comparison in SQL, particularly as it would be hard to scale it up to the size of relations we deal string comparison. There is some truth in this. Even the built-in string functions can be death to a query if used to filter results.
In this blog, I’ll tackle the Longest Common Substring, which is a common, and useful, function that tells you the longest common substring between two strings.
If you, for example, were to compare ‘And the Dish ran away with the Spoon’ with ‘away’, you’d get ‘away’ as being the string in common. Likewise, comparing ‘465932859472109683472’ with ‘697834859472135348’ would give you ‘8594721’
With any SQL function, you have slow performance, but this does not entirely preclude their use. String comparisons tend to be slow by the nature of sequence being so important. In the case of string comparison, you can very quickly get bogged down unless you ‘think relationally’. My first attempt at this routine, copying the standard ‘dynamic’ algorithm, comparing the first 1120 character paragraph of ‘Moby Dick’ with itself, took 28 seconds. A quick change to a more SQL-based way of doing it shortened the comparison to two seconds. Even this is extravagant by C# standards but string comparisons never come cheap.
The routine I ended up with isn’t portable beyond SQL Server or Sybase, since it uses ‘quirky update’. However, if anyone can come up with a more portable method that is as fast, I’d be delighted. I ended up deciding that not only would you want the string that matched, but where it was in both strings and how long it was. Come to think of it, if you are doing a DIFF between strings, then you’d want to know all the substrings, not just the longest one, and so I devised a table function that would give you that if required
--This will need a NUMBERS table, stocked with numbers. If you haven't got one
--this will create it automatically
IF NOT EXISTS (SELECT 1 FROM information_Schema.Tables
CREATE TABLE [dbo].[Numbers]
CONSTRAINT [Index_Numbers] PRIMARY KEY CLUSTERED ([number] ASC)
IF NOT EXISTS(SELECT 1 FROM numbers WHERE number=99999)
TRUNCATE TABLE numbers
;WITH Digits(i) AS
FROM (VALUES (1), (2), (3), (4), (5), (6), (7), (8), (9), (0)) AS X(i))
INSERT INTO numbers(number)
SELECT (D6.i*1000000 +D5.i*100000 + D4.i*10000 + D3.i * 1000 + D2.i * 100
+ D1.i * 10 + D0.i + 1) AS seq
FROM Digits AS D0, Digits AS D1, Digits AS D2, Digits AS D3,
Digits AS D4, Digits AS D5, Digits AS D6
IF OBJECT_ID (N'LongestCommonSubstring') IS NOT NULL
DROP FUNCTION LongestCommonSubstring
CREATE FUNCTION LongestCommonSubstring
The longest common subSubstring (LCS) tells you the longest common substring between two strings.
If you, for example, were to compare 'And the Dish ran away with the Spoon' with 'away', you'd
get 'away' as being the string in common. Likewise, comparing '465932859472109683472' with
'697834859472135348' would give you '8594721'. This returns a one-row table that gives you the
length and location of the string as well as the string itself. It can easily be modified to give
you all the substrings (whatever your criteria for the smallest substring. E.g. two characters?
Author: Phil Factor
date: 05 Dec 2014
Select * from dbo.LongestCommonSubstring ('1234', '1224533324')
Select * from dbo.LongestCommonSubstring ('thisisatest', 'testing123testing')
Select * from dbo.LongestCommonSubstring ( 'findthishere', 'where is this?')
Select * from dbo.LongestCommonSubstring ( null, 'xab')
Select * from dbo.LongestCommonSubstring ( 'not beginning-middle-ending',
the longest common subString as a string
RETURNS @hit TABLE --returns a single row table
--(it is easy to change to return a string but I wanted the location of the match)
MatchLength INT,--the length of the match. Not necessarily the length of input
FirstCharInMatch INT,--first character of match in first string
FirstCharInString INT,--first character of match in second string
CommonString VARCHAR(8000) --the part of the FirstString successfully matched
DECLARE @Order INT, @TheGroup INT, @Sequential INT
--this table is used to do a quirky update to enable a grouping only on sequential characters
DECLARE @Scratch TABLE (TheRightOrder INT IDENTITY PRIMARY KEY,TheGroup smallint, Sequential INT,
FirstOrder smallint, SecondOrder smallint, ch CHAR(1))
--first we reduce the amount of data to those characters in the first string that have a match
--in the second, and where they were.
INSERT INTO @Scratch ( TheGroup , firstorder, secondorder, ch)
SELECT Thefirst.number-TheSecond.number AS TheGroup,Thefirst.number, TheSecond.number, TheSecond.ch
FROM --divide up the first string into a table of characters/sequence
(SELECT number, SUBSTRING(@FirstString,number,1) AS ch
FROM numbers WHERE number <= LEN(@FirstString)) TheFirst
INNER JOIN --divide up the second string into a table of characters/sequence
(SELECT number, SUBSTRING(@SecondString,number,1) AS ch
FROM numbers WHERE number <= LEN(@SecondString)) TheSecond
ON Thefirst.ch= Thesecond.ch --do all valid matches
ORDER BY Thefirst.number-TheSecond.number, TheSecond.number
--now @scratch has all matches in the correct order for checking unbroken sequence
SELECT @Order=-1, @TheGroup=-1, @Sequential=0 --initialise everything
UPDATE @Scratch --now check by incrementing a value every time a sequence is broken
SET @Sequential=Sequential =
CASE --if it is not a sequence from the one before increment the variable
WHEN secondorder=@order+1 AND TheGroup=@TheGroup
THEN @Sequential ELSE @Sequential+1 END,
--now we just aggregate it, and choose the first longest match. Easy
INSERT INTO @hit (MatchLength,FirstCharInMatch, FirstCharInString,CommonString)
SELECT TOP 1 ---just the first. You may want more so feel free to change
COUNT(*) AS MatchLength,
MIN(secondorder) AS FirstCharInString,
COUNT(*)) AS CommonString
FROM @scratch GROUP BY TheGroup,Sequential
ORDER BY COUNT(*) DESC, MIN(firstOrder) ASC, MIN(SecondOrder) ASC
END--and we do a test run
--do an outer apply to check the obvious flaws and raise an error
--if any erros appear.
IF EXISTS (
SELECT firstString, secondString,correct, LCS.*
('Call me Ishmael. Some years ago...','Something','Some' ),
('unrestfulness','having little or no money in my purse, and nothing particular to interest me on shore','rest' ),
('I find myself involuntarily pausing before coffin warehouses','Jailhouse rock','house'),
('protest is useless','I need to test this routine. Tests are valuable','test' )
AS X(FirstString,secondString, Correct)
OUTER APPLY dbo.LongestCommonSubstring(firstString, secondString) AS LCS
RAISERROR ('the LongestCommonSubstring routine has broken',16,1)