Searching for Strings in SQL Server Databases

For most of the time you will get adequate powerful searches from a well-normalized database that has all possible constraints and indexes. Just occasionally, usually when you’re searching through text, or you’ve inherited the work of someone whose major talents lie in other areas of life, you need other magic. This article is about some of the SQL magic that might be required.

The first type of search I’ll refer to is the ‘Brute-force’ search, where performance isn’t the primary concern. Even with the best of intentions, it is possible to find oneself searching whole blocks of text in a database, or searching across columns. Even as an experienced database developer, I find myself occasionally having to do ad-hoc searches for strings in databases, especially in large chunks of text. Anyone like me, who has to mop up after a database goes wrong, has a number of techniques in their metaphorical toolbox for getting to grips with the data.

The second type of search is the ‘search-engine’ type of search when you need to implement clever text-searching facilities in a database, the FullText indexing features of SQL Server (FTS) are great, and should always be the first port of call. By ‘clever’ we mean finding two words near each other, matching any word derived from a particular root (for example ‘search’, ‘searching’, ‘searched’), finding several words with distinct weightings, or doing a fuzzy search for a word or phrase. There are a few rare , but interesting, exceptions when you actually need something more special-purpose and ‘hand-crafted’.

We’ll tackle both types of searches in this article, but not Full-Text Search (FTS). For more on FTS, see Full-Text searches on documents in Filetables,, FTS features, and Rob Sheldon’s Full-Text Indexing Workbench.

Ad-hoc Brute-force searches of databases.

With no great sweat, it is possible to search for a phrase in a table or result using LIKE or PATINDEX. This is often called the ‘brute-force search. In many cases, it is surprisingly effective. I’ve gone into detail in how to use the wildcard features of PATINDEX here in the PATINDEX Workbench, and give an example of its use in a migration script for a database refactoring that involved creating two new columns in an address table.

The blunt instrument – Search anywhere in a query result

For the crudest sort of searches, we need to represent all the data from the result of a SQL query in string-format. The easiest way of accomplishing this is to get the result of a SQL Expression in XML form and then just keep the contents of every element in one big string. To do this, we’ll use this function.

IF OBJECT_ID(N'dbo.StripXMLTags') IS NOT NULL DROP FUNCTION dbo.StripXMLTags

create function dbo.StripXMLTags( @xml XML ) returns varchar(max) as

begin

declare @TextWithoutTags varchar(max);

with TheXMLDocument(contents) as

(

select TheXML.node.query('.') from @XML.nodes('/') as TheXML(node)

)

select @TextWithoutTags = contents.value('.', 'varchar(max)') from TheXMLDocument

return @TextWithoutTags

end

To show how quick and effective this is, we’ll search all columns of 20,000 rows in AdventureWorks’ person.contact table under two seconds */

1	Select patindex('%P.O Box 5%',dbo.StripXMLTags((Select * from person.contact for xml path)))

Note that we’re using the old 2008 version of AdventureWorks. Please amend to taste if using the later version. Of course, if you were looking for a particular string, you could display the context of the first string found, if any (otherwise NULL)

with TheText(TheResult) as (Select dbo.StripXMLTags((Select * from person.contact for xml path)))

Select case when HitLocation>0 then substring(TheResult,HitLocation-20,40) else null end from

(select patindex('%P.O Box 5%',theResult) as hitLocation, Theresult from TheText)f

If we wanted to know the location of all the strings, then we might put the matches into a table

Declare @string varchar(max),

@len int, @Hit int, @ii int

Declare @FoundStrings table (context varchar(40), location int)

Select @string= dbo.StripXMLTags((Select * from person.contact for xml path))

Select @len=len(@String),@ii=@len

While @ii>0

begin

Select @hit= patindex('%hank%',right(@string,@ii))

if @hit=0 break

insert into @FoundStrings(Context, Location)

Select substring (@String,@len-@ii+@hit,40),@len-@ii+@hit

Select @ii=@ii-@hit

end

Select * from @FoundStrings

As you can see, it is possible to search the results of any SQL expression you need. It is perfectly possible to automatically search every table in the database if you’re really keen on finding all occurrences of a string.

You’d want to turn it into a function for use

IF OBJECT_ID(N'dbo.SearchWithinAQuery') IS NOT NULL DROP FUNCTION dbo.SearchWithinAQuery

create function dbo.SearchWithinAQuery( @xml XML, @SearchPattern Varchar(80) )

/* searches the inner text within an XML document, returning the context of

every match. The searches are in PATINDEX format

Author: Phil Factor

Revision: 1.3

date: 8 Apr 2014

example:

- code: SELECT * from dbo.SearchWithinAQuery((select * from person.contact for xml path ),'%hank%')

RETURNS

@FoundStrings TABLE

(

context varchar(40),

location int --the substring location in which the string is found

)

begin

Declare @string varchar(max), @len int, @Hit int, @ii int

Select @string= dbo.StripXMLTags(@xml), @len=len(@String),@ii=@len

While @ii>0--loop around until all instances are found

begin

Select @hit= patindex(@SearchPattern,right(@string,@ii))

if @hit=0 break --if no more are to be found

insert into @FoundStrings(Context, Location)--insert each found item

Select substring (@String,@len-@ii+@hit,40),@len-@ii+@hit

Select @ii=@ii-@hit--decrement the index from the right of the string

end

Return

end

You might want to use the same ‘PATINDEX’ loop to also search through free-form notes or text documents. Here we look through Bram Stoker’s Dracula to look for instances of the word ‘terror’. (I’ve included the text as a zip file at the bottom of the article – remember to change the path!)

DECLARE @string VARCHAR(MAX), @len int, @Hit int, @ii int

SELECT @string = BulkColumn --read the text of a file

FROM OPENROWSET(BULK 'D:\files\dracula.txt', SINGLE_BLOB) AS x

--read in the text of Dracula

Declare @FoundStrings table (context varchar(40), location int)

---our results table variable

Select @len=len(@String),@ii=@len

While @ii>0 --while there is some of the string to do

begin

Select @hit= patindex('%terror%',right(@string,@ii))

if @hit=0 break

insert into @FoundStrings(Context, Location)

Select substring (@String,@len-@ii+@hit-20,40),@len-@ii+@hit

Select @ii=@ii-@hit

end

Select * from @FoundStrings

OK. That’s fine if you are just trying to find a particular string, but you could want something a bit more refined.

A sharper stick for free-form searches.

Sometimes, for example, you want a way of selecting customers from your database in a Google-style way, without having to bother with whether it is a surname, town, postcode or customer_ID that you are typing in.

Imagine that you, working for AdventureWorks know only that your customer has the word hacienda in the address but you reckon that his postcode starts with ’94’ With this code, you’ve homed in on Mr Wright from your 18508 customers in half a second

Select ContactString from

(Select coalesce(title+' ','')+firstname+' '+coalesce(middleName+' ','')

+lastname+coalesce(' '+suffix,'')+' Email:'+emailAddress+' Phone:'+Phone

+' Address: '+AddressLine1+ coalesce(' '+AddressLine2,'')+', '+City+ ' '

+postalcode +' State/Province:'+SP.Name+ ' Country:'+SP.CountryRegionCode

as ContactString

from person.contact c

inner join Sales.Individual I ON C.ContactID = I.ContactID

inner join Sales.CustomerAddress AS CA ON CA.CustomerID = I.CustomerID

inner join Person.Address AS A ON A.AddressID = CA.AddressID

inner join person.StateProvince SP ON A.StateProvinceID=SP.StateProvinceID

where ContactString like '%Hacienda%94%'

Where databases are relatively small, you can do well with routines that search several columns of views, or derived tables. As soon as the amount of data grows, this becomes a bad idea. Note also that if you didn’t know the order of occurrence within the line, you’d have to do …

1 2	... where ContactString like '%Hacienda%94%' or ContactString like '%94%Hacienda%'

… which doubles the time it takes. We’ll revisit this problem later on after I’ve introduced you to some ‘search-engine’ techniques.

If you have a small database, though, the brute-force approach is quick and easy, but mostly just for ad-hoc work, since these queries tend to have a high CPU load.

The Inversion Technique for Searching databases

Here is a very simple technique, which we’ve used for a long time with a succession of applications. It is usually called the ‘Inverted’ or ‘Inversion’ index technique. (see Search engine indexing for a full discussion). Both FullText indexing and Apache Lucene use it. MapReduce is an elaboration of the same simple technique. The technique scales very well and we’ve tested it indexing up to 8 million documents. Even when searching through eight million documents, It gets results within three seconds if the indexes are right, but you’d expect most searches to be far quicker. In the example code I’ll use to demonstrate, It found strings in the King James Bible in 20 ms that took just under two seconds with a brute-force search. Because each search takes preparation, and any permanent inversion index takes considerable maintenance to cope with insertions, updates and deletes, it isn’t much good for ad-hoc work.

Firstly, we will need to create our inversion index. I generally use two tables, one of which has every word in it that exists within the data you want to search, and another table that stores the location of every instance of every word. The term ‘location’ will mean different things in practice, depending on the problem you want to solve, but in these examples, I will use the publication(or document), offset character location, and sequence (1..n)

First we’ll need a way of chopping text into its constituent words (The Map process). This is a routine that does the job but it is slow. I’ve found nothing as quick in TSQL, but it is an ideal candidate for a CLR. I’ve left one of my quick unit-test routines just to show how I do such things.

/*--------------------*/

IF OBJECT_ID(N'IterativeWordChop') IS NOT NULL DROP FUNCTION IterativeWordChop

CREATE FUNCTION [dbo].[IterativeWordChop]

summary: >

This Table-valued function takes any text as a parameter and splits it into its constituent words, passing back the order in which they occured and their location in the text.

Author: Phil Factor

Revision: 1.3

date: 2 Apr 2014

example:

- code: SELECT * FROM IterativeWordChop('this tests stuff. Will it work?')

- code: SELECT * FROM IterativeWordChop('this ------- tests it again; Will it work ...')

- code: SELECT * FROM IterativeWordChop('Do we allow %Wildcards% like %x%?')

returns: >

Table of SequenceNumber, item (word) and sequence no.

**/

(

@string VARCHAR(MAX)

)

RETURNS

@Results TABLE

(

Item VARCHAR(255),

location int,

Sequence int identity primary key

)

BEGIN

DECLARE @Len INT, @Start INT, @end INT, @Cursor INT,@length INT

SELECT @Cursor=1, @len=LEN(@string)

WHILE @cursor<@len

BEGIN

SELECT @start=PATINDEX('%[^A-Za-z0-9][A-Za-z0-9%]%',

' '+SUBSTRING (@string,@cursor,50)

)-1

if @start<0 break

SELECT @length=PATINDEX('%[^A-Z''a-z0-9-%]%',SUBSTRING (@string,@cursor+@start+1,50)+' ')

INSERT INTO @results(item, location)

SELECT SUBSTRING(@string,@cursor+@start,@length), @cursor+@start

SELECT @Cursor=@Cursor+@Start+@length+1

END

RETURN

END

---sanity-check the word-chop routine to make sure it is more or less working

if exists (select count(*)

from (

Select item, location,sequence

FROM IterativeWordChop('Hello. This tests it again; Will it work? ...')

union all

Select * from ( values

('Hello',1,1), ('This',8,2), ('tests',13,3), ('it',19,4), ('again',22,5), ('Will',29,6),

('it',34,7), ('work',37,8)) CorrectResult(item, location,sequence)

)f --our 'correct' values

group by sequence, location,item having count(*)<2

)

Raiserror('fault in IterativeWordChop', 16,1)

Now, armed with a way of chopping up text. We can now create the basic tables. We have one table that contains all the words in our set of publications, and another table that maps the text, word by word, recording the location, sequence and publication. In this first example, we’re only going to have one publication.

--tidy up any existing data from a previous run.

IF OBJECT_ID(N'publication') IS NOT NULL DROP table publication

IF OBJECT_ID(N'WordOccurence') IS NOT NULL DROP table wordOccurence

IF OBJECT_ID(N'Word') IS NOT NULL DROP table word

--create a table with the occurrence of each word in it.

Create table Word

(

Item VARCHAR(255) not null primary key,

frequency int not null default(0)

)

--create a table with the word-mapping in it.

Create table WordOccurence

(

Item VARCHAR(255) not null ,

location int not null,

Sequence int not null,

Publication int not null

)

ALTER TABLE WordOccurence ADD PRIMARY KEY(item, sequence);

Create table publication

(

Publication_ID int primary key,

TheText Varchar(max)

Because of the way we want insert the data, the foreign key gets inserted later.

Now we have got this far, we can read the entire King James Bible (which is there on the server’s disk -download it from the head of the article) into our publication table and index it up (map each word) This will hopefully give us enough text to give the algorithm a realistic size of data.

/* Let's put the text of a large book into a VarChar(MAX)variable */

DECLARE @LotsOfText VARCHAR(MAX)

SELECT @LotsOfText = BulkColumn

-- 2005 onwards only for this code)

FROM OPENROWSET(BULK 'D:\files\kjb.txt', SINGLE_BLOB) AS x

insert into publication (Publication_ID, TheText) values(1,@LotsOfText)

insert into WordOccurence (Item,location,Sequence,Publication)

SELECT Item,location,Sequence,1 FROM Iterativewordchop(@LotsOfText)--

insert into word(item, frequency)

Select item, count(*)

from WordOccurence group by item

ALTER TABLE WordOccurence ADD FOREIGN KEY (item) REFERENCES word

So we can begin testing this and trying various experiments. Just to kick things off, I’ve implemented a function that will find exact phrases, and show instance where they occur in the text. This receives the words you wish to search for, and then determines how rare they are. It searches only for the rarest in the phrase (you can adjust the proportion.

In practice, I only search for the rarest words in a string and then use a regex to do a brute-force search once the search has been narrowed down to the locations that have the rarest words together in them. This cuts out a lot of tiresome logic! In this case, we use a variant of relational division to find the locations where all the words occur, but it isn’t so much use for fuzzier searches. The fuzzier searches are too complicated to make for an entertaining read in an article, but it is a great exercise to think them up. For a start, you could do quite a lot of fuzzy searches using a LIKE join between your search words and the WORD table.

IF OBJECT_ID(N'FindString') IS NOT NULL DROP FUNCTION FindString

CREATE FUNCTION [dbo].[FindString]

summary: >

This Table-valued function takes text as a parameter and tries to find it in the WordOccurence table

Author: Phil Factor

Revision: 1.3

date: 10 Apr 2014

example:

- code: SELECT * FROM FindString('Who is he that hideth counsel without knowledge?',1)

- code: SELECT * FROM FindString(' And I turned to see the voice that spake with me. And being turned, I saw seven golden candlesticks')

returns: >

passes back the location where they were found, and the number of words matched in the string.

**/

(

@string VARCHAR(100),

@Publication int

)

RETURNS

@finds table

(location int,

hits int)

BEGIN

Declare @WordsToLookUp table(Item VARCHAR(255), location int, Sequence int primary key)

Declare @wordCount int, @searches int

-- chop the string into its constituent words, with the sequence

insert into @WordsToLookUp(Item,location,Sequence)

Select Item,location,Sequence FROM dbo.Iterativewordchop (@string)

-- determine how many words and work out what proportion to search for

Select @WordCount =count(*) from @WordsToLookUp

Select @searches=case when @wordcount < 3 then @wordcount else 2+(@wordcount/2) end

insert into @finds (location, hits)

Select min(wordOccurence.location), count(*) as matches

from WordOccurence

inner join

(Select top (@searches) word.item, searchterm.sequence from

@WordsToLookUp searchterm

inner join Word

on searchTerm.item=word.item order by frequency

) LessFrequentWords(item,Sequence)

on LessFrequentWords.item =WordOccurence.item

where publication=@Publication

group by wordoccurence.sequence-LessFrequentWords.sequence

having count(*)>= @searches order by count(*) desc

return

end

So now, armed with this, we can soon become ‘well versed’ in the bible, being able to quote chapter and verse for all sorts of biblical quotations.

Declare @originalText Varchar(max)

Select @originalText=TheText from publication where publication_ID=1

Select substring(@originalText,Location,180), hits FROM FindString('Which is the

earnest of our inheritance until the redemption of the purchased

possession, unto the praise of his glory.',1)

Select substring(@originalText,Location,180), hits FROM FindString('antichrist',1)

Select substring(@originalText,Location,60), hits FROM FindString('unto Philadelphia, and unto Laodicea',1)

I’ve already shown you how to find all instances of the word ‘Terror’ in the book ‘Dracula’ by brute-force. You can substitute Dracula for the KJB so as to use the inversion-index form, and try the two back-to-back. I’ve shown you a simple algorithm, but with everything laid out workbench-style, I guess that you’ll soon find ways of improving performance.

We’ll end with a more typical problem, and one that I’ve had to solve more than once! The application developers want the end users to locate customers or groups of customers by means of a simple google-like search. We negotiate an interface where a stored procedure receives a string containing words and we return all the customers that have all the words in. We allow wildcards

We can start off by creating our inversion tables: very similar to the previous example.

--if the index tables exist, then delete them

IF OBJECT_ID(N'CustomerWordOccurence') IS NOT NULL DROP table CustomerwordOccurence

IF OBJECT_ID(N'CustomerWord') IS NOT NULL DROP table Customerword

--the table that holds each word (and its frequency

Create table CustomerWord

(

Item VARCHAR(255) not null primary key,

frequency int not null default(0)

)

--the table that maps customer data, holding each word, and its location

Create table CustomerWordOccurence

(

Item VARCHAR(255) not null ,

location int not null,

Sequence int not null,

Publication int not null

)

ALTER TABLE CustomerWordOccurence ADD PRIMARY KEY(item, sequence,publication);

This time, we can create a text version of each row along with a primary key, and map that text. We use a very similar technique to the previous example , but each row becomes its own ‘publication’

/* first we map the text value of each row into the CustomerWordOccurence table, recording the location of each word */

insert into CustomerWordOccurence

Select item, location, sequence,contactid from

(Select c.contactID, coalesce(title+' ','')+firstname+' '+coalesce(middleName+' ','')

+lastname+coalesce(' '+suffix,'')+' Email:'+emailAddress+' Phone:'+Phone

+' Address: '+AddressLine1+ coalesce(' '+AddressLine2,'')+', '+City+ ' '

+postalcode +' State/Province:'+SP.Name+ ' Country:'+SP.CountryRegionCode

as ContactString

from AdventureWorks.person.contact c

inner join AdventureWorks.Sales.Individual I ON C.ContactID = I.ContactID

inner join AdventureWorks.Sales.CustomerAddress AS CA ON CA.CustomerID = I.CustomerID

and AddressTypeID=2

inner join AdventureWorks.Person.Address AS A ON A.AddressID = CA.AddressID

inner join AdventureWorks.person.StateProvince SP ON A.StateProvinceID=SP.StateProvinceID

cross apply dbo.iterativeWordChop(contactString)

/* then we store each unique word, storing its frequency ( we don't use the frequency in this example code) */

insert into Customerword(item, frequency)

Select item, count(*)

from CustomerWordOccurence group by item

ALTER TABLE CustomerWordOccurence ADD FOREIGN KEY (item) REFERENCES Customerword

Now we can define our procedure that returns a conventional result corresponding to the AdventureWorks customers (Sorry, AdventureWorks 2008 only) In this case we are allowing wildcards so we use the CustomerWord table to speed up that search (it takes a fifth of the time compared with an inner joing ON LIKE).

/* and we create the stored procedure that we've agreed with the application programmers.*/

IF OBJECT_ID(N'CustomerAddress') IS NOT NULL DROP procedure CustomerAddress

CREATE PROCEDURE CustomerAddress

/***summary: >

Select rows froom AdventureWorks 2008 based on the lookup of the words in the CustomerWordOccurence and CustomerWord tables, in which are stored every word from the data, along with the contactid of the contact

Author: Phil Factor

Revision: 1.0

date: 14 Apr 2014

example:

- code: Execute CustomerAddress 'Andrew california'

- code: Execute CustomerAddress '%x% Birmingham england'

- code: Execute CustomerAddress '%orge% british columbia'

returns: >

**/

@SearchString Varchar(200) = '%%'

BEGIN

Declare @WordsToLookUp table(Item VARCHAR(255), location int, Sequence int primary key)

Declare @WordCount int

insert into @WordsToLookUp(Item,location,Sequence)

Select Item,location,Sequence FROM dbo.Iterativewordchop (@SearchString)

Select @WordCount =count(*) from @WordsToLookUp

Select c.ContactID, NameStyle, Title, FirstName, MiddleName, LastName, Suffix,

EmailAddress, EmailPromotion, Phone, AddressLine1,AddressLine2,City,

PostalCode,SP.Name as StateProvince

from AdventureWorks.person.contact c --select your result

inner join AdventureWorks.Sales.Individual I ON C.ContactID = I.ContactID

inner join AdventureWorks.Sales.CustomerAddress AS CA ON CA.CustomerID = I.CustomerID

and AddressTypeID=2

inner join AdventureWorks.Person.Address AS A ON A.AddressID = CA.AddressID

inner join AdventureWorks.person.StateProvince SP ON A.StateProvinceID=SP.StateProvinceID

where c.contactID in

(Select publication from --where you have at least one hit of each

(select publication,WLU.Sequence

from CustomerWord CW

inner join @WordsToLookUp WLU on CW.item like WLU.item

inner join CustomerWordOccurence CWO on CW.item=CWO.item

group by publication,WLU.Sequence)AllWordMatches

group by publication having count(*)=@WordCount)

End

And we can try a few searches out to see what customers AdventureWorks has got.

Execute CustomerAddress 'taylor palo alto'

Execute CustomerAddress 'Andrew california'

Execute CustomerAddress 'rain drop'

Execute CustomerAddress '%x% Birmingham england'

Execute CustomerAddress '%orge% british columbia'

Naturally, in real life, you would need to update the inversion tables every time you do a change to the database. The logic is simple, you just delete all entries in the mapping table that relate to the changed rows and just insert the updated mapping. I’ll not show you how to do that in AdventureWorks, though, as it is pretty routine stuff.

So here we have it. I’ve introduced a number of different techniques that can be used to do searches through data in SQL Server databases. With all the examples, the more useful they are, the more effort in writing them, or maintaining the inversion/mapping tables. In the examples, I’ve tried to keep the code simple and, with the invertion indexes in particular, there are many ingenious optimizations and tricks you can add. I’ve seen versions, for example, that allow complex expressions in the query terms, and attempts to order by relevance. There are all manner of proximity tricks and wildcard techniques. I hope you’ll be moved to try some out!

Register for Simple Talk

Searching for Strings in SQL Server Databases

Ad-hoc Brute-force searches of databases.

The blunt instrument – Search anywhere in a query result

A sharper stick for free-form searches.

The Inversion Technique for Searching databases

See Also

Article tags

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases

Ad-hoc Brute-force searches of databases.

The blunt instrument – Search anywhere in a query result

A sharper stick for free-form searches.

The Inversion Technique for Searching databases

See Also

Article tags

Recommended

About the author

Phil Factor

Phil Factor's contributions

Articles

Books

Top topics

Phil Factor's latest contributions:

Accessibility, and the Need for Ingenious Simplicity in Computer Scripts, Languages and Programs

Data Documents and Common Sense

AI and Databases