Exploring Semantic Search Key Term Relevance

Semantic Search is the new feature introduced in SQL Server 2012 that allows you to extract statistically relevant words, referred to by Microsoft as ‘key terms’, from unstructured documents stored in SQL Server. In addition to key term extraction, you can also find documents that are similar to one another based on the shared relevant key terms.

Note: I say “key terms” but you may see official Microsoft documentation call them “key phrases.” This is a bit of a misnomer because as of SQL Server 2012 RTM, these “phrases” are unigrams (single word terms). I do hope we’ll see this expand in future versions to support bigrams, trigrams and beyond.

If you’re wondering whether you can use Semantic Search for your own application requirements, I strongly recommend you test this functionality using your own data. The viability of this new feature depends on the nature of the data being explored. For example, if you’re looking to create an application that pulls unigrams from a set of customer documents, you’ll want to make sure that the terms that were extracted are truly relevant to the subject matter and not assume that all subject areas have equivalent support by the Semantic Search feature.

I’ve experimented with various types of SQL Server related documents in the past and I do see that in some cases the key terms that are extracted are what I would expect. However in several cases I’ve also found that terms extracted are not what I expected.

Let’s take an example of exploring a library of technical articles published by Simple-Talk. My Red Gate editor provided me with a database containing 1,388 articles. For the table containing these articles, I created an associated full-text index and enabled the Semantic Search feature. The table name containing the article text is called dbo.xlaANMarticles.

So to begin with, I asked a question about which terms were most significant across the entire library of articles? I asked this question using the following query, leveraging the SEMANTICKEYPHRASETABLE function:

SELECT TOP (10)

[skp].[keyphrase],

SUM([skp].[score]) AS [totalscore]

FROM [dbo].[xlaANMarticles] AS wp

INNER JOIN SEMANTICKEYPHRASETABLE(xlaANMarticles, article) AS [skp]

ON [wp].[articleid] = [skp].[document_key]

GROUP BY [skp].[keyphrase]

ORDER BY SUM([skp].[score]) DESC;

This query returned the following 10 rows:

keyphrase	totalscore
font-family	883.2972255
font	844.5126802
font-weight	789.5226674
margin-left	775.4817398
font-size	763.5196039
span	720.4189129
nbsp	701.175603
background-color	689.4987194
padding-left	687.912422
padding	682.9183152

As you can see – we’re getting noise. The articles are stored in the table with the original article text and associated HTML metadata. So before we can even see the relevant keywords, I needed to filter out the noise. For example, I created a full-text stoplist and added several “noise” terms:

CREATE FULLTEXT STOPLIST [slArticles];

ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-family' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'font' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-weight' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'margin-left' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'span' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'nbsp' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'background-color' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding-left' LANGUAGE 'English';

ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding' LANGUAGE 'English';

I ended up having to add several stopwords, manually reviewing what I knew was noise given the common subject-matter of the articles. Some terms like “table” or “column” were ambiguous. Were we talking about HTML or SQL in that context? There was not reliable way to know programmatically.

Also, as a short-cut, I scripted out the scripting of the ALTER FULLTEXT STOPLIST and then added the (more) obvious metadata choices:

SELECT TOP (100)

'ALTER FULLTEXT STOPLIST [slArticles] ADD ''' + skp.keyphrase +

''' LANGUAGE ''English'';',

SUM([skp].[score]) AS [totalscore]

FROM [dbo].[xlaANMarticles] AS wp

INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS skp

ON [wp].[articleid] = [skp].[document_key]

GROUP BY [skp].[keyphrase]

ORDER BY SUM([skp].[score]) DESC;

After adding the various stopwords, I made sure that my stoplist was associated with my full-text index. For example:

CREATE FULLTEXT INDEX ON [dbo].[xlaANMarticles]

([article] LANGUAGE 1033 STATISTICAL_SEMANTICS)

KEY INDEX [PK_articles]

ON [ft_SimpleTalkArticles]

WITH STOPLIST = [slArticles];

After adding the stoplist and rebuilding the catalog, I saw the following top results:

keyphrase	totalscore
sql	483.0557612
database	224.5216756
select	156.4048312
query	126.7041208
css	124.2681781
type	110.4776265
databases	99.8270269
varchar	96.96654268
your	96.69461793
microsoft	94.31538226

As you can see, we’re getting closer to “relevant” terms – but there was still some noise. Including “CSS”, “your” and “Microsoft”. I decided to spend a few more minutes removing additional noise terms, including that noise that was related to the subject matter, for example – “Microsoft” may not be noise in some scenarios, but most certainly isn’t a helpful term when looking for statistically significant terms from a Microsoft-centric publication website.

My new top 10 was as follows:

keyphrase	totalscore
sql	483.2933343
database	225.7992599
select	158.2329725
query	127.7730246
databases	100.7824847
script	70.22023614
log	61.72487602
column	61.24853854
developers	60.87272082

Is this helpful compared to a standard search engine? It seems unlikely at this point.

What about the top significant terms for a specific article? I tested the following query against an article on Availability Group wait statistics I did back in May of 2012:

SELECT TOP (15)

[skp].[keyphrase],

SUM([skp].[score]) AS [totalscore]

FROM [dbo].[xlaANMarticles] AS [wp]

INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS [skp]

ON [wp].[articleid] = [skp].[document_key]

WHERE [headline] = 'A first look at SQL Server 2012 Availability Group Wait Statistics'

GROUP BY [skp].[keyphrase]

ORDER BY SUM([skp].[score]) DESC;

In this case I saw the following terms scoped to just my article:

keyphrase	totalscore
nvarchar	0.437133938
sql	0.40759635
xel	0.391592562
failover	0.38552931
sqlserver	0.367718428
waitfor	0.352442741
replica	0.348848522
synchronous	0.347228736
duration	0.330882519
concurrency	0.321891427
asynchronous	0.311569184
wait	0.30677563
availability	0.299674034
desc	0.297606289
replicas	0.296193898

How relevant were these terms? Certainly “replica”, “synchronous”, “asynchronous”, “wait”, “availability”, and “replicas” fall into that category. But conspicuously missing are the various wait types themselves. For example, I would have expected to see wait types such as HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and WRITELOG. Looking through the top 100 key phrases (the capped term list), I didn’t see these phrases surfaced as meaningful, even though from a search perspective people are likely to use them as key terms.

In addition to the missing relevant terms, we’re seeing a key limitation with using unigrams because we cannot represent the key phrase “wait statistics” as something that is likely the most relevant term in the entire paper. You can also see how subjective “relevant” is when we’re talking about specific industries or subject areas. How do you output terms that are as relevant for SQL Server as they are for the construction, retail, and medical industries? I’d see this as a very difficult task to achieve without some kind of human intervention.

Another example includes a query for the top terms from Jonathan Kehayias’ “Optimizing tempdb configuration with SQL Server 2012 Extended Events” article:

keyphrase	totalscore
sql	0.640196383
sqlserver	0.543729305
in-memory	0.534120202
contention	0.512095034
latch	0.488780767
allocation	0.485334814
gam	0.473063797
predicate	0.468191713
xel	0.466187418

Inability to have the phrase “Extended Events” is a key limiting factor – and with that aside, there was no identification of “tempdb” as a statistically significant word. There were plenty of other relevant words like “allocation”, “bitmap”, “contention”, “gam” (but no “sgam” or “pfs” terms interestingly enough given the frequency), “in-memory”, “histogram” – but would they be relevant in absence of the core subject of the paper?

In conclusion, I feel Semantic Search is an interesting first “alpha” version and I really do hope it is further cultivated by the Microsoft product team. I’m also interesting in hearing how it is applied in real-world scenarios. There may be applications that can benefit from it today, and I’d like to understand where the current boundaries of utility exist. If there is a next version, ideally it includes support for true key phrases. I’d also like to see the ability to extend the Semantic Search model to support industry terminology and custom term relevance weighting. That involves human-intervention of course, which isn’t ideal for various scenarios, but I’d still be willing to do some manual intervention in order in order to find relevant phrases in a library of content.

This document contains proprietary information and is protected by copyright law.

Register for Simple Talk

Exploring Semantic Search Key Term Relevance

Article tags

About the author

Fixing Gatekeeper Row Cardinality Estimate Issues

A first look at SQL Server 2012 Availability Group Wait Statistics

Exploring Semantic Search Key Term Relevance

Article tags

Recommended

Subscribe for more

About the author

Fixing Gatekeeper Row Cardinality Estimate Issues

A first look at SQL Server 2012 Availability Group Wait Statistics