Exploring Semantic Search Key Term Relevance

SQL Server's 'Semantic Search' feature seemed an exciting feature when first shown. Was it really true that Microsoft had come up with a system to rival the industry-leaders, one that could extract the contextual meaning of terms in text, or automatically categorise the subject matter of text? On first inspection, it seems unlikely.

Semantic Search is the new feature introduced in SQL Server 2012 that allows you to extract statistically relevant words, referred to by Microsoft as ‘key terms’, from unstructured documents stored in SQL Server. In addition to key term extraction, you can also find documents that are similar to one another based on the shared relevant key terms.

Note: I say “key terms” but you may see official Microsoft documentation call them “key phrases.” This is a bit of a misnomer because as of SQL Server 2012 RTM, these “phrases” are unigrams (single word terms). I do hope we’ll see this expand in future versions to support bigrams, trigrams and beyond.

If you’re wondering whether you can use Semantic Search for your own application requirements, I strongly recommend you test this functionality using your own data. The viability of this new feature depends on the nature of the data being explored. For example, if you’re looking to create an application that pulls unigrams from a set of customer documents, you’ll want to make sure that the terms that were extracted are truly relevant to the subject matter and not assume that all subject areas have equivalent support by the Semantic Search feature.

I’ve experimented with various types of SQL Server related documents in the past and I do see that in some cases the key terms that are extracted are what I would expect. However in several cases I’ve also found that terms extracted are not what I expected.

Let’s take an example of exploring a library of technical articles published by Simple-Talk. My Red Gate editor provided me with a database containing 1,388 articles. For the table containing these articles, I created an associated full-text index and enabled the Semantic Search feature. The table name containing the article text is called dbo.xlaANMarticles.

So to begin with, I asked a question about which terms were most significant across the entire library of articles? I asked this question using the following query, leveraging the SEMANTICKEYPHRASETABLE function:

This query returned the following 10 rows:

keyphrase totalscore
font-family 883.2972255
font 844.5126802
font-weight 789.5226674
margin-left 775.4817398
font-size 763.5196039
span 720.4189129
nbsp 701.175603
background-color 689.4987194
padding-left 687.912422
padding 682.9183152

As you can see – we’re getting noise. The articles are stored in the table with the original article text and associated HTML metadata. So before we can even see the relevant keywords, I needed to filter out the noise. For example, I created a full-text stoplist and added several “noise” terms:

I ended up having to add several stopwords, manually reviewing what I knew was noise given the common subject-matter of the articles. Some terms like “table” or “column” were ambiguous. Were we talking about HTML or SQL in that context? There was not reliable way to know programmatically.

Also, as a short-cut, I scripted out the scripting of the ALTER FULLTEXT STOPLIST and then added the (more) obvious metadata choices:

After adding the various stopwords, I made sure that my stoplist was associated with my full-text index. For example:

After adding the stoplist and rebuilding the catalog, I saw the following top results:

keyphrase totalscore
sql 483.0557612
database 224.5216756
select 156.4048312
query 126.7041208
css 124.2681781
type 110.4776265
databases 99.8270269
varchar 96.96654268
your 96.69461793
microsoft 94.31538226

As you can see, we’re getting closer to “relevant” terms – but there was still some noise. Including “CSS”, “your” and “Microsoft”. I decided to spend a few more minutes removing additional noise terms, including that noise that was related to the subject matter, for example – “Microsoft” may not be noise in some scenarios, but most certainly isn’t a helpful term when looking for statistically significant terms from a Microsoft-centric publication website.

My new top 10 was as follows:

keyphrase totalscore
sql 483.2933343
database 225.7992599
select 158.2329725
query 127.7730246
databases 100.7824847
script 70.22023614
log 61.72487602
column 61.24853854
developers 60.87272082

Is this helpful compared to a standard search engine? It seems unlikely at this point.

What about the top significant terms for a specific article? I tested the following query against an article on Availability Group wait statistics I did back in May of 2012:

In this case I saw the following terms scoped to just my article:

keyphrase totalscore
nvarchar 0.437133938
sql 0.40759635
xel 0.391592562
failover 0.38552931
sqlserver 0.367718428
waitfor 0.352442741
replica 0.348848522
synchronous 0.347228736
duration 0.330882519
concurrency 0.321891427
asynchronous 0.311569184
wait 0.30677563
availability 0.299674034
desc 0.297606289
replicas 0.296193898

How relevant were these terms? Certainly “replica”, “synchronous”, “asynchronous”, “wait”, “availability”, and “replicas” fall into that category. But conspicuously missing are the various wait types themselves. For example, I would have expected to see wait types such as HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and WRITELOG. Looking through the top 100 key phrases (the capped term list), I didn’t see these phrases surfaced as meaningful, even though from a search perspective people are likely to use them as key terms.

In addition to the missing relevant terms, we’re seeing a key limitation with using unigrams because we cannot represent the key phrase “wait statistics” as something that is likely the most relevant term in the entire paper. You can also see how subjective “relevant” is when we’re talking about specific industries or subject areas. How do you output terms that are as relevant for SQL Server as they are for the construction, retail, and medical industries? I’d see this as a very difficult task to achieve without some kind of human intervention.

Another example includes a query for the top terms from Jonathan Kehayias’ “Optimizing tempdb configuration with SQL Server 2012 Extended Events” article:

keyphrase totalscore
sql 0.640196383
sqlserver 0.543729305
in-memory 0.534120202
contention 0.512095034
latch 0.488780767
allocation 0.485334814
gam 0.473063797
predicate 0.468191713
xel 0.466187418

Inability to have the phrase “Extended Events” is a key limiting factor – and with that aside, there was no identification of “tempdb” as a statistically significant word. There were plenty of other relevant words like “allocation”, “bitmap”, “contention”, “gam” (but no “sgam” or “pfs” terms interestingly enough given the frequency), “in-memory”, “histogram” – but would they be relevant in absence of the core subject of the paper?

In conclusion, I feel Semantic Search is an interesting first “alpha” version and I really do hope it is further cultivated by the Microsoft product team. I’m also interesting in hearing how it is applied in real-world scenarios. There may be applications that can benefit from it today, and I’d like to understand where the current boundaries of utility exist. If there is a next version, ideally it includes support for true key phrases. I’d also like to see the ability to extend the Semantic Search model to support industry terminology and custom term relevance weighting. That involves human-intervention of course, which isn’t ideal for various scenarios, but I’d still be willing to do some manual intervention in order in order to find relevant phrases in a library of content.