Semantic Search is the new feature introduced in SQL Server 2012 that allows you to extract statistically relevant words, referred to by Microsoft as ‘key terms’, from unstructured documents stored in SQL Server. In addition to key term extraction, you can also find documents that are similar to one another based on the shared relevant key terms.
Note: I say “key terms” but you may see official Microsoft documentation call them “key phrases.” This is a bit of a misnomer because as of SQL Server 2012 RTM, these “phrases” are unigrams (single word terms). I do hope we’ll see this expand in future versions to support bigrams, trigrams and beyond.
If you’re wondering whether you can use Semantic Search for your own application requirements, I strongly recommend you test this functionality using your own data. The viability of this new feature depends on the nature of the data being explored. For example, if you’re looking to create an application that pulls unigrams from a set of customer documents, you’ll want to make sure that the terms that were extracted are truly relevant to the subject matter and not assume that all subject areas have equivalent support by the Semantic Search feature.
I’ve experimented with various types of SQL Server related documents in the past and I do see that in some cases the key terms that are extracted are what I would expect. However in several cases I’ve also found that terms extracted are not what I expected.
Let’s take an example of exploring a library of technical articles published by Simple-Talk. My Red Gate editor provided me with a database containing 1,388 articles. For the table containing these articles, I created an associated full-text index and enabled the Semantic Search feature. The table name containing the article text is called dbo.xlaANMarticles.
So to begin with, I asked a question about which terms were most significant across the entire library of articles? I asked this question using the following query, leveraging the SEMANTICKEYPHRASETABLE function:
1 2 3 4 5 6 7 8 9 |
SELECT TOP (10) [skp].[keyphrase], SUM([skp].[score]) AS [totalscore] FROM [dbo].[xlaANMarticles] AS wp INNER JOIN SEMANTICKEYPHRASETABLE(xlaANMarticles, article) AS [skp] ON [wp].[articleid] = [skp].[document_key] GROUP BY [skp].[keyphrase] ORDER BY SUM([skp].[score]) DESC; GO |
This query returned the following 10 rows:
keyphrase | totalscore |
font-family | 883.2972255 |
font | 844.5126802 |
font-weight | 789.5226674 |
margin-left | 775.4817398 |
font-size | 763.5196039 |
span | 720.4189129 |
nbsp | 701.175603 |
background-color | 689.4987194 |
padding-left | 687.912422 |
padding | 682.9183152 |
As you can see – we’re getting noise. The articles are stored in the table with the original article text and associated HTML metadata. So before we can even see the relevant keywords, I needed to filter out the noise. For example, I created a full-text stoplist and added several “noise” terms:
1 2 3 4 5 6 7 8 9 10 11 |
CREATE FULLTEXT STOPLIST [slArticles]; ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-family' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'font' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'font-weight' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'margin-left' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'span' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'nbsp' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'background-color' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding-left' LANGUAGE 'English'; ALTER FULLTEXT STOPLIST [slArticles] ADD 'padding' LANGUAGE 'English'; GO |
I ended up having to add several stopwords, manually reviewing what I knew was noise given the common subject-matter of the articles. Some terms like “table” or “column” were ambiguous. Were we talking about HTML or SQL in that context? There was not reliable way to know programmatically.
Also, as a short-cut, I scripted out the scripting of the ALTER FULLTEXT STOPLIST and then added the (more) obvious metadata choices:
1 2 3 4 5 6 7 8 9 10 |
SELECT TOP (100) 'ALTER FULLTEXT STOPLIST [slArticles] ADD ''' + skp.keyphrase + ''' LANGUAGE ''English'';', SUM([skp].[score]) AS [totalscore] FROM [dbo].[xlaANMarticles] AS wp INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS skp ON [wp].[articleid] = [skp].[document_key] GROUP BY [skp].[keyphrase] ORDER BY SUM([skp].[score]) DESC; GO |
After adding the various stopwords, I made sure that my stoplist was associated with my full-text index. For example:
1 2 3 4 5 6 |
CREATE FULLTEXT INDEX ON [dbo].[xlaANMarticles] ([article] LANGUAGE 1033 STATISTICAL_SEMANTICS) KEY INDEX [PK_articles] ON [ft_SimpleTalkArticles] WITH STOPLIST = [slArticles]; GO |
After adding the stoplist and rebuilding the catalog, I saw the following top results:
keyphrase | totalscore |
sql | 483.0557612 |
database | 224.5216756 |
select | 156.4048312 |
query | 126.7041208 |
css | 124.2681781 |
type | 110.4776265 |
databases | 99.8270269 |
varchar | 96.96654268 |
your | 96.69461793 |
microsoft | 94.31538226 |
As you can see, we’re getting closer to “relevant” terms – but there was still some noise. Including “CSS”, “your” and “Microsoft”. I decided to spend a few more minutes removing additional noise terms, including that noise that was related to the subject matter, for example – “Microsoft” may not be noise in some scenarios, but most certainly isn’t a helpful term when looking for statistically significant terms from a Microsoft-centric publication website.
My new top 10 was as follows:
keyphrase | totalscore |
sql | 483.2933343 |
database | 225.7992599 |
select | 158.2329725 |
query | 127.7730246 |
databases | 100.7824847 |
script | 70.22023614 |
log | 61.72487602 |
column | 61.24853854 |
developers | 60.87272082 |
Is this helpful compared to a standard search engine? It seems unlikely at this point.
What about the top significant terms for a specific article? I tested the following query against an article on Availability Group wait statistics I did back in May of 2012:
1 2 3 4 5 6 7 8 9 10 |
SELECT TOP (15) [skp].[keyphrase], SUM([skp].[score]) AS [totalscore] FROM [dbo].[xlaANMarticles] AS [wp] INNER JOIN SEMANTICKEYPHRASETABLE([xlaANMarticles], [article]) AS [skp] ON [wp].[articleid] = [skp].[document_key] WHERE [headline] = 'A first look at SQL Server 2012 Availability Group Wait Statistics' GROUP BY [skp].[keyphrase] ORDER BY SUM([skp].[score]) DESC; GO |
In this case I saw the following terms scoped to just my article:
keyphrase | totalscore |
nvarchar | 0.437133938 |
sql | 0.40759635 |
xel | 0.391592562 |
failover | 0.38552931 |
sqlserver | 0.367718428 |
waitfor | 0.352442741 |
replica | 0.348848522 |
synchronous | 0.347228736 |
duration | 0.330882519 |
concurrency | 0.321891427 |
asynchronous | 0.311569184 |
wait | 0.30677563 |
availability | 0.299674034 |
desc | 0.297606289 |
replicas | 0.296193898 |
How relevant were these terms? Certainly “replica”, “synchronous”, “asynchronous”, “wait”, “availability”, and “replicas” fall into that category. But conspicuously missing are the various wait types themselves. For example, I would have expected to see wait types such as HADR_WORK_QUEUE, HADR_SYNC_COMMIT, and WRITELOG. Looking through the top 100 key phrases (the capped term list), I didn’t see these phrases surfaced as meaningful, even though from a search perspective people are likely to use them as key terms.
In addition to the missing relevant terms, we’re seeing a key limitation with using unigrams because we cannot represent the key phrase “wait statistics” as something that is likely the most relevant term in the entire paper. You can also see how subjective “relevant” is when we’re talking about specific industries or subject areas. How do you output terms that are as relevant for SQL Server as they are for the construction, retail, and medical industries? I’d see this as a very difficult task to achieve without some kind of human intervention.
Another example includes a query for the top terms from Jonathan Kehayias’ “Optimizing tempdb configuration with SQL Server 2012 Extended Events” article:
keyphrase | totalscore |
sql | 0.640196383 |
sqlserver | 0.543729305 |
in-memory | 0.534120202 |
contention | 0.512095034 |
latch | 0.488780767 |
allocation | 0.485334814 |
gam | 0.473063797 |
predicate | 0.468191713 |
xel | 0.466187418 |
Inability to have the phrase “Extended Events” is a key limiting factor – and with that aside, there was no identification of “tempdb” as a statistically significant word. There were plenty of other relevant words like “allocation”, “bitmap”, “contention”, “gam” (but no “sgam” or “pfs” terms interestingly enough given the frequency), “in-memory”, “histogram” – but would they be relevant in absence of the core subject of the paper?
In conclusion, I feel Semantic Search is an interesting first “alpha” version and I really do hope it is further cultivated by the Microsoft product team. I’m also interesting in hearing how it is applied in real-world scenarios. There may be applications that can benefit from it today, and I’d like to understand where the current boundaries of utility exist. If there is a next version, ideally it includes support for true key phrases. I’d also like to see the ability to extend the Semantic Search model to support industry terminology and custom term relevance weighting. That involves human-intervention of course, which isn’t ideal for various scenarios, but I’d still be willing to do some manual intervention in order in order to find relevant phrases in a library of content.
Load comments