Full Text Searches on Documents in FileTables

Comments 0

Share to social media

Along with SQL 2012 came FileTable technology, which simplifies the management of large volumes of structured and unstructured documents. In this article Feodor describes the technology and how it can be valuable.

In this article I will cover the following topics:

  • What is FileTable technology?
  • How to set it up?
  • How to prepare the indexing for the FileTable?

After the introduction to FileTable, I will give you the code that provides an example that, if you wish, you can run yourself, using Wikipedia articles used together with FileTable and Full-text search. The example serves to illustrate

  • How to get out the XML from rows and save to FileTable
  • What files can FTS deal with
  • How to use FTS on FileTable

What is FileTable technology?

In short, FileTable is a new technology closely related to Filestream in SQL Server. Filetable enables unstructured (and semi-structured) data files to be stored and managed as part of a SQL Server database, and at the same time the files can be accessible via the Windows file system.

Furthermore, the Filestream and FileTable technologies can be combined with powerful Full-Text and Semantic searches in order to allow the user to take full control of document handling in an office.

The task at hand and how to handle overabundance of documents

For the purpose of this article I will use an existing database called WikiDB which contains about 12 million Wikipedia articles stored in XML format in an XML datatype column in the WikiDocument table.

In this exercise we will create a new database called WikiFilesDB, which utilizes the Filestream, Filetable and Full-Text searches, then we will export 50,000 of the articles from the WikiDocument table and will save them to WikiFIles table.

After the migration we will create a Full-Text search catalog and run some queries on the documents.

How to set up FileTable in a database

If you have not yet enabled FileStream for your SQL Server, you’ll need to do it via the SQL Server Configuration Manager, or maybe execute some PowerShell to do it:

Then, in TSQL, you can run…

And now let’s create a table which will contain the references to the files:

How to prepare the indexing for the FileTable

Note that during the creation of the FileTable I have explicitly specified the name of the Unique constraint for the StreamId column by specifying the optional FILETABLE_STREAMID_UNIQUE_CONSTRAINT_NAME=UQ_stream_id.

This way, we get a unique constraint created together with a non-clustered index which later on we can use for the Full-Text index.

In other words, instead of creating an extra index like this:

We can actually skip the creation of the extra index altogether and create the objects like this:

How to get out the XML from rows and save to FileTable?

Each application using FileTables will have a different way of stocking the data, depending on where it comes from, but in our example we’ll use a database which contains about 12 million wikipedia articles. For this database called WikiDB, the idea was to take the entire data dump from Wiki: http://dumps.wikimedia.org/enwiki/latest/. The files are 27 and are between

enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2

to

enwiki-latest-pages-articles27.xml-p029625017p037804211.bz2

The Wiki files are in XML format, and they contain multiple <page> XML nodes and each of them corresponds to a Wiki page.

The WikiDB database has a table which contains about 12 million rows and each one corresponds to a page.

The idea was to create an easy way to store the entire Wikipedia on my laptop. To do that, we’ll need to take all Wikipedia articles and to store them on the filesystem as part of a database which utilizes the FileTable technology.

Furthermore, the intention is to use Full-text search to search for words and phrases through the Wiki pages, while they are still available to other services.

At a first sight there is a slight problem with this task: how do we get the XML out of the SQL Server table? After all, SQL Server is notorious for having the ease of import of XML , yet a great difficulty in exporting it back to the file system. However with the FileTable technology it is not hard at all. All it takes is a simple query. We’ll start off by importing just one file just to ‘test the plumbing’.

What this code does is to simply take a row from the [WikiDB].[dbo].[WikiDocuments] table, converts the XML datatype to varbinary and saves it to the [WikiFilesDB].[dbo].[WikiFiles].

And if we right-click on the FileTable in WikiFilesDB, we can open an Windows Explorer window and see the file we just imported:

1718-5e936aae-5ea7-4cf4-93a7-0aa6208f137

From here on, I can copy the XML file and paste it to a different folder, I can even modify the XML file with the use of external editor.

But how can we search the contents?

What files can FTS deal with?

The first question is: what file types can Full-text search handle?

The answer is just a matter of executing the following DMV:

There are 50 filetypes supported at this time, and XML is one of them.

How to use FTS on FileTable?

Earlier in this article I already created the Full-Text index on the FileTable.

Now I will actually execute a loop which inserts the XML files to the FileTable:

For the purpose of this article, I will work with only 50,000 XML files, and in a later article I will explain the performance considerations for FileTable, including the scalability and performance of various file counts.

After completing the export and import of XML files to WikiFilesDB, it is time to re-build the Full-Text search catalog.

Before we hit the ‘Rebuild’ button, let’s prepare a script which will help us find out the status and how long it took to rebuild the catalog. For this purpose we are going to look into the default trace and I will slightly modify a script from my previous article “The default trace in SQL Server – the power of performance and security auditing“:

Now let’s start the catalog rebuild with the following script and then let’s look at the default trace:

When we execute the default trace query, we will get the start time of the Full-Text catalog rebuild:

1718-2152135e-9f2f-434c-be4b-3acc3445a2d

Later on we can run the query again, and we will see the finish time of the Full-Text rebuild:

1718-971c6490-a81f-4f2e-915e-d90a71791bd

And here is another way to see the properties and the status of the Full-Text Catalog:

When we execute this query, we will get the following result:

1718-9d12d7c4-4500-42f9-aa56-dee31b97e2d

And finally, let’s write some queries which will allow us to find words and phrases from the XML files we just imported. Keep in mind that the Full-Text search in this case is ignoring the markup nodes of the XML files and only the content is indexed.

Here are few various Full-Text queries:

All of the above queries return only the handle to the documents and their properties, i.e. the contents as a binary string, the document name, the filesystem properties.

This is perfectly fine, since in a real-world situation we would like to have nothing more than a handle and we would like to process the documents on the application side and not to use SQL Server to return the entire dataset.

The result from the queries above looks like this:

1718-cd5833c2-6c86-4b7c-bee6-f7bccdea3a8

If we really wanted to see the documents’ contents in SQL Server Management Studio, we could run the following query:

And the result will look like this:

1718-4dd6ceca-ae13-4e7e-9831-dc91ed69c19

Conclusion

The Full-Text search combined with FILESTREAM and FileTable technologies in SQL Server 2012 can be a very powerful tool for the business. Keeping various documents stored in a central location and performing blazingly fast searches can improve the workflow for many people in the organization. Furthermore, it is very easy to export files stored within different databases and import them to FileTable storage in SQL Server, and thus making them available to the file system and other applications.

Load comments

About the author

Feodor Georgiev

See Profile

Feodor has a background of many years working with SQL Server and is now mainly focusing on data analytics, data science and R.

Over more than 15 years Feodor has worked on assignments involving database architecture, Microsoft SQL Server data platform, data model design, database design, integration solutions, business intelligence, reporting, as well as performance optimization and systems scalability.

In the past 3 years he has expanded his focus to coding in R for assignments relating to data analytics and data science.

On the side of his day to day schedule he blogs, shares tips on forums and writes articles.