Text Indexes

The Oracle database has many features that are not spoken of frequently and a recent note by Connor McDonald reminded me that there are a couple of notes about optimising Text indexes that have been on my “must write” list for several years.

If you’re not familiar with what text indexes can do for you, here’s a thumbnail sketch:
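
(A sketch – the exact declarations and the score threshold here are illustrative rather than copies of my original script.)

    create table script_handles (
            id          number(8,0)   not null,
            file_name   varchar2(80)  not null,
            file_type   varchar2(10),
            file_handle bfile
    );

    -- ... code to load one row per file goes here (not shown) ...

    create index sh_ix_handle on script_handles(file_handle)
    indextype is ctxsys.context
    parameters('format column file_type memory 128M');

    select
            score(1), file_name
    from
            script_handles
    where
            contains(file_handle, 'partition and parallel', 1) > 0
    and     score(1) > 20
    order by
            score(1) desc
    ;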

I haven’t shown you the bit of code between the create table and the create index, but it populates the table with the SQL scripts in my library. Each row has been given a unique id and each file appears by name (file_name) and as a “binary file lob” (file_handle). I’ve also included a file_type column which I’ve set to the value ‘TEXT’ for every row.

The second statement creates a text index on the file_handle column, and I’ve supplied a couple of parameters telling Oracle how much (PGA) memory to allocate to the task of creating the index; I’ve also promised that each file can safely be analyzed as a pure text file (rather than letting Oracle spend time working out what type of file each one is) by specifying that the format of each object is recorded in the file_type column. Given the way I’ve loaded the table I could also create a text index on the file_name column by changing the parameters slightly.

The final statement is a simple example of how I can now find filenames for files that contain both the words “partition” and “parallel”, listing only the ones where both words appear fairly commonly. The contains() operator tells Oracle to use the text index on the file_handle column, tells it what to search for, and attaches a numeric tag to each search. The operator returns a non-zero value if the index finds any matches, and the score() function in the select list returns a measure (an integer between 1 and 100) reporting the relative frequency of the searched expression in each referenced document.

I won’t go into all the details of how this works, or what preparations you might have to make to use the features you want (there’s an entire manual for that); I’ll only mention that I granted execute on the package ctxsys.ctx_ddl to my standard role before I did anything else.
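
In outline (the role name here is hypothetical):

    grant execute on ctxsys.ctx_ddl to my_standard_role;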

Performance Considerations

I mentioned Connor McDonald’s note at the start of the article. In it he pointed out that you could define a Text index to “sync on commit”, i.e. always be perfectly up to date with the contents of the base table; however he also demonstrated that the insert statement suffers a noticeable overhead when doing this – one that you would notice if you were constantly making small changes to the data. For some activity the impact might not be significant; for other activity you may prefer to allow Oracle to accumulate a lot of changes and then bring the index up to date through a batch process.
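
To put that in concrete terms, here’s a sketch of the two extremes (the memory argument to sync_index is just an example):

    -- always up to date: sync as part of every commit
    create index sh_ix_handle on script_handles(file_handle)
    indextype is ctxsys.context
    parameters('format column file_type sync (on commit)');

    -- batch maintenance: let changes accumulate, then sync on demand
    begin
            ctx_ddl.sync_index('sh_ix_handle', '128M');
    end;
    /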

It’s not just the overhead of loading the data that matters, though: unlike ordinary B-tree indexes, a Text index can become extremely inefficient (in fact the word “fragmented” is actually appropriate) because of the way you choose to maintain it. To explain this we have to take a little look inside the index.

After I had created my index I found the following new tables in my schema:
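
    DR$SH_IX_HANDLE$I
    DR$SH_IX_HANDLE$K
    DR$SH_IX_HANDLE$N
    DR$SH_IX_HANDLE$R

(That’s the standard set of four for a context index called sh_ix_handle; depending on version you may see a few more.)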

These are the tables supporting the generic Oracle Text infrastructure. In fact, digging a little deeper, you would find that two of these tables are IOTs (index organized tables) and two of the tables hold LOBs. And none of these objects acquires statistics when you try to gather stats on the schema!

For the work we’ve done so far the most interesting tables are DR$SH_IX_HANDLE$I and DR$SH_IX_HANDLE$K, which hold 86,716 and 16,456 rows respectively. The “K” table is a cross-reference list (one of the IOTs) between the index data and my original script_handles table (holding rowids from the script_handles table!). The “I” table is the token table, holding a list of tokens that have appeared across all the files listed in my script_handles table with, for each token, a list of every file the token appeared in and a note of where it appeared. Here’s the table definition for the “I” table:
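
    create table DR$SH_IX_HANDLE$I (
            token_text  varchar2(64)   not null,
            token_type  number         not null,
            token_first number         not null,
            token_last  number         not null,
            token_count number         not null,
            token_info  blob
    )
    lob (token_info) store as (enable storage in row);

(Reconstructed from a standard context index; the exact precisions and storage clauses vary with version.)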

The token_text holds the value of the token (basically it’s likely to be a word), token_info is (as you can see) a BLOB carrying a stream of document references and locations, and token_count tells you how many times the token_text is referenced in the BLOB. With this table in mind we can finally start thinking about performance – both when creating or rebuilding an index and when using it.

To identify the documents containing a given token the first step that Oracle has to take is to find the row, or rows, that hold the token text. Ideally all the data about that token will be stored in one place – that’s not actually very likely to happen, but it is possible to do some work that minimises the degree to which the token information is scattered across the “I” table.

To build a text index Oracle will read through your set of documents extracting and maintaining a list of tokens, building a reference blob for each token as it goes. When memory is full Oracle dumps the list to disc – by writing it into the “I” table – then carries on reading through your set of documents building another set of reference blobs for each token. If you don’t give Oracle very much memory to do this job each token could turn into a large number of rows in the “I” table with just a few references in each blob. So a larger memory allocation can lead to a more efficient index.

If you are aware of this mechanism you may find some way to collate your documents so that documents with similar content are indexed at (roughly) the same time. If you can make this happen then, during the index build, each batch that Oracle processes may consist of a smaller number of tokens with longer reference blobs rather than a larger number of tokens with short reference blobs.

To demonstrate these two points I’ve run up a simple demo. I have about 4,100 scripts in my library, and I’ve loaded their filenames into a table called script_names. I then populated the script_handles table with four copies of every row in the script_names table and created the text index.

I’ve done this with two variations on each of two dimensions: small memory (8MB) vs. large memory (128MB) for the index build, and clustered data (all 4 copies of a file in consecutive rows) vs. scattered data (copies 4,100 rows apart from each other).
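
A sketch of the kind of thing I mean – the directory name is hypothetical, and it’s the order by clause that distinguishes the two cases:

    -- clustered: all 4 copies of each file in consecutive rows
    insert into script_handles (id, file_name, file_type, file_handle)
    select
            rownum, s.file_name, 'TEXT',
            bfilename('SCRIPT_DIR', s.file_name)
    from
            script_names s,
            (select level copy# from dual connect by level <= 4) c
    order by
            s.file_name, c.copy#
    ;

    -- scattered: order by c.copy#, s.file_name instead, so the copies
    -- of any one file end up about 4,100 rows apart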

Since the data set was fairly small and I was running on a high-end laptop the timings didn’t show much variation, but there are some interesting results. The key figures I’ve picked are:

  • Time to create index in seconds
  • Number of rows in token table
  • Number of tokens with more than 4 rows in the token table
  • Number of “full” blocks in the token table (dbms_space package – see the sketch below)
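
For reference, the “full blocks” figure can be picked up with a call like this (a sketch; it assumes the token table is in an ASSM tablespace):

    declare
            v_unf  number;  v_unfb  number;
            v_fs1  number;  v_fs1b  number;
            v_fs2  number;  v_fs2b  number;
            v_fs3  number;  v_fs3b  number;
            v_fs4  number;  v_fs4b  number;
            v_full number;  v_fullb number;
    begin
            dbms_space.space_usage(
                    segment_owner      => user,
                    segment_name       => 'DR$SH_IX_HANDLE$I',
                    segment_type       => 'TABLE',
                    unformatted_blocks => v_unf,  unformatted_bytes => v_unfb,
                    fs1_blocks         => v_fs1,  fs1_bytes         => v_fs1b,
                    fs2_blocks         => v_fs2,  fs2_bytes         => v_fs2b,
                    fs3_blocks         => v_fs3,  fs3_bytes         => v_fs3b,
                    fs4_blocks         => v_fs4,  fs4_bytes         => v_fs4b,
                    full_blocks        => v_full, full_bytes        => v_fullb
            );
            dbms_output.put_line('Full blocks: ' || v_full);
    end;
    /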

First the small memory model:

                                Clustered      Scattered
Time to build (sec)                 28.28          29.21
Rows in token table               144,307        382,603
Tokens with more than 4 rows        3,206         10,836
Full blocks                         3,961          5,904

The difference in timing of one second isn’t terribly significant at this scale – but it was interesting to note that, according to the session statistics and wait events, Oracle spent 15 seconds opening and closing the BFILEs, 15 seconds on CPU, and 5 seconds on various db file I/O events: that’s more than the elapsed time, so somewhere it was double-counting.

Important differences do appear in the fragmentation of the token information and the total space used in the token table. (The BLOB is declared as “enable storage in row” and many of the blobs reached a size that left lots of 8KB blocks “full” when they were still nearly half empty. I’ll come back to that later.) If your memory is small relative to the data you have to process then you’re not going to get the most efficient index possible.

Increasing the memory for the build to 128MB (the maximum allowed is 256GB; the summed size of the 16,400 files was about 88MB) I got the following results:

                                Clustered      Scattered
Time to build (sec)                 28.11          29.73
Rows in token table                86,718         87,692
Tokens with more than 4 rows          233            235
Full blocks                         3,521          3,544

With a large enough memory there’s very little difference in storage between the clustered and the scattered data, though it’s just a little odd that the larger memory resulted in a slight increase in CPU for the scattered data.

The interesting result that stands out here is that although we appear to have enough memory to process the entire dataset in a single pass we still have tokens whose information has been split into more than 4 pieces – this is strange, and made me wonder if perhaps Oracle’s algorithm restricts the token_info blobs to the “in row” limit of 3,960 bytes.

This seems to be close to the truth, but it’s not the whole truth: there were a few tokens where the token_info was much longer than 3,960 bytes, and when I examined a few specific examples it looked as if they were all extreme cases where the same token appeared literally hundreds of times in a single document.
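
A query along these lines (a sketch, using dbms_lob.getlength() to report the size of each token_info blob) picks out the tokens that have overflowed the in-row limit:

    select
            token_text, token_count,
            dbms_lob.getlength(token_info) info_bytes
    from
            dr$sh_ix_handle$i
    where
            dbms_lob.getlength(token_info) > 3960
    order by
            info_bytes desc
    ;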

The most extreme case came from a file called dbwr_writes.sql (4 copies, remember): I had copied a big chunk of a redo log file dump into it to capture 8,800 “Block Written” change vectors (“Block” and “Written” were two more of the anomalous index entries). It looks as if Oracle keeps all the information about the appearances of a token within a single document in a single row of the token table, even if this means generating a blob that exceeds the in-line limit and has to be stored out of line.

So Oracle will create multiple rows for a single token even if it has read all the data it needs into memory in a single pass, and this may make you think that there’s no great point in trying to optimise the index by allocating lots of memory for the build. Remember, though, that if all the rows for a token have been created at the same time they are likely to be stored in blocks that are very close to each other, so you may still get some benefit from caching effects.

Waste Space

Checking a few examples of tokens spreading over multiple rows reminded me of another important aspect of optimising text indexes. Bear in mind that my “documents” are just a set of SQL scripts – so how much of the token table is devoted to words like REM that appear in virtually every script?
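
A sketch of the kind of query that answers the question (the word list is mine):

    select
            token_text, count(*) row_count,
            sum(dbms_lob.getlength(token_info)) info_bytes
    from
            dr$sh_ix_handle$i
    where
            token_text in ('REM', 'PROMPT')
    group by
            token_text
    order by
            info_bytes desc
    ;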

In my case the answer was 1.3MB (out of a total of 18MB of mapping information) that I really don’t need. I have lots of remarks (and lots of lines with the word PROMPT at the start) in my scripts, and I’m not really going to want to find “all SQL scripts with the word REM in them”. So Oracle allows me to define a “stop list” to eliminate all the words I’m not interested in. If you want to see the default stop list you can query the table ctxsys.dr$stopword – there are 114 words (including, for example, the word “the”) in the list.

I’m going to create a stoplist with a few extras; this takes two steps – first we create the stoplist, then we modify the “create index” statement to include it:
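
Here’s a sketch of the code – the stoplist name and the extra words are mine, and I’ve guessed at the column name in ctxsys.dr$stopword, so check it on your own system:

    begin
            ctx_ddl.create_stoplist('my_stop_list', 'BASIC_STOPLIST');

            -- a few extras of my own
            ctx_ddl.add_stopword('my_stop_list', 'rem');
            ctx_ddl.add_stopword('my_stop_list', 'prompt');

            -- copy the default stoplist (word column name assumed)
            for r in (select spw_word from ctxsys.dr$stopword) loop
                    ctx_ddl.add_stopword('my_stop_list', r.spw_word);
            end loop;

            -- the one and only stopclass
            ctx_ddl.add_stopclass('my_stop_list', 'NUMBERS');
    end;
    /

    create index sh_ix_handle on script_handles(file_handle)
    indextype is ctxsys.context
    parameters('format column file_type memory 128M stoplist my_stop_list');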

As you can see, the change to the create index statement is minor – I’ve just added a STOPLIST reference to the parameter list. To create the stoplist I’ve made a few calls to the ctx_ddl package: first I create a stoplist with a fairly arbitrary name, then I add a list of words to that stoplist. After that I’ve used a little inside information to read the “ctxsys.default_stoplist” and copy it into my stoplist, and finally I’ve added the (one and only) “stopclass” called “Numbers” (which catches everything that looks like a numeric) to my stoplist. A stoplist can hold up to 4,095 entries – which is a fairly large percentage of the average number of words a typical person uses, so it could keep a lot of redundant data out of your index.

After dropping and creating the index with this stoplist this is the set of statistics that appeared from the “well clustered, large memory” test:

                               Stoplisted       Original
Time to build (sec)                 28.45          28.11
Rows in token table                45,109         86,718
Tokens with more than 4 rows          202            233
Full blocks                         2,464          3,521

The difference in volume is significant – and in a serious production system I might even check for, and include, anomalous entries like the “BFT” and “AFN” produced by my redo log file dump – anything that allows Oracle to process more documents in the memory I’ve allocated before dumping to disc is a good thing.

Conclusion

For a large volume of text you probably won’t be able to build an entire text index in memory, but the more memory you can allocate to the task the more efficient the final index is likely to be.

If you can’t do a complete in-memory build, but you can find a way of grouping documents with similar content as they are loaded, you may find (purely as a side effect) that you get fewer, larger index entries for each token.

Finally, remember that you can define a list of stopwords that won’t be included in the index – and words that appear very frequently are probably exactly the words that you won’t be interested in indexing, so eliminating them may make a big difference to how well Oracle can use the available memory when creating the index.