Yet Another SQL Strategy for Versioned Data

There is a popular design for a database that requires a built-in audit-trail of amendments and additions, where data is never deleted, but merely superseded by a later version. Whilst this is conceptually simple, it has always made for complicated SQL for reporting the latest version of data. Alex joins the debate on the best way of doing this with an example using an indexed view and the filtered index.

In some cases we don’t update our data – we add a new version of it, and we have to keep the previous one as well. This is a well-known practice, particularly in financial systems. Although this helps maintain a good audit trail, it complicates the SQL Code for the selection of a version of the data at a particular point in time, and even just for fetching the latest version of the data. Recently, Grant Fritchey described and benchmarked several typical queries used to do this, in the article SQL Strategies for Versioned data.

I would like to describe one more approach that really shines when we are most concerned with getting the best-possible performance when selecting the latest version, even at the cost of the performance of modifications to the data and other selects.

If you haven’t read Grant’s article yet, I encourage you to read it now, before proceeding to read the rest of this article.

Adding a Column to Explicitly Indicate the Latest Versions Allows for Much Better Performance

Suppose that we store data about employees. Also, let’s assume that the date they were hired is not supposed to change, unless it was entered incorrectly. However, their name and location can change, and we need to keep the old versions. The following tables store all the required data, as well as a column named IsLastVersion, which will be explained later:

Note: I have also provided a script that populates these tables with test data. To download the script, click on the bubble above the article.

We’ve already mentioned that we’ve determined that our highest priority will be in selecting the latest versions. To speed up the selection process as much as possible, we have added a column named IsLastVersion. We shall use this new column to create two database objects that store only the latest versions: an indexed view and/or a filtered index, depending on the queries we are going to optimize. If we only retrieve LocationId and Name from EmployeeCurrentData table, and we do not need any columns from Employees table, then we shall be using the following filtered index:

If we do need columns from Employees table, such as HiredAt, than we need to create an indexed view:

These two new objects, the indexed view and the filtered index, allow for much faster queries, as compared to the ones described in Grant’s article – but only if we explicitly use the condition IsLastVersion=’Y‘ in our queries. We shall benchmark and see that for ourselves.

Verifying the Performance Advantages

Before we begin our benchmarking, let us clearly state an assumption: we expect a lot of versions for every employee. Should we have, for example, just one or two versions for most employees, we might (or might not) get different results. Let us suppose, however, that in our system we are expecting many versions per employee. Our test data has been created using this assumption.

Suppose that we only need the latest versions of Name and LocationId. If we use IsLastVersion column in our query, it uses our filtered index and is very fast. Let us run it, as well as the equivalent query not using IsLastVersion:

The first thing that we can notice immediately is that the first query has a much simpler plan - all the data it needs is stored in our filtered index. The execution plan is so simple that there is no need to show it here.

The second query acceses two tables, joined with nested loops, which should use up somewhat more CPU, and should need substantially more reads:

1313-image_01small.png

When we actually run both queries, we see that the first one is requiring fewer reads and using less CPU. However, both queries run for almost the same time on my laptop, as shown in the following table (naturally your mileage may vary):

 

Query that uses IsLastVersion

Query that does not use IsLastVersion

Duration

Appr. 1400 ms

Appr. 1400 ms

Reads

349

319209

CPU

78 ms

531 ms

These results may look a little bit confusing: how is it possible that the first query uses 6.8 times less CPU, does 914 times less reads, yet it runs for almost the same time as the second one?

The reason is simple: in both cases most of the time is spent transferring 100K rows and rendering the result set in SSMS. This is easy to demonsrate. Let us materialize the results in a table, and just select all the data from that table, as follows:

As expected, we are getting the same duration: this last query is as simple as it goes, and it runs for approximately 1400 ms, just like the previous two ones.

Let us change our benchmarking just a little bit: instead of comparing the performance of a executing a query, transferring 100K rows, and rendering them on the client, let us filter out most of the rows, so that only a small result set is sent over and displayed. Clearly we want this filtering to be non-intrusive: it should not change the execution plans, and it should not use too much CPU. For these purposes I have added the following condition to both queries:

This numeric comparison is a very cheap; it does not use much CPU. Also it is not index-friendly, so we are still getting the same execution plans for both queries. The modified queries are as follows:

Both queries with this additional condition return only one row, the cost of transferring and displaying one row is small as compared to the cost of running queries, and the duration now reflects the difference in queries' performance just as well as CPU and reads do:

 

Query that uses IsLastVersion

Query that does not use IsLastVersion

Duration

40 ms

372 ms

Reads

349

319209

CPU

47 ms

656 ms

Note: although the second query used 656 ms, it used a parallel plan and as such completed in only 372 ms - these 656 ms were spread over two processors.

Now we'll benchmark the query that involves having to retrieve columns from both the Employee and EmployeeDetails tables. For this case we have already created an indexed view which uses IsLastVersion. As before, we shall add the same filtering to exclude from our comparison the costs of sending and rendering a large result set.

The following two queries retrieve the same results, the same one row. The first one uses IsLastVersion, the second one does not:

To satisfy the first query, the database engine simply scans our indexed view. The second query gets a nested-loopish execution plan, just as in the previous case.

The performance advantages of using IsLastVersion, both in reads and in CPU, are just as impressive as in the previous case:

 

Query that uses IsLastVersion

Query that does not use IsLastVersion

Reads

411

320033

CPU/Duration

15 ms/77ms

484 ms/261 ms

As in the previous case, the duration of the second query is less than the amount CPU it used because the second query runs in parallel.

As we have seen, using IsLastVersion column allows for much better performance. However, can we guarantee that it always stores correct values? In most cases we choose slower queries which always return correct values over faster queries that can sometimes return wrong results. So, we have to ensure the complete correctness of all the values in IsLastVersion column.

Enforcing the Integrity of the IsLastVersion Column

In this section we shall enforce the following data integrity rules:

For every row in Employee table that has any corresponding rows in EmployeeDetails table, there is exactly one row in EmployeeDetails table with IsLastVersion='Y' Every row in EmployeeDetails table with IsLastVersion='Y' has the biggest VersionNumber for that employee

Our filtered index, EmployeeDetails_CurrentVersions, already guarantees that there is at most one last version per employee. The following script completes the implementation of these two data integrity rules:

Essentially this constraint means that if a version is not marked as the last, then it must refer to a next version, with the next value of VersionNumber. This means that we can only insert a row with the largest version number if and only if it is marked as IsLastVersion='Y'.

As long as these constraints are trusted, we can safely use IsLastVersion='Y' in our queries, and benefit from much faster queries.

Note: as a side effect of this foreign key constraint, we cannot have gaps in version numbers. For example, if we have versions 2 and 4, than we must have version 3 as well. Typically this is a very good thing - it enforces data integrity. The existing solution does not guarantee that we have version number 1, but this is only because the emphasis of this article is on performance, and for best performance we only require that the last version is marked as such. For better data integrity, the existing solution can be easily tweaked to also enforce the rule that version number 1 always exists. I'll leave this as an advanced exercise.

Naturally, when we insert a new latest version, we have to unmark the current latest version, settting it to IsLastVersion='N', because we cannot have more than one latest version per employee. We have to both modify the previous latest version, and insert the new row in the same statement, otherwise we shall have a constraint violation. The following MERGE command shows how to do it:

This MERGE command in its current form will only work if we already have at least one version added before. However, it is very easy to change it, so that it can add the very first version as well. I'll leave you to figure out how to do this!

Conclusion

If we add a pre-calculated column IsLastVersion, we can benefit from much faster selecting of the latest version, but we have to pay a price for it: slower and more complex modifications. In some cases this price is too high.

Let me reuse the argument from my previous article, "Tuning SQL queries with the help of constraints" (http://www.simple-talk.com/sql/t-sql-programming/tuning-sql-queries-with-the-help-of-constraints/).

If we need high quality video and fast playback, we invest in burning a DVD. If we need acceptable quality video, and we need a cheap solution, then a webcam is a better choice.

The approach described in this article is not the right one for all circumstances: It only recommended for those cases where our highest priority is both the performance of selecting the latest version and the quality of results.

I would like to thank Plamen Ratchev, Phil Factor and Grant Fritchey for reviewing this article.