The Case of the Skewed Totals

Even when your code tests out perfectly in a single-user test environment, you can experience errors in the real production setting, where several processes are hitting the database at once in unpredictable ways. You shouldn't, of course, let it get that far, because there are now ways of simulating concurrency during the test process.

Concurrency in databases may cause subtle problems that are difficult to explain and difficult to reproduce. For example, what happens when several processes are writing to and reading from the database at the same time, in unpredictable ways? Will your reports still be completely correct? Not necessarily.

Believe it or not, when you simultaneously modify data and run aggregate queries against the same data, your aggregates may intermittently be slightly incorrect. This phenomenon may go unnoticed for some time, and even when it is noticed, it is difficult to reproduce; in fact, such problems are impossible to reproduce without simulating concurrency. This is a very common real-life problem. In this article I will reproduce this phenomenon and demonstrate how the accuracy of the information returned by your SELECT statements may depend on the isolation level.

Prerequisites

We shall be reusing the StressTests database and StressTester class which were described in the previous article, 'Close Those Loopholes: Stress-Test Those Stored Procedures'. Make sure that read committed snapshot isolation is turned off by executing this code:
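The original script is not reproduced here; a minimal sketch, assuming the database is named StressTests as in the previous article:

```sql
-- Disable read committed snapshot so that plain READ COMMITTED uses
-- locking rather than row versioning. WITH ROLLBACK IMMEDIATE kicks out
-- any connections holding open transactions so the change can proceed.
ALTER DATABASE StressTests
SET READ_COMMITTED_SNAPSHOT OFF
WITH ROLLBACK IMMEDIATE;
```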

Also, you will need a few additional tables and stored procedures, which are provided in the StressTest_MoreDDL.sql file.

Reproducing a select returning incorrect totals

Before stress testing, your test harness will generate some test data – exactly 10000 users and exactly 10000 tickets assigned to them. During stress testing one thread will be randomly re-assigning tickets to other users, without adding or removing any tickets whatsoever:
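The harness code itself lives in TotalsTest.cs; the reassignment step it executes in a loop might look like the following sketch. The procedure and column names (dbo.ReassignRandomTicket, Tickets, AssignedTo) are assumptions based on the description, not the shipped DDL:

```sql
-- Hypothetical sketch of the writer thread's work: pick a random
-- ticket and a random user, and reassign the ticket. No tickets are
-- ever added or removed, so the true count stays at exactly 10000.
CREATE PROCEDURE dbo.ReassignRandomTicket
AS
BEGIN
    SET NOCOUNT ON;
    -- NEWID() gives a fresh random value per call; both IDs fall in 1..10000.
    DECLARE @TicketId INT = 1 + ABS(CHECKSUM(NEWID())) % 10000;
    DECLARE @UserId   INT = 1 + ABS(CHECKSUM(NEWID())) % 10000;

    UPDATE dbo.Tickets
    SET AssignedTo = @UserId
    WHERE TicketId = @TicketId;
END;
```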

At the same time, another thread will be counting all the tickets in the system:
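A rough sketch of what the reader thread does on each iteration; the procedure name dbo.CountAllTickets and the output-parameter shape are assumptions for illustration:

```sql
-- Hypothetical driver for the reader thread: count the tickets
-- repeatedly and flag any total that differs from the expected 10000.
DECLARE @Total INT, @i INT = 0;
WHILE @i < 1000
BEGIN
    EXEC dbo.CountAllTickets @Total = @Total OUTPUT;
    IF @Total <> 10000
        PRINT 'Incorrect total: ' + CAST(@Total AS VARCHAR(10));
    SET @i = @i + 1;
END;
```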

The tables that store users and tickets and the stored procedure which counts the tickets are as follows:
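The actual DDL ships in StressTest_MoreDDL.sql; the following is an assumed shape consistent with the article's description (table, column, and procedure names are guesses):

```sql
CREATE TABLE dbo.Users
(
    UserId INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Tickets
(
    TicketId   INT NOT NULL PRIMARY KEY,
    AssignedTo INT NOT NULL REFERENCES dbo.Users(UserId)
);

-- The counting select scans this index; entries moving between its
-- pages mid-scan are what cause the skewed totals.
CREATE INDEX Tickets_AssignedTo ON dbo.Tickets(AssignedTo);
GO

CREATE PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- Runs under the session's current isolation level
    -- (READ COMMITTED by default).
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```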

Note that it uses the default READ COMMITTED isolation level. Although there are exactly 10000 tickets in the system at any time, the totals will in many cases be slightly different. In fact, when I run the test, the totals are incorrect in about 50% of cases:

The full test harness is provided in TotalsTest.cs.

As you have seen, the default READ COMMITTED isolation level does not guarantee correct totals, but why is that? Is it some kind of bug in SQL Server? No, this is not a bug; SQL Server works as designed. The explanation of the incorrect totals is very simple: as the select scans the index on the AssignedTo column, it does not keep locks on index pages after it has finished reading them. So, if a ticket is reassigned, the corresponding index entry can be moved to another index page, and as a result it can be either read twice or not read at all. Note that in this case the totals can be either slightly more or slightly less than 10000.

Trying out the REPEATABLE READ isolation level

Let us up the isolation level of the selects to REPEATABLE READ:
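One way to do this is to set the isolation level inside the counting procedure itself. The procedure name dbo.CountAllTickets is an assumption, not taken from the shipped scripts:

```sql
-- Raise the isolation level for the counting query to REPEATABLE READ.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- The setting reverts automatically when the procedure returns.
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```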

Rerun the stress test. This time the number of incorrect totals has decreased, and all of the incorrect totals are less than 10000; none exceed it. The reason is simple: to ensure that the reads are repeatable, this isolation level prevents an index entry that has already been read from moving to another index page, so a reassigned ticket will not be counted twice. However, it does not prevent index entries that have not been counted yet from moving to an already-scanned page.

Trying out the SERIALIZABLE isolation level

Let us up the isolation level to SERIALIZABLE:
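The same change as before, with the stricter level (again assuming the hypothetical procedure name dbo.CountAllTickets):

```sql
-- Raise the isolation level for the counting query to SERIALIZABLE.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- SERIALIZABLE takes range locks, so no entry can appear in,
    -- disappear from, or move within the scanned range mid-query.
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```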

Rerun the stress test. This time all the totals are correct, but there are so many deadlocks (on my laptop, up to 50% of selects are deadlock victims) that the price of 100% correct results is in many cases prohibitive.

Using snapshot isolation

Let us make sure that snapshot isolation is enabled, as follows:
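A sketch of the two required steps, assuming the StressTests database name and the hypothetical procedure name dbo.CountAllTickets used above:

```sql
-- Step 1: enable row versioning for snapshot isolation
-- at the database level.
ALTER DATABASE StressTests SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

-- Step 2: have the counting procedure read under SNAPSHOT, so it sees
-- a consistent point-in-time view of the Tickets table.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```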

Under snapshot isolation, these selects and modifications do not block each other. As a result, all your selects return correct results, and there are no deadlocks involving a reader and a writer. Also, because there is no waiting on locks, the stress test completes significantly faster. Rerun the stress test and see for yourself.

Using dirty reads, a.k.a. the READ UNCOMMITTED isolation level

Run the following SQL script and rerun the stress test:
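The script is not shown; a plausible version, again assuming the hypothetical procedure name dbo.CountAllTickets:

```sql
-- Switch the counting query to dirty reads.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- Equivalent to querying dbo.Tickets WITH (NOLOCK): the select
    -- takes no shared locks and may read uncommitted changes.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```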

Note that your selects run faster and you get fewer incorrect totals than when you were using the default READ COMMITTED isolation level. The explanation is straightforward: selects running under the READ UNCOMMITTED isolation level acquire fewer locks. As such, they run faster, so the chance of an index entry being counted twice or not counted at all is smaller.

Conclusion

A solid understanding of concurrency is essential in database programming, and I think that experimenting with concurrently running queries is a very good way to acquire it. In this article I have demonstrated why selects may work perfectly without concurrency yet intermittently fail in real-life, multi-user environments.

As you have seen, in this particular case the best choice is clear: the query running under snapshot isolation always returns correct results, and it never fails due to deadlocks. Although the best choice in your particular situation may be different, the choice of isolation level may be very important. Do not make this choice lightly – make sure that you verify your choice against a realistic amount of data, and perform enough realistic concurrency tests.