The Case of the Skewed Totals

Even when your code tests out perfectly in a single-user test environment, you can experience errors in the real production setting, where several processes are hitting the database at once in unpredictable ways. You shouldn't, of course, let it get that far, because there are now ways of simulating concurrency during the test process.

Concurrency in databases may cause subtle problems that are difficult to explain and difficult to reproduce. For example, what happens when several processes are writing to and reading from the database at the same time, in unpredictable ways? Will your reports still be completely correct? Not necessarily.

Believe it or not, when you simultaneously modify data and run aggregate queries against the same data, your aggregates may intermittently be slightly incorrect. This phenomenon may go unnoticed for some time, and even when it is noticed, it is difficult to reproduce; in fact, such problems are impossible to reproduce without simulating concurrency. This is a very common real-life problem. In this article I will reproduce this phenomenon and demonstrate how the accuracy of the information returned by your SELECT statements may depend on the isolation level.

Prerequisites

We shall be reusing the StressTests database and StressTester class which were described in the previous article, 'Close Those Loopholes: Stress-Test Those Stored Procedures'. Make sure that read committed snapshot isolation is turned off by executing this code:
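The original script is not reproduced here; a minimal sketch, assuming the database is named StressTests as in the previous article:

```sql
-- Disable read committed snapshot so that plain READ COMMITTED uses
-- locking rather than row versioning. WITH ROLLBACK IMMEDIATE kicks out
-- any connections holding open transactions so the change can proceed.
ALTER DATABASE StressTests
SET READ_COMMITTED_SNAPSHOT OFF
WITH ROLLBACK IMMEDIATE;
```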

Also, you will need a few additional tables and stored procedures, which are provided in the StressTest_MoreDDL.sql file.

Reproducing a select returning incorrect totals

Before stress testing, your test harness will generate some test data – exactly 10000 users and exactly 10000 tickets assigned to them. During stress testing one thread will be randomly re-assigning tickets to other users, without adding or removing any tickets whatsoever:
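The harness code itself lives in TotalsTest.cs; the reassignment step it executes in a loop might look like the following sketch. The procedure and column names (dbo.ReassignRandomTicket, Tickets, AssignedTo) are assumptions based on the description, not the shipped DDL:

```sql
-- Hypothetical sketch of the writer thread's work: pick a random
-- ticket and a random user, and reassign the ticket. No tickets are
-- ever added or removed, so the true count stays at exactly 10000.
CREATE PROCEDURE dbo.ReassignRandomTicket
AS
BEGIN
    SET NOCOUNT ON;
    -- NEWID() gives a fresh random value per call; both IDs fall in 1..10000.
    DECLARE @TicketId INT = 1 + ABS(CHECKSUM(NEWID())) % 10000;
    DECLARE @UserId   INT = 1 + ABS(CHECKSUM(NEWID())) % 10000;

    UPDATE dbo.Tickets
    SET AssignedTo = @UserId
    WHERE TicketId = @TicketId;
END;
```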

At the same time, another thread will be counting all the tickets in the system:
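A rough sketch of what the reader thread does on each iteration; the procedure name dbo.CountAllTickets and the output-parameter shape are assumptions for illustration:

```sql
-- Hypothetical driver for the reader thread: count the tickets
-- repeatedly and flag any total that differs from the expected 10000.
DECLARE @Total INT, @i INT = 0;
WHILE @i < 1000
BEGIN
    EXEC dbo.CountAllTickets @Total = @Total OUTPUT;
    IF @Total <> 10000
        PRINT 'Incorrect total: ' + CAST(@Total AS VARCHAR(10));
    SET @i = @i + 1;
END;
```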

The tables that store users and tickets and the stored procedure which counts the tickets are as follows:
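The actual DDL ships in StressTest_MoreDDL.sql; the following is an assumed shape consistent with the article's description (table, column, and procedure names are guesses):

```sql
CREATE TABLE dbo.Users
(
    UserId INT NOT NULL PRIMARY KEY
);

CREATE TABLE dbo.Tickets
(
    TicketId   INT NOT NULL PRIMARY KEY,
    AssignedTo INT NOT NULL REFERENCES dbo.Users(UserId)
);

-- The counting select scans this index; entries moving between its
-- pages mid-scan are what cause the skewed totals.
CREATE INDEX Tickets_AssignedTo ON dbo.Tickets(AssignedTo);
GO

CREATE PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- Runs under the session's current isolation level
    -- (READ COMMITTED by default).
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```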

Note that it uses the default READ COMMITTED isolation level. Although there are exactly 10000 tickets in the system at any time, the totals will in many cases be slightly different. In fact, when I run the test, the totals are incorrect in about 50% of cases:

The full test harness is provided in TotalsTest.cs.

As you have seen, the default READ COMMITTED isolation level does not guarantee correct totals, but why is that? Is it some kind of bug in SQL Server? No, this is not a bug; SQL Server works as designed. The explanation of the incorrect totals is very simple: as the select scans the index on the AssignedTo column, it does not keep locks on index pages after it has finished reading them. So, if a ticket is reassigned, the corresponding index entry can be moved to another index page, and as a result it can be either read twice or not read at all. Note that in this case the totals can be either slightly more or slightly less than 10000.

Trying out the REPEATABLE READ isolation level

Let us up the isolation level of the selects to REPEATABLE READ:
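One way to do this is to set the isolation level inside the counting procedure itself. The procedure name dbo.CountAllTickets is an assumption, not taken from the shipped scripts:

```sql
-- Raise the isolation level for the counting query to REPEATABLE READ.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- The setting reverts automatically when the procedure returns.
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```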

Rerun the stress test. This time the number of incorrect totals has decreased, and all of the incorrect totals are less than 10000; none exceed it. The reason is simple: to ensure that the reads are repeatable, this isolation level prevents an index entry that has already been read from moving to another index page, so a reassigned ticket will not be counted twice. However, it does not prevent index entries that have not been counted yet from moving to an already-scanned page.

Trying out the SERIALIZABLE isolation level

Let us up the isolation level to SERIALIZABLE:
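The same change as before, with the stricter level (again assuming the hypothetical procedure name dbo.CountAllTickets):

```sql
-- Raise the isolation level for the counting query to SERIALIZABLE.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- SERIALIZABLE takes range locks, so no entry can appear in,
    -- disappear from, or move within the scanned range mid-query.
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```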

Rerun the stress test. This time all the totals are correct, but there are so many deadlocks (on my laptop, up to 50% of selects are deadlock victims) that the price of 100% correct results is in many cases prohibitive.

Using snapshot isolation

Let us make sure that snapshot isolation is enabled, as follows:
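A sketch of the two required steps, assuming the StressTests database name and the hypothetical procedure name dbo.CountAllTickets used above:

```sql
-- Step 1: enable row versioning for snapshot isolation
-- at the database level.
ALTER DATABASE StressTests SET ALLOW_SNAPSHOT_ISOLATION ON;
GO

-- Step 2: have the counting procedure read under SNAPSHOT, so it sees
-- a consistent point-in-time view of the Tickets table.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    SET TRANSACTION ISOLATION LEVEL SNAPSHOT;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```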

Under snapshot isolation, these selects and modifications do not block each other. As a result, all your selects return correct results, and there are no deadlocks involving a reader and a writer. Also, because there is no waiting on locks, the stress test completes significantly faster. Rerun the stress test and see for yourself.

Using dirty reads, a.k.a. the READ UNCOMMITTED isolation level

Run the following SQL script and rerun the stress test:
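The script is not shown; a plausible version, again assuming the hypothetical procedure name dbo.CountAllTickets:

```sql
-- Switch the counting query to dirty reads.
ALTER PROCEDURE dbo.CountAllTickets
    @Total INT OUTPUT
AS
BEGIN
    -- Equivalent to querying dbo.Tickets WITH (NOLOCK): the select
    -- takes no shared locks and may read uncommitted changes.
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
    SELECT @Total = COUNT(*) FROM dbo.Tickets;
END;
```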

Note that your selects run faster and you get fewer incorrect totals than when you were using the default READ COMMITTED isolation level. The explanation is straightforward: selects running under the READ UNCOMMITTED isolation level acquire fewer locks. As such, they run faster, so the chance of an index entry being counted twice or not counted at all is smaller.

Conclusion

A solid understanding of concurrency is essential in database programming, and I think that experimenting with concurrently running queries is a very good way to acquire it. In this article I have demonstrated why selects may work perfectly without concurrency yet intermittently fail in real-life, multi-user environments.

As you have seen, in this particular case the best choice is clear: the query running under snapshot isolation always returns correct results, and it never fails due to deadlocks. Although the best choice in your particular situation may be different, the choice of isolation level may be very important. Do not make this choice lightly – make sure that you verify your choice against a realistic amount of data, and perform enough realistic concurrency tests.