In-Memory OLTP: Row Visibility in SQL Server's MVCC

Introduction

In the multi-valued concurrency control (MVCC) model within the in-memory OLTP engine, a row going through updates may have multiple versions in memory simultaneously. Rows inserted and deleted in pre- and post-commit states are also present. Transactions running concurrently and serially may access different row versions at once, depending on rules, toward correct—or sometimes incorrect—outcomes.

In this article I’ll try to give you a solid understanding of what row versions a transaction may see under various conditions. We’ll start with groundwork concepts and an access rule for in-memory rows. Then I’ll expose a timing issue inherent in the model. With this understanding, we can explore further concepts and constructs that fit together to make row visibility consistent in the MVCC.

Row Visibility: Basics

In-memory rows are structured differently than those in disk-based tables. In lieu of the page structure and its need for logical locking and physical latching, and other differences, each row has a metadata header and payload:

The payload comprises the actual table columns and their values for the row.

I’ve left out most of the header fields to focus on the two directly involved in row visibility. Both Begin Timestamp (Begin-Ts) and End Timestamp (End-Ts) are monotonically increasing database-wide numbers that serialize the time at which transactions that create and delete rows commit their work. I’ll refer to row versions only by their transaction serial numbers: <Begin-Ts, End-Ts>; the payload does not play a role in visibility.

We’ll soon see where these numbers come from, but for now let’s try an example. I’ll also use the terms row and row version interchangeably throughout.

<10, ∞>

<31, 240>

The upper row was inserted by a transaction that committed at timestamp 10, and the infinity symbol for the end timestamp means that no transaction has deleted this row; this row version is currently in the table. The lower row was inserted by a transaction that committed at a later point in time—at transaction serial time 31—and was deleted by a transaction having commit timestamp 240.

The payload in an in-memory row can never change. To update column values, the End-Ts for the current row version (remember, End-Ts is ∞) is marked with the transaction’s commit timestamp, and a new row with the updated payload columns is inserted with Begin-Ts having the same commit timestamp and infinity for the End-Ts. These two operations are done in one atomic step.

<31467, ∞>

<20000, 31467>

<10, 20000>

In this sample, if all three rows have the same primary key—which, in in-memory tables, are immutable —then the lowest row would be the row’s insertion, and those above it resulted from two updates in separate transactions committing at timestamps 20000 and 31467 respectively. If the keys are different, then the bottom row was inserted by an early transaction, then deleted by a later transaction, which also inserted the middle row, and the top row was inserted by a transaction with the latest commit timestamp, and this transaction also deleted the middle row.

The samples imply that although the in-memory OLTP engine works with INSERT, UPDATE, and DELETE modification statements, they reduce to operations insert <C-Ts, ∞> and (logical) delete <original C-Ts, C-Ts>, where C-Ts is the commit timestamp of the transaction doing the work.

Timestamps Begin-Ts and End-Ts are gotten from a counter of type bigint known as the Global Transaction Timestamp (GTTs). Its value at any moment was the one given to the latest transaction that issued a commit. (For implicit and explicit transactions, the COMMIT TRAN statement is part of the code, and for single-statement autocommit and atomic block statements used with natively compiled procedures, it is written in automatically.)

Here are its rules:

The counter value is persisted through instance restarts to maintain transaction serial order and correct row version visibility.
At execution start, the transaction is assigned the GTTs current value—the logical read time. This determines its single point-in-time row version visibility. Any number of concurrent transactions can have the same read timestamp.
At commit, the transaction is assigned the post-incremented GTTs value—the commit timestamp—making the number always one or more higher than any currently active transaction’s logical read time. This value is written into Begin-Ts and End-Ts fields in row headers as was discussed. Commit timestamps are unique across all committing transactions.

We now have enough information to understand row visibility.span style=”mso-spacerun:yes”> The <Begin-Ts, End-Ts> pair is the validity interval. A transaction may access a row version if its position in the transaction serial order is at least as great as that of the transaction that created the row, and less than the transaction that deleted it, if any; i.e. its order falls within the validity interval of the row:

Begin-Ts <= logical read time < End-Ts.

In the three-row sample above, validity intervals for all row versions are disjoint, so a transaction with logical read time of at least 10 could see exactly one of the rows. By contrast, given intervals <50, 80>, <60, ∞>, a transaction with logical read time from 60 to 79 could see both rows (and the versions can’t share the same primary key).

A Timing Issue

These concepts are sufficient for row visibility—but only in a static database. In-memory OLTP is meant to increase performance in (highly) transactional systems, though, so let’s look at behavior of transactions running together.

For the examples, rows x and y have these begin and end timestamps in their current versions before the transactions start:

X: <10, ∞>

Y: <676, ∞>

I’ll use a modified schedule representation to show interleaving of operations:

When transaction i–txn-i for short—commits, say it gets timestamp 12345; these are the new row versions:

X: <10, 12345>

—————–

Y: <12345, ∞>

<676, 12345>

If no other transactions commit between the time txn-i commits and txn-k starts, the latter will get the logical read time 12345; else it will be greater. In either case, by the formula txn-k reads the newly inserted y row version having interval <12345, ∞> and x is not visible.

Now let’s interleave the transactions’ operations:

If txn-i gets the same commit timestamp, then this time the logical read time for txn-k must be less than 12345. In this case, for the y row it reads the deleted version, <676, 12345>. It had already read the x version whose header values were <10, ∞>. If it were to re-read the row later, it would access the same version, now with interval <10, 12345>. Which is also correct:

All rows [re-]read by a transaction remain consistent as of a single point in time—at transaction start. Providing consistent read sets (snapshots) is the rationale for row versioning.

Now imagine that in either schedule txn-k reads both x and y before txn-i commits; for example:

Because txn-i doesn’t have a commit timestamp while it inserts and deletes rows, affected Begin-Ts and End-Ts values are unknown when txn-k reads the rows, which it can do immediately in an optimistic, non-blocking system. The basic row visibility formula cannot be applied.

Even the original schedules can have this problem. Though txn-i does have the commit timestamp, there is a lag time, as we’ll see, before the in-memory OLTP engine can write commit timestamps to headers. In this case, if we imagine txn-i affecting thousands of rows, txn-k can read some headers with complete interval values and some having unknown values. To make it worse, txn-i can issue a rollback or be aborted in a post-commit phase.

When validity intervals are incomplete, how does a reader know which schedule is in effect and what action to take?

Two Puzzle Pieces

When a transaction starts, it gets the timestamp value of the last transaction issuing a commit from the GTTs counter, which is its logical read time. It also gets, from the in-memory OLTP engine, a unique, serial transaction identifier. For a transaction k, I’ll refer to it as txn-k ID. The identifier counter is of type bigint, incrementing for each new transaction involving in-memory structures. Unlike the GTTs, it isn’t persisted between instance startups (because there is no need to identify non-existent transactions).

Our second puzzle piece is the lifecycle of an MVCC transaction. At the 30,000 feet elevation, imagine it as having three phases:

The Active Phase

The Active phase starts with the transaction, and ends when the COMMIT (or ROLLBACK) command is reached.

As each insert and delete operation is performed during the Active phase, affected rows immediately receive the transaction’s unique identifier in the appropriate header timestamp fields, along with a bit flag signaling that the value is an identifier not a timestamp. This is a significant placeholder as we’ll see—the unknown from the previous section—that will later be replaced by the commit timestamp upon successful completion.

For row x <10, ∞> consider this schedule and timeline for reader txn-k and writer txn-i:

The timeline is meant to show txn-k starting before txn-i, at the same time, or later but before txn-i reaches the COMMIT command. Here are the versions of x after update:

<txn-i ID, ∞>

<10, txn-i iD>

If txn-i successfully commits, then its commit timestamp must be greater than the logical read time for txn-k. Therefore, in each read, txn-k accesses the same original row version: first with interval <10, ∞>, then <10, txn-i ID>, then <10, C-Ts> if the commit timestamp replaces the transaction identifier in time, or <10, txn-i ID> again if not. Were txn-i to rollback or fail, internal code will reset the End-Ts to ∞ in the original row. In either case, reader txn-k is unaffected by concurrently running writer txn-i.

Let’s add this rule to our basic visibility rule:

For txn-k and txn-i: txn-k can read <C-Ts, txn-i ID> when C-Ts <= txn-k’s logical read time and their Active phases overlap.

Then this must also be true:

txn-k cannot read <txn-i ID, ∞> when their Active phases overlap.

Logical read time for txn-k must be before the newly inserted row’s validity interval when txn-i commits. If the writer rolls back or fails for any reason, row <txn-i ID, ∞> will be inaccessible—marked for garbage collection.

The Validate Phase

Now let’s consider row visibility when readers start during writers’ Validate phase. We adjust the concurrency timeline thus:

Reader txn-k starts on or after txn-i’s commit statement but before its final resolution point, and ends anytime afterward. Before we can examine visibility, though, we need a little understanding of the Validate phase.

The in-memory OLTP engine performs these actions during validation: check for transaction isolation level violations; wait for ‘commit dependencies’ to clear—which I’ll explain in a moment; and write modifications to the log upon successful outcome (depending on table durability; not covered). But any of these steps can fail, although rarely. In this case the engine rolls back the transaction and marks it ABORTED; else, it marks it COMMITTED. Therefore, when txn-k reads a row version inserted by txn-i before txn-i reaches the final committed/aborted point, although the commit timestamp is available, the row’s status is unknown and the reader takes a commit dependency on the writer.

If the writer aborts, the reader has essentially done an unsuccessful dirty read. Row versions with validity interval <C-Ts, txn-i ID> and logical read time for txn-k >= C-Ts should have been visible but weren’t because the engine will put back the infinity value where the transaction ID is. When the engine rolls back txn-i it must also roll back txn-k. It will also make rows with intervals <txn-i ID, ∞> invisible to readers by marking them for garbage collection.

Rows visible to txn-k instead have intervals <txn-i ID, ∞> because txn-k’s logical read time is >= txn-i’s commit timestamp and the engine optimistically assumes txn-i will be committed. After txn-i is marked committed, the commit dependency that reader txn-k had on it is released. If txn-k is in its own Validate phase before txn-i resolves, it may wait, however briefly, for the commit dependency to clear.

To make a rule of this:

For txn-k and txn-i: txn-k can read <txn-i ID, ∞> with commit dependency when it starts during txn-i’s Validate phase

Certainly:

txn-k cannot read <C-Ts, txn-i ID> when it starts during txn-i’s Validate phase

Commit dependencies form chains of two or more transactions; think of it as directed acyclic graphs. The acyclic part means no deadlocking. There is no deadlocking in in-memory OLTP. Period.

The Resolve Phase

In the final phase, reader txn-k starts on or after Txn-i’s Validate phase finishes with the COMMITTED or ABORTED result, and ends anytime afterward.

The way I’ve structured the lifecycle of a transaction, resolution begins when the transaction is committed or aborted, and (if committed) it can start releasing commit dependencies held by readers. It may be more common to think of the release of commit dependencies as ending the validation phase; I’ve placed it here to show that a reader transaction starting in my Resolve phase doesn’t take out a commit dependency on the writer because the writer’s outcome is known and was made permanent. Only the order of operations is important: first write to log, then mark transaction as committed, then release commit dependencies.

For txn-k and txn-i: txn-k can read <txn-i ID, ∞> when it starts during txn-i’s Resolve phase and txn-i’s state is committed

Also:

txn-k can read <C-Ts, txn-i ID> when it starts during txn-i’s Resolve phase and txn-k’s logical read time >= C-Ts and txn-i’s state is aborted

After releasing commit dependencies, the engine replaces the transaction IDs in row header fields with commit timestamps and clears the transaction ID bit flags. (You may know this final operation as being in the post-processing or commit processing phase.)

The Third Interlocking Puzzle Piece

This model is sufficient to ensure that transactions always read the right row versions, and when failure happens, that the in-memory OLTP engine can take corrective action. To fulfill the implementation, the third puzzle piece is no surprise: the engine must have metadata to track logical read times, commit timestamps and transaction states. The data is in a table known as the global transaction table (GTTbl), and we can access it through a dynamic management view (DMV), sys.dm_db_xtp_transactions. Table 1 shows sample readout with most columns elided:

The column meanings are as follows:

xtp_transaction. The unique, increasing serial number assigned to the transaction. I’ve been referring to it in examples as txn-i ID. This value is also unique in the DMV.
begin_tsn. Begin transaction serial number. The logical read time of the transaction.
end_tsn. End transaction serial number. The commit timestamp. This is 0 until the COMMIT statement is reached and resets to 0 if the transaction aborts. Replaces the xtp_transaction value in Begin-Ts and End-Ts fields during the Resolve phase.
state_desc. This can be ACTIVE, VALIDATING, COMMITTED, and ABORTED.
result_desc. Some results are as shown above; see the books online documentation for more.

When a transaction comes to a row that otherwise meets the constraint conditions of its query or modification statement and has a transaction ID in either of the Begin-Ts or End-Ts header fields, it finds the row writer’s transaction in the GTTbl by matching the transaction ID to the unique xtp_transaction column. It then compares this transaction’s end_tsn to its own logical read time and takes into account the state column in deciding if the row is visible. The decision, of course, is based upon the rules we’ve developed.

For example, referring back to Schedule 4 and discussion, txn-k’s first read[x] accesses row version <10, ∞> without need of the GTTbl. After txn-i updates x, the second read[x] finds versions <10, txn-i iD> and <txn-i ID, ∞> and so matches the ID to the corresponding xtp_transaction column in the GTTbl to get txn-i’s row. Because the state_desc column shows ACTIVE, read[x] yields the (same) <10, txn-i iD> version. When txn-i commits and txn-k does the third read, there is a timing issue: either of the row interval IDs may not yet have been replaced with the C-Ts. In this case, txn-i’s state_desc is VALIDATING or COMMITTED (or ABORTED) and so the engine can use the end_tsn to pick the (same) visible row version.

Final Word

We’ve seen how begin and end timestamps in row metadata define the row’s validity interval and allow for row versioning. This along with transaction logical read times define visibility. However, by examining schedules we saw that validity intervals can be incomplete. The solution I used to deduce visibility centered on a three-part model of transaction processing. Finally, we saw a DMV that showed how SQL Server tracks transaction metadata to make the MVCC work.

This article was all conceptual; no T-SQL were harmed during its production. For row visibility explained with T-SQL samples and further information, see references next.

References

Delaney, K., SQL Server Internals: In-Memory OLTP Inside the SQL Server 2014 Hekaton Engine, Simple Talk Publishing, Cambridge, 2014, 215 p.

Korotkevich, D., Expert SQL Server In-Memory OLTP Revolutionizing OLTP Performance in SQL Server, APress, New York, 2015, 248 p.

Miranda, M., 2016, In-Memory OLTP – Part 3 Row Structure and Indexes, Simple Talk, https://www.simple-talk.com/sql/database-administration/in-memory-oltp—row-structure-and-indexes/ (August 16, 2016)

In-Memory OLTP (In-Memory Optimization), Microsoft Developer Network, https://msdn.microsoft.com/en-us/library/dn133186.aspx (August 16, 2016)

Register for Simple Talk

In-Memory OLTP: Row Visibility in SQL Server’s MVCC

Introduction