Transactions

By now, pretty much everyone has heard the stories about the first computers. They were huge, cost a fortune, required incredible amounts of air conditioning, and, perhaps most importantly, ran batch jobs submitted on punched cards and magnetic tape. But even before we had these dinosaurs, there were EAM (electric accounting machine) units. They could run one and only one job at a time, and you could not share data. One of the major steps in the evolution of computing was sharing data among applications at the same time. Random-access processing was not possible until we got disk-based storage systems.

Just consider a simple business application in which we use employee timecards. We need an application to compute each employee's hours worked and apply a pay schedule; its output will probably be sorted by an employee identifier or tax number. But we also have budgets for various projects, which requires grouping the same timecard information by some kind of project identifier. In the very early days of computing, there was a lot of scheduling of user access to the computer. For example, something really important like the payroll would have precise timeslots allocated to run it. Nobody just casually walked into the computer center with a program they liked to play with and got access to company information.

Multitasking had not been invented yet. The model for processing was basically a simple job queue. In theory, a simple queue will always finish working. It might take a long time, but since each job is independent of the others, there’s no conflict over resources. For example, only one job at a time had access to the printer, so you didn’t have to worry about two or more users trying to print at the same time.

In theory, we could get deadlocks, in which Process Alpha wants a resource like a printer, and so does Process Beta. One process could wait until the other finishes; but let's assume that both processes require two or more resources at the same time. Process Alpha locks up resource number one and waits for resource number two. At the same time, Process Beta locks up resource number two and waits for resource number one. This is called a deadlock. There were various rules for resolving them, but they all involved Process Alpha or Process Beta releasing its hold on the resources that are in conflict.
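To make the cycle concrete, here is a minimal sketch in SQL under a locking (pessimistic) scheme. The Invoices and Payments tables are hypothetical, and the statements from the two sessions are shown interleaved in the order they arrive.

    -- Session Alpha
    START TRANSACTION;   -- BEGIN TRANSACTION in T-SQL
    UPDATE Invoices SET invoice_status = 'PAID' WHERE invoice_nbr = 1;     -- Alpha locks Invoices row 1

    -- Session Beta
    START TRANSACTION;
    UPDATE Payments SET payment_status = 'POSTED' WHERE payment_nbr = 9;   -- Beta locks Payments row 9

    -- Session Alpha now blocks, waiting on Beta's lock:
    UPDATE Payments SET payment_status = 'POSTED' WHERE payment_nbr = 9;

    -- Session Beta now blocks, waiting on Alpha's lock: deadlock
    UPDATE Invoices SET invoice_status = 'PAID' WHERE invoice_nbr = 1;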

The Sybase family allows an administrator, the DBA, to kill one of the processes so that the other process can get to the resource. Other schemes prioritized the processes and let the system resolve the conflicts.

ACID

ACID is a cute acronym for Atomicity, Consistency, Isolation, and Durability. This is a set of properties we want in transactions in a database. The original work was done by the late Jim Gray, but this acronym is due to Andreas Reuter and Theo Härder at IBM. These four properties as defined by Reuter and Härder are as follows:

Atomicity

Transactions are often composed of multiple statements. Atomicity guarantees that each transaction is treated as a single "unit of work", which either succeeds completely or fails completely: if the transaction fails, the database is left unchanged. But this also means that other processes in the database cannot see exactly what is happening inside a particular transaction until it completes or fails.

The ANSI/ISO Standard SQL/PSM language is the standard counterpart of Oracle's PL/SQL, SQL Server's T-SQL, Informix SPL, and a few other proprietary procedural SQL extensions. One feature that SQL/PSM has is the BEGIN ATOMIC..END; block. This statement groups everything inside it into an atomic block of code. These languages also have the usual Algol-family BEGIN..END; blocks, which are more related to flow control.
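As a minimal sketch of the idea (the procedure name and the Accounts table are invented for illustration), a SQL/PSM compound statement might look like this; either both updates take effect or neither does:

    CREATE PROCEDURE Transfer_Funds
     (IN from_acct INTEGER, IN to_acct INTEGER, IN transfer_amt DECIMAL(12,2))
    BEGIN ATOMIC
      -- both updates succeed together or fail together
      UPDATE Accounts SET acct_balance = acct_balance - transfer_amt WHERE acct_nbr = from_acct;
      UPDATE Accounts SET acct_balance = acct_balance + transfer_amt WHERE acct_nbr = to_acct;
    END;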

Consistency

Consistency means that when the transaction is done, the database has gone from one consistent state to another consistent state. Remember that SQL is not like a file system. We have CHECK() constraints (table and schema level), keys, referential integrity, cascades, triggers, data type validation and a bunch of other stuff. This is why I’ve long been telling people that the DDL is where most of the work in SQL actually gets done.

It is not easy to check consistency, and it may be the most important part of a database. If the hundreds of procedures in a real-world application all had to do all of the consistency checking themselves every time, imagine what the overhead would be like. By putting it in the DDL, we can guarantee that things are always done, done one way, one time, in one place.
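As a hedged illustration (the tables and constraint names are invented for this sketch), here is the kind of DDL that carries that consistency work for the timecard example from earlier:

    CREATE TABLE Projects
    (project_id CHAR(9) NOT NULL PRIMARY KEY,
     project_budget_amt DECIMAL(12,2) NOT NULL
       CONSTRAINT positive_budget CHECK (project_budget_amt > 0.00));

    CREATE TABLE Timecards
    (emp_id CHAR(9) NOT NULL,
     project_id CHAR(9) NOT NULL
       REFERENCES Projects (project_id)
       ON DELETE CASCADE,                -- referential integrity plus a cascade
     work_date DATE NOT NULL,
     work_hrs DECIMAL(4,2) NOT NULL
       CONSTRAINT legal_work_hrs CHECK (work_hrs BETWEEN 0.25 AND 24.00),
     PRIMARY KEY (emp_id, project_id, work_date));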

Isolation

One of the major advantages of databases over file systems is that transactions can be executed concurrently (e.g., multiple transactions reading and writing to a table at the same time). This means we need a "traffic cop" to handle them.

Broadly speaking, we have pessimistic and optimistic concurrency controls. Pessimistic control models generally use "locks" on the tables, which allow only one user at a time to work on a table. It is called pessimistic because of the assumption that there are going to be problems.
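Many products let you ask for a pessimistic lock explicitly with a SELECT..FOR UPDATE clause; a minimal sketch follows, again with a hypothetical Accounts table. The exact locking syntax varies by product (SQL Server, for example, uses table hints instead).

    -- take the lock up front and hold it until COMMIT or ROLLBACK
    START TRANSACTION;
    SELECT acct_balance
      FROM Accounts
     WHERE acct_nbr = 42
       FOR UPDATE;              -- other sessions now wait to touch this row

    UPDATE Accounts
       SET acct_balance = acct_balance - 100.00
     WHERE acct_nbr = 42;
    COMMIT WORK;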

Optimistic concurrency control usually uses versioning. The system keeps the old version of the data and the new version; the new version replaces the old one if there are no conflicts. This model comes from microfilm systems in the 1970's. Back then, documents were stored on microfilm to save space. When two users wanted the same document, you printed out a hard copy and sent one to each of them. If both users agreed on the updates, the update was made on the master microfilm. If there were any conflicts or disagreements, then a supervisor had to decide which (if either) of the updates was valid, or whether both updates could be made without conflict. Optimistic models are built on the assumption that conflict is rare and is to be handled as an exception when it is detected.
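A minimal sketch of the application-level version of this idea, assuming a hypothetical Documents table with a row_version column that the application reads along with the data:

    -- read the row and remember its version
    SELECT doc_text, row_version
      FROM Documents
     WHERE doc_id = 123;        -- suppose this returns row_version = 7

    -- write back only if nobody else changed it in the meantime
    UPDATE Documents
       SET doc_text = 'revised text',
           row_version = row_version + 1
     WHERE doc_id = 123
       AND row_version = 7;     -- zero rows updated means a conflict; handle it as an exception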

Univac briefly experimented with logical concurrency controls. The system looked at incoming transactions and analyzed them to see if they were logically independent of each other. If the transactions did not affect each other, then there was no need for locking or versioning. This was basically an automation of the old microfilm model.

Durability

Durability guarantees that once a transaction has been committed, it will remain committed even in the case of a system failure (e.g., power outage or crash). This usually means that completed transactions (or their effects) are recorded in non-volatile storage.

If you want to put the work you have been doing into persistent storage, the command is COMMIT [WORK]; I have never seen anyone actually use the optional [WORK] keyword. If you want to throw out what you have been working on, the command is ROLLBACK [WORK];.
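A minimal sketch, reusing the hypothetical Timecards table from the earlier DDL sketch; the explicit START TRANSACTION is shown for clarity, since products differ on how a transaction begins:

    START TRANSACTION;

    INSERT INTO Timecards (emp_id, project_id, work_date, work_hrs)
    VALUES ('123456789', 'PRJ-00042', DATE '2024-01-15', 8.00);

    -- keep it ...
    COMMIT WORK;

    -- ... or, had something gone wrong, throw it away instead:
    -- ROLLBACK WORK;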

SQL Server works on a model called “auto commit”, which automatically commits after each statement that would change the database. This is why the terminating semi-colon required by standard SQL and lots of other products was not needed in T-SQL.
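When you need several statements to succeed or fail as one unit in SQL Server, you override auto commit with an explicit transaction. A minimal T-SQL sketch, using the same hypothetical Accounts table:

    -- T-SQL: group the statements yourself instead of relying on auto commit
    BEGIN TRANSACTION;

    UPDATE Accounts SET acct_balance = acct_balance - 100.00 WHERE acct_nbr = 42;
    UPDATE Accounts SET acct_balance = acct_balance + 100.00 WHERE acct_nbr = 99;

    COMMIT TRANSACTION;   -- or ROLLBACK TRANSACTION on an error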

When a database job has multiple statements, doing a ROLLBACK can be quite expensive. Think about what you have to save to undo the effects of an ALTER statement, a bulk insertion, and so forth. UGH!

To help ease some of this pain, many products have a SAVEPOINT feature. This is something like a bookmark. The statement syntax is "SAVEPOINT <savepoint name>", placed after a group of other statements. When you decide that you did not want to do those statements, you use "ROLLBACK [WORK] [TO <savepoint name>]" to back out the work. It is important to remember that the work is rolled back only to the named SAVEPOINT; all the work done before the savepoint is still awaiting its final disposition.
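A minimal sketch, again with the hypothetical Timecards and Projects tables. The spelling varies a bit by product: the standard writes ROLLBACK TO SAVEPOINT <name>, some products accept ROLLBACK TO <name>, and T-SQL uses SAVE TRANSACTION instead.

    START TRANSACTION;

    INSERT INTO Timecards (emp_id, project_id, work_date, work_hrs)
    VALUES ('123456789', 'PRJ-00042', DATE '2024-01-15', 8.00);

    SAVEPOINT after_timecards;        -- the bookmark

    UPDATE Projects
       SET project_budget_amt = project_budget_amt - 400.00
     WHERE project_id = 'PRJ-00042';

    -- second thoughts about the budget change, but keep the timecard row
    ROLLBACK TO SAVEPOINT after_timecards;

    COMMIT WORK;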

In practice, users will save (n) rows of data, check to see whether that much work was okay, and then take another bite. This kind of coding depends a lot on how your particular SQL engine works, so you might want to experiment with it a little bit.

Problems in Traffic Control

The bad news is that SQL products allow sessions to interfere with each other's view of the data. You will see these situations called dirty reads, non-repeatable (non-reproducible) reads, phantom reads, and lost updates.

Assume I read a table, and the process returns the current status of that table. I do some things to that table in my session, and later in the same session I go back to it. But what if another user messes with my table during this time? Do I see the data the way I left it, or do I see the data the way the other user left it? When another session's change adds or removes rows and we see them, it is called a phantom read. When changes to data we have already read show up when we read what we hoped would be the same data again in the same session, it is called a non-repeatable read. And if one user overwrites an update made by a second user, we have a lost update.
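A hedged two-session sketch of the non-repeatable read, using the hypothetical Accounts table again; what Session One sees on its second read depends entirely on the isolation level in force:

    -- Session One
    START TRANSACTION;
    SELECT acct_balance FROM Accounts WHERE acct_nbr = 42;   -- returns, say, 500.00

    -- Session Two, meanwhile
    UPDATE Accounts SET acct_balance = 0.00 WHERE acct_nbr = 42;
    COMMIT WORK;

    -- Session One, same transaction, same query
    SELECT acct_balance FROM Accounts WHERE acct_nbr = 42;   -- 500.00 or 0.00?
    COMMIT WORK;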

Isolation Levels

Isolation levels define which of these phenomena we are willing to tolerate. The syntax is "SET TRANSACTION ISOLATION LEVEL <isolation level>;".

The classic levels from strongest to weakest are

SERIALIZABLE: the users are all separated from each other. Effectively, each user thinks he’s the only one in the database. If you’re doing some structural changes to the schema, this is what you want to use. As I said before, this is the default in the ANSI/ISO standards.

REPEATABLE READ: during this session, the user sees one view of the data until he changes it. This is usually good enough for transactions that are going to do reporting, but not a lot of updating and inserting.

READ COMMITTED: during this session, the user sees any work in the database which has been committed. This is the level for when you are actually doing updating and inserting on the database.

READ UNCOMMITTED: during this session, the user sees any work in progress in the database. And now we're back to the philosophical question as to whether or not data that was in process and rolled back without being committed ever actually existed.
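As a minimal sketch, the statements look like this; exactly when they can be issued (per session or per transaction) and which level is the default vary by product:

    -- ask for the strongest guarantee before doing structural work
    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;

    -- a reporting session can usually live with this
    SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;

    -- typical level for day-to-day updating and inserting
    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;

    -- you get to see work in progress, dirty reads and all
    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;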

Jim Melton wrote a paper many years ago on other isolation levels. These additional isolation levels can get quite tricky. What do you want to do if an update to the database effectively leaves the database the same? What if individual rows are changed, but aggregates remain the same?

Conclusion

In practice, you really need to know how your product behaves. It is impossible to simultaneously achieve high availability, consistency, and serializability. Transactions may be less likely to deadlock or abort due to conflicts under weaker isolation levels, if you can live with the consistency anomalies. To use a standard phrase when giving advice on general database programming questions: it all depends.

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on newsgroups and forums, and he is famous for his dry wit. He is also interested in Science Fiction.