Ninja Immutable Databases

'Immutable' databases operate under the principle that data or objects should not be modified after they are created. Once again, they hold the promise of strong consistency combined with horizontal read scalability and built-in caching. Are immutable databases a new idea? Are they different in any way from the mainstream RDBMSs?

Beware of bugs in the above code; I have only proved it correct, not tried it.
— Donald Knuth/1977

“Regression testing”? What’s that? If it compiles, it is good; if it boots up, it is perfect.
— Linus Torvalds/1998

Immutability has been the paradigm of the month for the last few years now, even to the extent that some wonder whether the paradigm has advantages for Relational Database Management Systems (RDBMS) and, of course, for the non-relational sorts of databases.

Let’s start with a bit of fabulist narrative. (In this text, as in all that I produce, words mean what I say they mean; most significantly, ‘coder’ means one who writes code and ‘developer’ means one who builds databases, although others often use both words for those who write code.)

“The tri-monthly meeting of the ‘Old Wine in New Bottles Society of the City of London’ will please come to order.  As per our standard agenda, we begin with readings from the quantitative scripture and proceed with the meeting’s questions to be addressed.  Forthwith with veneration: the questions for the members:”

  A) Do you have computed columns in your tables?
  B) Do you have summary tables in your schema?
  C) Do you believe that the applications’ world is the domain of coders or database developers?
  D) Do you believe that F#, Clojure, Haskell, or some other functional language will take over the world because of its foundation in immutable data?
  E) Do you believe that it’s possible to define a programming language such that: if it compiles, it runs (correctly)?
  F) Do you believe that immutability can be, and should be, the foundation of a relational database? That is, would an immutable database be the equivalent of: if it compiles, it runs? In what way would it differ from, and be superior to, current RDBMSs?

Answers: A) yes, B) yes, C) database, D) no, E) no, F) no

The questions relate to the use, misuse, and abuse of the concept of immutability in code and data, and to the influence of Flavour Of The Month (FOTM) functional languages, which offer, in some sense, immutable data and the promise of (nearly?) bug-free code from the compiler.

The answers, this being a fabulist narrative, are those chosen by a wise database developer.

The Promise

Functional languages hold out the promise that, since they are built on no side effects, local immutable data, and code that is “easy to reason about” (a phrase never defined), we can finally throw off the chains of our GUI debuggers and be free at last. Immutability of data grew out of the central issue of multi-threaded code: what to do about shared data. Just write the code, the compiler tells us what’s wrong, and a clean compile yields a program that works. What could be better? For those writing engines, it could be. For developers working with SQL, Declarative Referential Integrity (DRI), procedures, and such, the notion of an immutable database must sound …(sigh)… odd. A (relational) database is about changing data: all the time. Wouldn’t the same paradigm of immutable data yield vastly superior databases?

Well, maybe. At about the same time that Dr. Codd issued his first paper, a language that was not only functional but also logical was being developed. It was (and still is) Prolog. In fact, it was to be so spectacular that it became the basis for “Fifth Generation Computing”. It faded into oblivion. Prolog’s data structure looks a bit like relational tables, and a language called Datalog developed from it (this is the language used by Datomic, below). These days, you’d need a pack of bloodhounds to find much trace of it. But never fear; functional languages have come to the rescue. John A De Goes, in his article ‘Functional Programming Isn’t the Answer’, writes:

This code is practically correct by definition. It’s trivial to understand (if you know Haskell’s syntax), and there’s not a lot of sorting code that’s easier to maintain…

He goes on to disavow the idea that functional languages are the end of The Yellow Brick Road, but, well, maybe they are. You know.

What Problem Does an Immutable Database Solve?

In sum: what is the currently insoluble problem, the problem for which only an immutable datastore, that is, one with immutable rows in its base tables, can provide an answer? There is none. Full stop.

The hope is that immutable databases offer deliverance from the need for transaction control; that they provide from the engine what multiversion concurrency control (MCC or MVCC) brings to the session: the Wiki meme, in which anyone can write without concern for others.

Current, serious, relational databases are, and have been for decades, founded on immutable data. It’s just that their designers and developers were wise enough not to expose that immutability, in any routine fashion, to coders, developers, or users. If you’re one of those apostates who answered “coder” to question C), then any sort of database will suit, since, for you, the “database” is just a heap of bytes which you intend to manage with your code; whatever storage yields the most code is the preferred outcome.

As a senior member of the ‘Old Wine in New Bottles Society of the City of London’, Andrew Clarke has observed, “The Babylonian accounting tablets were baked accidentally, otherwise we wouldn’t even be aware of their existence. However, once values were pressed into the clay, it would have been hard to alter them.”

Alas, the senior member is somewhat mistaken. “These latter tablets were originally unfired, as they were meant to be erased and reused. Temple accounting records, on the other hand, were fired and stored for future reference.” (The Library of Congress, here.) The point being, of course, that clay tablets, in their time, were erasable (until the transaction was “committed”), not immutable at the instant of writing. Embezzlement was a supported operation, if you got there in time (sort of as it is today?). Kind of like a NAND-based SSD of today: one can write bytes continually into a block (tablet), but to *change* a byte one must erase the block (tablet) and start over. Old wine in a new bottle, once again. 5,000 years, and we’re still stuck in batch mode.

The proposal at hand, an immutable database (in some sense, append-only at the base-table level [aside: must such a database prohibit foreign keys?]), solves no known problem; in particular, it provides no superior, serial accuracy of stored data, and the underlying technique has, in any case, been standard database technology for some years.
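
To make the proposal concrete, here is a minimal sketch, in ordinary SQL, of what “append-only at the base table level” amounts to. The table, column, and view names are hypothetical; an “update” becomes the insert of a new version, and a view exposes the current state.

```sql
-- Hypothetical append-only base table: rows are only ever inserted.
CREATE TABLE account_version (
    account_id  int       NOT NULL,
    version_no  int       NOT NULL,
    balance     numeric   NOT NULL,
    recorded_at timestamp NOT NULL,
    PRIMARY KEY (account_id, version_no)
);

-- An "update" is just another version of the row.
INSERT INTO account_version VALUES (42, 1, 100.00, TIMESTAMP '2015-06-01 09:00:00');
INSERT INTO account_version VALUES (42, 2,  75.00, TIMESTAMP '2015-06-02 14:30:00');

-- The current state is a view over the latest version of each account.
CREATE VIEW account_current AS
SELECT av.account_id, av.balance, av.recorded_at
FROM   account_version av
WHERE  av.version_no = (SELECT MAX(v.version_no)
                        FROM   account_version v
                        WHERE  v.account_id = av.account_id);
```

The aside about foreign keys shows up immediately: a child table can reference a specific (account_id, version_no) pair readily enough, but there is no declarative way to say “references the current version”, which is precisely the sort of rule DRI exists to declare.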

Immutability Harnessed in the RDBMS

All serious relational databases are based on the write-ahead logging process, in which transactional data is written to a separate log store in serial/sequential fashion; once durability of that write is confirmed, the engine then writes the update to the base tables. Transactional data, per se, is immutable in this paradigm; it’s just not exposed to the user, and it does not constitute base tables. The common purpose of the log is crash recovery (see ARIES). Over the years, various database engine developers have complained, with good reason, about certain operating systems, file systems, and storage devices which lie. “… most disk drives have caches. Some are write-through while some are write-back, and the same concerns about data loss exist for write-back drive caches as for disk controller caches.” (here). A SQL Server view is here. Is immutability even possible in the face of lying hardware?
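
As one concrete illustration of that worry, here is a minimal sketch using PostgreSQL settings (chosen only because they are easy to inspect from a session; the other engines have equivalents under other names). These are the knobs that decide how far the engine trusts the OS and the drive to have made the log durable.

```sql
-- How much does the engine trust the layers beneath it? (PostgreSQL shown.)
SHOW fsync;              -- flush the write-ahead log to stable storage at commit
SHOW synchronous_commit; -- wait for that flush before reporting success to the client
SHOW wal_sync_method;    -- which OS primitive is used to request the flush

-- With fsync off, commits are fast and durability is a matter of faith. Even with
-- fsync on, a write-back cache that acknowledges writes it has not yet persisted
-- can still lose a "durable" commit: the lying hardware of the paragraph above.
```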

Databases offer two types of logging (names vary): circular and archive. The former has some maximum file space which is reused when full, while the latter saves all transactions (and assumes that sufficient file space will always be available). Circular-logged databases do impose a discipline on developers: a hanging transaction can keep the log open, making space from completed transactions unavailable and eventually exhausting the log file; a stall, freeze, or crash will be the result (depending on the engine). Archive logging is an infinite append-only database, just not directly exposed to the user (coder/developer) in standard circumstances. The problem with archive logging is that hanging transactions can go unnoticed for a very long time.

And, for what it’s worth, I’ve worked on GL-centric verticals where deletes are faux: user-actioned deletes just set a flag in the row that says “I am deleted”; of course, that does mean that the SQL and procs have to check the flag. Physical deletes were an admin task initiated by the business folks who said, “we don’t need 3 FY-ago rows; wipe ’em”. Was I working on an immutable database? Perhaps so, from the point of view of application code, all while running inside conventional SQL databases.
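
A minimal sketch of that faux-delete convention, with hypothetical table and column names; the engine is an ordinary SQL database, and the “immutability” is nothing more than a rule that the schema and the procs agree to follow.

```sql
-- Hypothetical journal table with a soft-delete flag.
CREATE TABLE journal_entry (
    entry_id     int     PRIMARY KEY,
    account_id   int     NOT NULL,
    amount       numeric NOT NULL,
    posting_date date    NOT NULL,
    is_deleted   char(1) NOT NULL DEFAULT 'N' CHECK (is_deleted IN ('N', 'Y'))
);

-- A user-actioned "delete" only sets the flag.
UPDATE journal_entry SET is_deleted = 'Y' WHERE entry_id = 1001;

-- Every query (or the view the procs use) has to remember to check the flag.
CREATE VIEW journal_entry_live AS
SELECT entry_id, account_id, amount, posting_date
FROM   journal_entry
WHERE  is_deleted = 'N';

-- The physical delete remains an admin task, run when the business says "wipe 'em".
DELETE FROM journal_entry WHERE posting_date < DATE '2012-01-01';
```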

These applications treated the accounting modules (at least) as immutable repositories of transactions, and that immutability was visible to users. How far back in time immutability held was a decision made by management; two years was typical. Some of these verticals had defined summary tables, for convenience.

The Revival of the Immutable Database

How, then, did the notion of an immutable database arise? I put it down to the FOTM functional languages. The canonical piece is “Accountants Don’t Use Erasers”. Turns out, they really do. Accountants do use erasers: closing the month often means wiping the registers/journals/ledgers for the month and posting the aggregates (sketched below). Note that the context of the piece quickly moves from RDBMS/accounting concerns to distributed/replicated datastores. The basis for that argument is that “classic” distributed RDBMS technology, federated/distributed databases built on two-phase commit, is too difficult; so let’s roll our own file-based approach. As if the semantics of distributed transactions go away. Sure. And sure, again.
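
Here is a hypothetical month-end close, to make the point concrete; the tables and columns are illustrative, not taken from any particular package. It assumes a detail table gl_detail (account_id, posting_date, debit_amount, credit_amount) and a summary table gl_summary (account_id, fiscal_month, debits, credits). The detail rows are rolled up into posted aggregates and then erased.

```sql
BEGIN;

-- Post the aggregates for the month.
INSERT INTO gl_summary (account_id, fiscal_month, debits, credits)
SELECT account_id,
       DATE '2015-06-01',
       SUM(debit_amount),
       SUM(credit_amount)
FROM   gl_detail
WHERE  posting_date >= DATE '2015-06-01'
AND    posting_date <  DATE '2015-07-01'
GROUP  BY account_id;

-- Erase the detail: June now exists only as its posted summary.
DELETE FROM gl_detail
WHERE  posting_date >= DATE '2015-06-01'
AND    posting_date <  DATE '2015-07-01';

COMMIT;
```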

There is another issue which bears discussion. If we were able to hop in the WABAC Machine back to 1977 or so, we would find proto-Oracle being founded. At this time there are the nascent System R at IBM, and Oracle. IBM goes with “traditional” locking (they should; they pretty much invented it, see the Historical Notes in chapter 7 of Gray and Reuter), while Oracle adopts Multi-Version Concurrency Control (MVCC). Decades of conflict ensue. MVCC is relevant here in that it implements a notion of immutability at the session level: each session sees an individual, unchanging outside world. Only its own changes are visible to it, and the engine does this by keeping individual histories of viewed data as of the instant the session queries. One might think of this as immutability PDA.

Multi-Version Concurrency Control

Oracle, Postgres, and (sorta, kinda, sometimes) SQL Server implement MVCC. I don’t prefer MVCC databases. Going back years, the mantra has been, “Oracle will eat your entire server”, largely in order to avoid ORA-01555, “snapshot too old”. The reason is that MVCC concurrency requires this soupçon of immutability at the client-session level, and consumes prodigious amounts of memory and file storage to provide it. Ditto Postgres, although each uses a different method to store the snapshot data: Oracle uses separate files, while Postgres keeps the required old rows in the base table. Thus, having your cake and eating it too is expensive. But it does support lazy transaction scoping.
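
The Postgres flavour of that cost is easy to see from the catalog. A minimal sketch, reusing the hypothetical journal_entry table from the earlier example: every UPDATE leaves the superseded row version behind in the base table, and VACUUM can only reclaim versions that no open snapshot might still need.

```sql
-- Each UPDATE writes a new row version; the old one remains as a "dead tuple".
UPDATE journal_entry SET amount = amount WHERE entry_id = 1001;

-- How many superseded versions is the base table carrying?
SELECT relname, n_live_tup, n_dead_tup
FROM   pg_stat_user_tables
WHERE  relname = 'journal_entry';

-- VACUUM reclaims dead versions, but only those no open snapshot can still see;
-- a long-running transaction keeps them pinned, and the table bloats accordingly.
VACUUM journal_entry;
```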

The opium of the MVCC smokers is that, since readers can operate without conflict with writers (at the instant of the read), conflict disappears. Of course it doesn’t; it just gets postponed.

Oddly, the constant meme within the coding community is “fail fast”, which is what traditional row-level locking gives you. With MVCC, collision/failure happens after the reader has decided to update a row, and then application code must take over. In the real world, and in databases as well, the rule is that the last writer wins. So long as the database constraints aren’t violated, that’s the way the world should work.

With MVCC, if the reader changes columnA based on the value of columnB, but columnB was changed by some writer after the read (to a value which makes columnA’s new value inconsistent in the real world), what is to be done? If the invalid value in columnA is seen by a constraint, then the engine will abort the update. Otherwise, the application code has to check that columnA and columnB still make sense, and that involves another round trip to the database.
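
A contrived two-session sketch of that hazard, with hypothetical names, under the read-committed default common to MVCC engines. The point is that nothing fails unless a constraint happens to see the inconsistency.

```sql
CREATE TABLE orders (
    id         int     PRIMARY KEY,
    quantity   int     NOT NULL,   -- "columnB"
    unit_price numeric NOT NULL,
    total      numeric NOT NULL    -- "columnA", derived from quantity and unit_price
    -- , CHECK (total = quantity * unit_price)  -- with this, the engine aborts the bad write
);
INSERT INTO orders VALUES (1, 10, 10.00, 100.00);

-- Session A reads, intending to recompute the total.
BEGIN;
SELECT quantity, unit_price FROM orders WHERE id = 1;   -- sees 10 and 10.00

-- Session B changes quantity and commits in the meantime.
BEGIN;
UPDATE orders SET quantity = 2 WHERE id = 1;
COMMIT;

-- Session A writes a total computed from its stale read, and commits happily.
UPDATE orders SET total = 10 * 10.00 WHERE id = 1;
COMMIT;

-- Result: quantity = 2 but total = 100.00. Nothing aborted, so the application
-- must re-read and verify: the extra round trip complained about above.
```

With traditional pessimistic locking the same sequence fails fast: had session A issued SELECT … FOR UPDATE, session B would have blocked (or errored immediately with NOWAIT), and the conflict would have surfaced at the read rather than after the write.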

Under the usual circumstances, the bad value of columnA will win. Now, if application databases were specified to Organic Normal Form™, then intra-row check constraints wouldn’t be needed, of course. The current crop of NoSQL-style “databases” are flat-file repositories. Yuck. In sum, then, the local immutability provided by MVCC supplies little more than a false sense of security. Should we expect real security if the entire datastore is based on immutability?

The big three immutable-style databases are Datomic, RethinkDB, and CouchDB. Baron Schwartz wrote this last December, and he was reading my mind. Further, here are some nice diagrams showing how append-only B-trees actually work. Looks suspiciously like IMS, to me. In a nutshell, none of the three is now an immutable datastore in any practical way. Surprise. All that transaction stuff that really smart folks have been grappling with over the last 35 years? Replacing it with a Google Summer of Code project doesn’t work out so well. You can either have data under transactional control, or you can have anarchy; take your pick. Mutability of storage doesn’t have any real impact on that semantic.

To Conclude …

To return: what problem is an immutable database proposing to cure? Is it Consistency, Availability and Partition-tolerance (CAP)? Is it a two-phase commit work-around? Is it merely a notion of data integrity?

Is CAP truly relevant? CAP is an issue particular to distributed datastores not under the control of a Transaction Processing Manager (TPM), which is to say, file storage for coders. Brewer himself admits to error. What about two-phase commit? DB2 implemented distributed transactions (federation) by 1993; Oracle, SQL Server, and even Postgres now support them in varying ways. One need not go down the path of file storage.
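
For what it’s worth, the two-phase-commit surface is not exotic. A minimal sketch using the PostgreSQL statements (real syntax, gated by max_prepared_transactions > 0), with a hypothetical table; the external transaction manager that would normally coordinate several participants is left to the imagination.

```sql
CREATE TABLE ledger (id int PRIMARY KEY, amount numeric NOT NULL);

-- Phase one: the work is done and durably prepared, but not yet visible to others.
BEGIN;
INSERT INTO ledger VALUES (1, 42.00);
PREPARE TRANSACTION 'xfer-42';

-- Phase two: the coordinator, having heard "prepared" from every participant,
-- tells each one to commit; on any failure it would instead roll back.
COMMIT PREPARED 'xfer-42';
-- ROLLBACK PREPARED 'xfer-42';  -- the other outcome
```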

Does it make sense to have an engine for an immutable database? I think not. Developers are fully capable, with existing RDBMSs, of implementing immutability of data in whatever tables justify it; they’ve been doing so for decades. As we’ve seen, the MVCC resource burden of providing session-level immutability hasn’t been eliminated by contemporary hardware, and the current immutable databases have had to admit the impracticality of universal immutability of table data.
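
One minimal way to get exactly as much immutability as a table justifies out of an ordinary RDBMS is plain old privileges; the table and role names below are hypothetical.

```sql
-- Application roles may append and read history, but never change it.
CREATE TABLE audit_log (
    log_id      int          PRIMARY KEY,
    logged_at   timestamp    NOT NULL,
    actor       varchar(64)  NOT NULL,
    description varchar(200) NOT NULL
);

REVOKE ALL ON audit_log FROM app_role;
GRANT INSERT, SELECT ON audit_log TO app_role;

-- For app_role, UPDATE and DELETE now fail with a permissions error. Purging old
-- rows remains a deliberate act by the owner, much like the GL verticals above.
```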

When I was an undergraduate, there was a popular economics book, “TANSTAAFL”, which posited that the free lunch wasn’t really free. The promised free lunch of immutable databases isn’t free, and it isn’t new. Is it nourishing? Who knows: the proof of the pudding is in the eating.

References