
I agree with the author that the various levels of isolation, etc. within the current crop of SQL databases is a morass. I’ll point to some recent fine work by Martin Kleppmann (https://github.com/ept/hermitage) that explores the issue and shows how many systems fall short of serializable isolation in their default modes. (And sometimes in modes labeled “serializable”!) In his tests, three databases actually achieve full serializability: PostgreSQL, MS SQL Server, and FoundationDB.

But don’t give up on ACID yet! If you can actually get real serializability, you have a very powerful model that is also very simple to reason about. Serializable isolation gives you a virtual “whole database lock” that lets you modify lots of things all over the database before you “unlock” it and hand the database to the next client. The amazing thing about MVCC combined with serializable isolation is that you get to use this "simplest possible" mental model even though hundreds or thousands of concurrent clients might be all hitting the database at the same time.
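For concreteness, the "whole database lock" mental model can be sketched against SQLite, which serializes writers. This is an illustrative toy (the schema and amounts are made up, and SQLite's file lock is much coarser than PostgreSQL's SSI), but the invariant-preservation it shows is the point:

```python
# Illustrative only: SQLite serializes writers, so concurrent transfers
# behave as if each transaction held a whole-database lock.
import os
import sqlite3
import tempfile
import threading

path = os.path.join(tempfile.mkdtemp(), "bank.db")
init = sqlite3.connect(path)
init.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
init.executemany("INSERT INTO accounts VALUES (?, ?)", [(1, 100), (2, 100)])
init.commit()
init.close()

def transfer(src, dst, amount, n):
    # Each thread gets its own connection; the timeout makes a writer
    # wait for the "database lock" instead of failing immediately.
    con = sqlite3.connect(path, timeout=30, isolation_level=None)
    for _ in range(n):
        con.execute("BEGIN IMMEDIATE")  # take the write lock up front
        con.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                    (amount, src))
        con.execute("UPDATE accounts SET balance = balance + ? WHERE id = ?",
                    (amount, dst))
        con.execute("COMMIT")           # "unlock" and hand off to the next client
    con.close()

t1 = threading.Thread(target=transfer, args=(1, 2, 1, 100))
t2 = threading.Thread(target=transfer, args=(2, 1, 1, 100))
t1.start(); t2.start(); t1.join(); t2.join()

con = sqlite3.connect(path)
total = con.execute("SELECT SUM(balance) FROM accounts").fetchone()[0]
print(total)  # 200: the invariant survives concurrent transfers
```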



> The amazing thing about MVCC combined with serializable isolation is that you get to use this "simplest possible" mental model even though hundreds or thousands of concurrent clients might be all hitting the database at the same time.

I think you've proved the author's point better than he ever could. With serializable isolation, you don't have concurrent clients all hitting the same data at the same time...you've got locking/blocking. The two things you're trying to put together are mutually exclusive.

The whole point of the article is that when you're trying to write concurrently to a database, it will necessarily be complex. ACID databases attempt to sweep that complexity under the rug and create a simpler mental model. But those abstractions are leaky and bubble up to the surface in ways that are often unexpected or difficult to handle. This is especially true when you try to scale beyond a single machine...achieving serializable isolation whilst replicating between database nodes is all but impossible.

Most NoSQL databases take the approach of explicitly exposing the complexity to the application with the assumption that concurrency is an absolute requirement and the application will understand how to achieve consistency better than a generic data tier can.


> I think you've proved the author's point better than he ever could. With serializable isolation, you don't have concurrent clients all hitting the same data at the same time...you've got locking/blocking. The two things you're trying to put together are mutually exclusive.

Not so. With serializable isolation as implemented by PostgreSQL (SSI), it's perfectly possible for serializable transactions to avoid blocking/rollback on read/write conflicts as long as there's some reasonable serialization order. Some DBs (default config of SQL Server for example) will lock in this situation, but it's not a requirement of the model.

If you have lots of clients writing to the same data at the same time, yes, you will have lock waits/deadlocks. At that point, though, unless your updates are trivial (i.e. only ever hitting one lockable unit per logical operation) you've got to deal with what is likely some extremely complicated update logic all by yourself. And if your updates are trivial, well, you can probably structure your relational DB so that you only ever have lock waits, which is often okay. Sure, there are situations where it's not (highly contended counter, say), but that's certainly a small minority of situations.
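The highly-contended-counter case can be simulated in plain Python, with a lock standing in for the one lockable unit. The fix is exactly "serialize the read-modify-write on the hot item"; without the lock, interleaved read-modify-write can silently lose increments (illustrative toy, no database involved):

```python
# Illustrative: why a contended counter needs some serialization.
# Unsynchronized read-modify-write can lose updates nondeterministically;
# a lock (the degenerate "one lockable unit" case) restores correctness.
import threading

counter = {"n": 0}
lock = threading.Lock()

def bump(times):
    for _ in range(times):
        with lock:  # serialize the read-modify-write on the hot unit
            counter["n"] = counter["n"] + 1

threads = [threading.Thread(target=bump, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter["n"])  # 40000: every increment survives
```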

I would argue that if a developer finds typical ACID semantics hard to understand, they almost certainly aren't prepared for managing eventual consistency.

edit: If I might add to this, a quote from the Google F1 paper:

"We also have a lot of experience with eventual consistency systems at Google. In all such systems, we find developers spend a significant fraction of their time building extremely complex and error-prone mechanisms to cope with eventual consistency and handle data that may be out of date. We think this is an unacceptable burden to place on developers and that consistency problems should be solved at the database level. Full transactional consistency is one of the most important properties of F1."


Hmm... Serializable isolation and concurrency go together fine and certainly don't require blocking or even locking when using MVCC and optimistic concurrency. For that matter serializable transactions and scalability are not incompatible either (though few NoSQL systems have tackled the problem yet).

I think the author's main point is that sub-serializable isolation levels are confusing and can lead to wrong behavior in subtle ways. I agree, but to me it sounds like the failing is in using too weak a consistency/isolation model, not too strong a one!


> Hmm... Serializable isolation and concurrency go together fine and certainly don't require blocking or even locking when using MVCC and optimistic concurrency. For that matter serializable transactions and scalability are not incompatible either (though few NoSQL systems have tackled the problem yet).

For non-conflicting operations, I agree entirely that serializable transactions can scale just fine. In fact, just about every serializable concurrency control algorithm (e.g., two-phase locking, partitioned OCC, MVCC with clever timestamp allocation) can scale just fine for non-conflicting operations.

However, for conflicting read-write operations, serializability will incur higher costs than many alternatives (e.g., eventual consistency, weak isolation implemented in a scalable manner). Serializing access to shared data items fundamentally requires some serial execution.

This fundamental overhead is a key reason why almost every commodity RDBMS defaults to one of these weaker isolation levels and why databases like Oracle don't support serializable isolation at all. It's not that their respective database architects don't know how to implement serializability -- it's that weak isolation is faster and, in many cases, reduces deadlocks and system-induced aborts, even if it increases programmer burden.


> Serializing access ... fundamentally requires some serial execution.

This is true, but I don't think this requirement limits scalability because the serial part can be arbitrarily cheap (e.g. approve ordered batches of work).
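A minimal sketch of "approve ordered batches of work": the only serial step is a sequencer handing a contiguous run of commit positions to a whole batch, so the serial cost per transaction shrinks as batches grow. The design and names here are assumptions for illustration, not any specific system's implementation:

```python
# Sketch (assumed design): the serial bottleneck is a single cheap
# operation per batch, not per transaction.
import threading

class Sequencer:
    """Hands out a contiguous run of commit positions to an entire
    batch under one short critical section."""
    def __init__(self):
        self._lock = threading.Lock()
        self._next = 1

    def order_batch(self, txns):
        with self._lock:              # serial, but O(1) per batch
            start = self._next
            self._next += len(txns)
        # Everything after the lock can proceed in parallel per batch.
        return list(zip(range(start, start + len(txns)), txns))

seq = Sequencer()
print(seq.order_batch(["t1", "t2"]))  # [(1, 't1'), (2, 't2')]
print(seq.order_batch(["t3"]))        # [(3, 't3')]
```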

You also say that many databases could implement serializable transactions but don't because of the "higher costs" and that "weak isolation is faster". This sounds like a tradeoff to me so, of course, there will never be one right answer. Well, maybe someday for problems that permit I-confluence :)

However, the article attests to the high costs of sacrificing serializability in programmer productivity and system complexity. Those are serious downsides that need to be weighed very carefully against any actual, measured, performance advantages that are on the table.


> Serializing access to shared data items fundamentally requires some serial execution.

Sure, but typical MVCC systems that provide SI will serialize transactions in some cases anyway. For example, if two transactions try to update the same row in Postgres, the second will block until the first one commits/aborts. This is true under RC, RR, or SERIALIZABLE.

> This fundamental overhead is a key reason why almost every commodity RDBMS defaults to one of these weaker isolation levels and why databases like Oracle don't support serializable isolation at all.

I don't think that's true: if you already have an MVCC implementation that provides SI, the additional overhead of providing SSI is relatively modest. I'd say the main reason Oracle doesn't provide SERIALIZABLE is historical: their MVCC implementation predated Cahill's work by many years, and people literally didn't know how to use MVCC to provide SSI at the time.


> Serializable isolation and concurrency go together fine and certainly don't require blocking or even locking when using MVCC and optimistic concurrency

If you're willing to embed retry logic throughout your application, sure. But that's a steep price to pay if your application can handle an eventually-consistent state.


The retry logic has to be there anyway (you can't sacrifice partition tolerance).


They aren't mutually exclusive. You can't modify the same row concurrently and have serializability, but that isn't a common requirement. The common requirement is that you can have concurrent transactions operating on different pieces of data.


If you're not modifying the same data, then there's no need for full ACID...atomicity and durability (the A and D) will do just fine. The whole point of the C and I is to govern what happens when there is contention for the same data. Brushing that occurrence aside and saying it's rare is asinine...it's going to happen and programmers need to account for it. With serializable isolation, you risk your application blocking (can be acceptable) or even deadlocking (never acceptable).


That probably came across wrong. What I meant was that the ability to do concurrent access is the important thing, not the performance of it. CPUs aren't any good at contended access to memory cells, but they still allow it and it works. If you want your system to scale to many threads you need to break the sharing.

Same with databases. The important thing is that it gives you strong consistency when there is contention. It'll never be as performant as operating on different rows concurrently, though. In the common case, when you touch different rows, you don't feel the pain of locking.


I agree, and one thing that is missing from this discussion is that many document databases are atomic around a single document update. That gives the developer a very easy model to work with for a lot of things: you can update any part of a document without conflicts, where a relational system would probably require updates/reads across multiple tables. Of course it won't work for everything, but it's an easy model to understand.
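A toy version of per-document atomicity, using an assumed version-number compare-and-swap (not any particular product's API): the whole document, however many fields you touched, is swapped in one atomic step, and a stale writer is rejected rather than half-applied:

```python
# Illustrative sketch of atomic single-document updates via versioned CAS.
import threading

class DocStore:
    """Toy store: each update replaces the whole document atomically,
    and a version number rejects writers working from a stale copy."""
    def __init__(self):
        self._lock = threading.Lock()
        self._docs = {}  # id -> (version, doc)

    def get(self, doc_id):
        with self._lock:
            return self._docs.get(doc_id, (0, {}))

    def put_if_version(self, doc_id, expected_version, doc):
        with self._lock:
            version, _ = self._docs.get(doc_id, (0, {}))
            if version != expected_version:
                return False  # a concurrent update won; caller retries
            self._docs[doc_id] = (version + 1, doc)
            return True

store = DocStore()
v, doc = store.get("order-1")
doc = {**doc, "status": "shipped", "items": ["book"]}  # touch several fields
print(store.put_if_version("order-1", v, doc))  # True: one atomic swap
```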


That is true - but it's commonly observed that nontrivial applications rarely manage to limit every logical operation to a single document update.


You don't need to be using serializable isolation to face blocking or deadlocks. So-called "deadlocks" really aren't that bad when the database detects them and rolls back all but one of the competing transactions. You just catch the error and retry.
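The catch-and-retry pattern looks roughly like this. The exception class here is a stand-in for whatever deadlock/serialization error your driver actually raises (e.g. SQLSTATE 40001 or 40P01 in PostgreSQL); the backoff numbers are arbitrary:

```python
# Sketch of "catch the error and retry": rerun the whole transaction
# function from the top with jittered exponential backoff.
import random
import time

class SerializationFailure(Exception):
    """Hypothetical stand-in for a driver's deadlock/serialization error."""

def run_with_retry(txn, attempts=5):
    for attempt in range(attempts):
        try:
            return txn()  # the transaction function must be safe to rerun
        except SerializationFailure:
            if attempt == attempts - 1:
                raise
            time.sleep(random.uniform(0, 0.01 * 2 ** attempt))  # jittered backoff

calls = {"n": 0}
def flaky_txn():
    calls["n"] += 1
    if calls["n"] < 3:
        raise SerializationFailure  # lost the deadlock coin-toss twice
    return "committed"

print(run_with_retry(flaky_txn))  # committed
```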


serializability != one giant lock around your database.

concurrency and serializability go together great. it's only if transactions actually read/write or write/write the same data that they are done one at a time (and even that's not exactly true, depending on the concurrency control algorithms). MVCC often provides better performance because read-only transactions will not block writers.
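a toy multi-version store makes the read-only case concrete: readers pin a snapshot timestamp and writers just append new versions, so neither waits for the other (illustrative sketch, not a real engine; real MVCC adds garbage collection, write-write conflict handling, etc.):

```python
# Illustrative MVCC sketch: reads are against a snapshot, so read-only
# transactions never block writers and never see a writer's in-flight state.
class MVCCStore:
    def __init__(self):
        self._versions = {}  # key -> list of (commit_ts, value)
        self._ts = 0         # highest committed timestamp

    def begin_snapshot(self):
        return self._ts      # a reader's snapshot = commits so far

    def read(self, key, snapshot_ts):
        # Newest version committed at or before the snapshot.
        for ts, value in reversed(self._versions.get(key, [])):
            if ts <= snapshot_ts:
                return value
        return None

    def write(self, key, value):
        self._ts += 1        # append a new version; old readers unaffected
        self._versions.setdefault(key, []).append((self._ts, value))

store = MVCCStore()
store.write("x", "v1")
snap = store.begin_snapshot()  # reader starts here
store.write("x", "v2")         # writer commits concurrently
print(store.read("x", snap))                    # v1: the snapshot is stable
print(store.read("x", store.begin_snapshot()))  # v2: new readers see the commit
```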

and achieving serializability across database nodes is DEFINITELY NOT impossible.



