‘Why is this so hard?’ is a recurring theme of my RabbitMQ consulting at the moment. If a database gets a split brain, I just ask it to reconcile, and it mostly works. Why can’t RabbitMQ do that?
Here is an attempt to explain that in fairly concrete terms.
Imagine a simple system where a message corresponds to a CRUD operation on a database row. The messages must be delivered in order: if two updates for the same row are re-ordered, for example, then that row will end up with the wrong value.
The message server doesn’t understand the content of messages, so as far as it’s concerned, re-ordering is never allowed: it might be dangerous. This means the message server considers queues on either side of a partition consistent only if they contain exactly the same messages in exactly the same order, and an inconsistent queue has no value. RabbitMQ doesn’t even special-case consistency: you still reset the nodes on one side of the partition, because in that case you are just deleting identical copies.
Our simple application has only one queue. After a partition, the two versions of the queue might be consistent, in which case you keep 100% of the data. Otherwise you get to pick one of the versions. That’s pretty useless: which messages are missing?
In contrast, a database returning after a partition can inspect the transaction log on each side of the partition. It calculates the set of data read and written by each transaction log, and then the intersection of these two sets. That intersection might well be empty. If it isn’t, that data is locked, and any query which would return it fails until it’s resolved manually. Actually we can do better than that in various ways I’m leaving out of scope, but this is sufficient for our CRUD example. Databases will also do row-level conflict resolution for you at the cost of lower transaction resolution, which you can use if you are brave/foolhardy. This is actually safe for CRUD too. Obviously CRUD itself is rarely safe or useful: it’s just an example I don’t need to explain.
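That reconciliation step can be sketched in a few lines. This is a hypothetical illustration, not any real database’s recovery code: I’m assuming each log entry records the operation and the row id it touched, and that anything touched on both sides gets locked for manual resolution.

```python
def rows_touched(transaction_log):
    """Collect every row id read or written by a transaction log.

    Each entry is assumed to look like ("read" | "write", row_id, ...).
    """
    return {entry[1] for entry in transaction_log}

def conflicting_rows(log_a, log_b):
    """Rows touched on both sides of the partition. These are locked
    for manual resolution; everything else merges cleanly."""
    return rows_touched(log_a) & rows_touched(log_b)

# Example: two sides of a partition that overlap only on row 7.
side_a = [("write", 7, "v1"), ("write", 3, "x")]
side_b = [("read", 7), ("write", 9, "y")]
print(conflicting_rows(side_a, side_b))  # {7}
```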
This is possible because the database understands the subject of each transaction: the subject is the set of rows read and written by the transaction. In our CRUD application, that’s always exactly one row.
The result is an available, partition tolerant system, which after partition can divide itself into the good part (the consistent part) and the bad part, and the good part keeps going (although it needs a bit of down time to make this transition). How useful this is depends on the relative size of the two parts, but generally it’s very useful.
What if the message server understood the ‘subject’ of the message too?
The message server would be able to tackle merging partitioned queues like this: the queue can be divided up into many distinct queues, one for each subject. We can then check the consistency of these per-subject queues. All the consistent queues can then be merged into the new consistent queue and delivered. The remaining messages are set aside for manual processing, and their subjects are blacklisted: further messages with those subjects are also set aside for manual processing while the subject remains blacklisted.
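The merge step above can be sketched as follows. This is only an illustration of the idea, assuming each message is a (subject, payload) pair and that a subject is consistent when both sides saw exactly the same messages in the same order (a subject one side never saw is trivially consistent):

```python
from collections import defaultdict

def split_by_subject(queue):
    """Divide a queue into per-subject sub-queues, preserving order."""
    subqueues = defaultdict(list)
    for subject, payload in queue:
        subqueues[subject].append(payload)
    return subqueues

def merge_partitioned_queues(side_a, side_b):
    """Return (deliverable messages, blacklisted subjects)."""
    a, b = split_by_subject(side_a), split_by_subject(side_b)
    deliverable, blacklist = [], set()
    for subject in set(a) | set(b):
        if not a[subject] or not b[subject] or a[subject] == b[subject]:
            # Consistent: merge this sub-queue into the new queue.
            deliverable += [(subject, p) for p in (a[subject] or b[subject])]
        else:
            # Inconsistent: set aside for manual processing.
            blacklist.add(subject)
    return deliverable, blacklist
```

In our CRUD example the subject would be the row id, so two sides that disagree about one row only block that row’s messages, not the whole queue.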
Vector clocks! I hear someone cry. Well, this requires some sort of consensus mechanism. I keep encountering sites that have two data centres separated by a WAN, which rules that out. Anyway, even if that’s not the problem, you still actually need to implement it. This system places roughly the same burden on the implementer as using a database.
Anyway, implementing the above is probably something that can be achieved without modifying RabbitMQ or the application, by writing a merge tool. By disabling publication while it runs, you can keep the implementation fairly simple. Consumers will still need to be idempotent, because the two queues may disagree about where the head of the queue is. A RabbitMQ extension could manage access to the queues automatically during recovery, which might allow less downtime, but I’m not sure how possible it is.
An alternative approach is to try to design a CRDT (conflict-free replicated data type). Then you can do something much simpler: construct a new queue by concatenating the two queues from either side of the partition, optionally de-duplicating. Take care to prove you really have a CRDT: it’s not that onerous. You must define a function that merges two records from different nodes, and that function must be associative, commutative, and idempotent. Even if you are going to implement your CRDT using operations, the function must still exist.
Even in our CRUD example, you could attach a time-based UUID to each operation, and have the merge function choose the later update. Last write wins, in other words. Of course it would sometimes defy the principle of least astonishment, occasionally leaving the record with an unexpected value, just like re-ordering, or row-based merging of a database after a partition.
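A last-write-wins merge is small enough to write down, along with checks of the three CRDT properties. This is a minimal sketch, assuming a version 1 (time-based) UUID supplies the timestamp; `max` over (timestamp, value) pairs is associative, commutative, and idempotent, which is what makes it a valid merge function:

```python
import uuid

def stamp(value):
    """Attach a time-based UUID's timestamp to an operation's value."""
    return (uuid.uuid1().time, value)

def merge(a, b):
    """Last write wins: keep the record with the later timestamp."""
    return max(a, b)

r1, r2 = stamp("update phone"), stamp("delete row")
assert merge(r1, r2) == merge(r2, r1)                         # commutative
assert merge(r1, merge(r1, r2)) == merge(r1, r2)              # idempotent
assert merge(merge(r1, r2), r1) == merge(r1, merge(r2, r1))   # associative
```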
It’s often possible to apply a policy or heuristic to updates which occur close together, or just keep duplicate data. For example, a customer might provide a phone number on your website at the same time as a third party provides you with the customer’s phone number. In this case, keep both phone numbers, as long as they are different.
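The keep-both policy is a CRDT too, and an even simpler one. A sketch, assuming each side holds the customer’s phone numbers as a set: set union is associative, commutative, and idempotent, and de-duplication falls out for free.

```python
def merge_phones(a, b):
    """Keep both phone numbers when they differ; duplicates collapse
    automatically because set union is idempotent."""
    return a | b

# One number arrived via the website, one via a third party.
website = {"+44 20 7946 0000"}
third_party = {"+44 20 7946 0000", "+44 161 496 0000"}
print(merge_phones(website, third_party))
```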
This is where you start to get a pay-off: the merge function deals with distributed data in general – where data arrives from multiple sources in real time – rather than just dealing with partitions in your own system. Once you need this merge function, getting it to fall back on time-based UUIDs is a trivial complication. In a customer-service-based company, even data arriving from a single customer arrives asynchronously: from mobile apps as they go on and off air, by email, over the phone…
Obviously as you extend your merge function, proving you still have a CRDT is an ongoing cost, so it’s important to keep it simple: just a few well known patterns. Otherwise, unless you have very high performance requirements, three data centres and a consensus based system with strong serialisability will save you a lot of development time.