One of the foundations of the internet is the end-to-end principle as described by Saltzer, J. H., D. P. Reed, and D. D. Clark (1981) in End-to-End Arguments in System Design.
This pretty much says queue mirroring as a reliability mechanism is a waste of time. You might argue it’s time the RabbitMQ team has spent for you, but that’s not really true: using queue mirroring requires you to design a safe network environment for it to run in, and diligently train your operations staff not to take certain catastrophic courses of action when trouble shooting.
An uncontroversial counter argument is that the originator of a transaction cannot be available to participate in a re-try: I.e. it’s not connected to the internet for the duration of the transaction. That’s a rare scenario these days, and getting rarer.
A counter argument that’s quite often used is that it will complicate the application, and then the developer will make an error. Let’s consider that in the context of a RabbitMQ application where there is exactly one recipient:
We are going to compare using queue mirroring with durable queues to using end to end acknowledgement and transient resources.
First I’m going to point out what you need to do to achieve reliable delivery using queue mirroring.
When an application publishes a message using AMQP 0.9, there is no explicit acknowledgement of the receipt of the message. The closest you get to this is if you close a connection without error, all the messages published during the connection are implicitly acknowledged. A client application could do something very complicated to leverage this: taking great care to capture negative acknowledgements like connection exceptions and additionally cycling through connections periodically, committing transactions which were published during the connection, unless they were previously explicitly NACKed. How did the designers of AMQP 0.9 get this so wrong? Even then, this does not guarantee a message has been mirrored (or even persisted, or reached a queue on another node in the cluster). Thankfully, RabbitMQ adds publish ACKs. Publish ACKs simplify the implementation of transactions which send messages – allowing a transaction to be committed on receipt of the Publish ACK. Careful though: committing a transaction is generally a blocking operation, so you need to delegate it to a worker (or however you deal with that in your platform). A publish ACK is not issued until mirroring is complete.
Wait a second! what if something stops the transaction commit? We will get to that later…
There needs to be a map between message Ids and transactions, and this map needs to be invalidated if the connection drops.
Just to summarise – reliable implementations must use publisher acknowledgements, which require the implementation of a callback which completes transactions.
What about the receiving end? The consumer doesn’t send an ACK until it commits a transaction. OK? Not quite: we might fail to send the ACK. Also, if the publisher doesn’t get or process the Publish ACK, it might re-send the message. Either way, we might get the same message twice, so we must code for repeated deliveries – I.e. be idempotent.
Instead of using publish ACK, my publisher creates an acknowledgement queue, and sets reply to on it’s outgoing messages. The receiver no longer sends an ACK – we use NO-ACK: it simply sends a reply to the publisher. The publisher consumes from it’s acknowledgement queue, and on receiving messages, completes transactions.
Notice how the contracts for the producer and consumer are unchanged from using ACK: There’s an acknowledgement function, which sends a message, rather than calling the basic ack function. This function does not need special code to handle connection drops any more – it can use any connection to acknowledge. There is an acknowledgement callback for the publisher, and this is simpler too – it doesn’t have to keep track of the relationship between message id’s and transactions.
Also notice how in this scheme it’s not necessary to do anything but reconnect, and re-create any transient resources when the connection drops: the acknowledgement pipeline is unaffected, even if it uses transient resources: Re-tries will result in acknowledgements regardless. The fact that you can use transient resources means you can use clustering without mirroring to achieve high availability, and avoid most of the pitfalls.
So in conclusion: it’s actually easier to write a reliable application using re-tries than it is to use RabbitMQs reliability mechanisms. It’s much more reliable, and has simpler operational procedures.
Next year I’ll cover the possibility of multiple recipients, and some scenarios in which it’s equally simple to implement reliability without queue mirroring.