RabbitMQ throughput: assigning blame

January 31, 2016

Recently I looked at Erlang's instrumentation API to see what I could figure out about allocation performance. There's actually a lot of information available via erlang:system_info:

 (rabbit@mrclumsy)2> erlang:system_info({allocator,binary_alloc}).

It’s far too long to include here, but it at least confirms that binary_alloc is a best-fit allocator. Erlang also promises that coalescing is eager and O(1): see erts_alloc.

It’s not easy to see how this could result in really pathological behaviour, but let’s try varying the message size randomly anyway: it’s just something a test should do.

I left my random message size PerfTest running all night, with 2 consumers:

$ sh runjava.sh com.rabbitmq.examples.PerfTest -c 4000 -f persistent -s 4096 -q 200 -y 2

The test is aiming for 4K messages/s. The situation in the morning is unexpected.

Consumer 1 looks like this:


The second queue has 36 million messages in it:


The consumers are identical: they are Java threads running in the same Java process, so it seems likely the culprit is RabbitMQ. The RabbitMQ version is 3.2.4, which is quite old, but flow control problems don’t often turn out to be Rabbit’s fault, and I’m pretty sure it is this time, so it’s a chance to play with some diagnostic tools.

What is really weird is that this problem only applies to the 1 or 2 consumer cases: once I start 3 or 4 consumers, the problem goes away.

I’m going to look at this with Wireshark. Wireshark understands AMQP, so it’s pretty useful. I’m using version 2, which I’ve had to build from source, because Ubuntu doesn’t have it yet. Version 1 doesn’t notice all AMQP deliveries, so it’s not much use for statistical analysis.

Start Wireshark before you start the test. Wait until the problem establishes itself (a few seconds), then look at the consumer list for each queue and note down the port numbers. In my case they are 33432 (the good one) and 33433 (the bad one). You can filter in Wireshark with ‘tcp.port==33433’, or ‘tcp.port==33433 and amqp’ to see just the AMQP traffic (we will probably need to look at TCP acknowledgements at some point).

There is an obvious difference: on the ‘good’ connection, delivers and acks are mostly interleaved, while on the bad connection there are long runs of acks and long runs of deliveries.

I’m going to try to match acks to subsequent deliveries and graph the resulting latency. First, export from Wireshark: filter so you see only one stream, then File -> Export Packet Dissections -> As Plain Text. In the dialog, choose ‘Displayed’ rather than ‘Captured’, and uncheck ‘Details’. Save one file for each stream. Here’s a little Python script, clatency.py, to process the log and produce some plot points for latency and deficit:
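The original clatency.py isn't reproduced here, but a minimal sketch of the idea might look like the following. Everything about the input format is an assumption: it expects each line of the plain-text export to start with a packet number and a relative timestamp, and to name the AMQP method (Basic.Deliver / Basic.Ack) somewhere in the Info column. The output columns are arranged so that 2:3 gives time vs latency and 2:5 gives time vs deficit; column 4 is just a running delivery count.

```python
import re
import sys

# Assumed layout of a Wireshark plain-text summary line:
#   <no.> <time> <src> -> <dst> AMQP <len> Basic.Deliver ...
# Adjust the regex if your Wireshark version prints something different.
LINE = re.compile(r'^\s*(\d+)\s+([\d.]+)\s+.*\b(Basic\.Deliver|Basic\.Ack)\b')

def process(lines):
    """Yield (packet, time, latency, delivers, deficit) rows.

    latency: time from the oldest unmatched ack to this delivery, on the
             assumption that each ack frees a prefetch slot and so
             'causes' a subsequent delivery.
    deficit: deliveries seen minus acks seen, i.e. messages in flight.
    """
    pending_acks = []          # timestamps of acks not yet matched
    delivers = acks = 0
    for line in lines:
        m = LINE.match(line)
        if not m:
            continue
        packet, t, method = int(m.group(1)), float(m.group(2)), m.group(3)
        if method == 'Basic.Ack':
            acks += 1
            pending_acks.append(t)
        else:
            delivers += 1
            latency = t - pending_acks.pop(0) if pending_acks else 0.0
            yield packet, t, latency, delivers, delivers - acks

if __name__ == '__main__' and len(sys.argv) > 1:
    for row in process(open(sys.argv[1])):
        print('\t'.join(str(v) for v in row))
```

Whether an ack really "causes" the next delivery is itself a guess about how the prefetch window behaves; it's good enough to make the two connections visibly different.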

I saved my files as consumer-2.1.txt and consumer-2.2.txt. Process them like this:

    $ python clatency.py consumer-2.1.txt > consumer-2.1.tsv
    $ python clatency.py consumer-2.2.txt > consumer-2.2.tsv

I’m using ctioga2 to generate some graphs:

    $ ctioga2 --text-separator "\t" -x "time (s)" -y "deficit" consumer-2.1.tsv@'2:5' consumer-2.2.tsv@'2:5' --png 400x300 -o deficit
    $ ctioga2 --text-separator "\t" -x "time (s)" -y "latency (s)" consumer-2.1.tsv@'2:3' consumer-2.2.tsv@'2:3' --png 400x300 -o latency

Here’s the result:


I guess I haven’t quite run the test long enough to show the ‘good’ connection is in a steady state, but after running the first test I’m pretty sure it is. The second graph is more or less a proxy for queue size, because the publish rate is constant and the queue never drains completely.

The latency graph is interesting: look how the slow connection oscillates, as well as sharing the big spikes with the healthy connection. What global event causes these big spikes? It’s worth pointing out the latency spikes and the jumps in deficit don’t line up. They could be the JVM gc, I guess.

This still doesn’t tell me who is responsible: under TCP flow control, either party can be ready to send and yet not send. Very occasionally, though, a single packet contains multiple deliveries or acks, and that almost certainly indicates flow control.
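That observation can be checked mechanically. A rough sketch, again assuming Wireshark's plain-text export names every AMQP method in each packet's summary line:

```python
import re
from collections import Counter

METHOD = re.compile(r'Basic\.(?:Deliver|Ack)')

def coalesced(lines):
    """Histogram of AMQP deliveries/acks per packet. Buckets above 1
    are packets where several methods were coalesced into one TCP
    segment -- the suspected flow-control signature."""
    hist = Counter()
    for line in lines:
        n = len(METHOD.findall(line))
        if n:
            hist[n] += 1
    return hist
```

Run over the same exported text files, a healthy connection should have nearly all of its mass in the 1 bucket.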

The other thing is this: there’s a sort of typo in the command above. Where I meant to set the producer rate to 4000, I actually set the confirm pool very high: the producer pauses when the number of unconfirmed messages exceeds this limit. If I actually limit the rate, lower the confirm pool to 200, or disable publisher confirms, the problem goes away. So it seems there’s a scheduling bug you can tickle by never starving RabbitMQ’s connection process.
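To be explicit about the mechanism: a confirm pool of N amounts to a bounded window of unconfirmed publishes. A toy single-threaded model (no real broker; the class and its names are mine, not PerfTest's):

```python
import threading

class ConfirmWindow:
    """Toy model of a publisher-confirm window: publish() blocks once
    the number of unconfirmed messages reaches the pool size, and each
    broker confirm frees a slot. With the pool set to 4000 and the
    broker keeping up, the window is effectively never full, so the
    publisher never pauses -- the connection process is never starved."""
    def __init__(self, pool_size):
        self._slots = threading.Semaphore(pool_size)
        self.unconfirmed = 0

    def publish(self):
        self._slots.acquire()   # blocks while the window is full
        self.unconfirmed += 1

    def confirm(self):
        self.unconfirmed -= 1
        self._slots.release()
```

With a small pool, every burst of publishes quickly hits the window and the publisher sleeps until confirms arrive, which is exactly the pause that (apparently) lets the scheduler behave.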

Happily, an upgrade also fixes the problem, and seems to roughly double persistence performance: I’m getting 9K messages/s and 36MB/s of throughput on my commodity SSD. Here’s a consumer graph with the latest RabbitMQ:


This investigation was really meant to result in ‘Bloated Rabbit part 2’, but if I run the test I originally intended, it doesn’t seem to upset Erlang’s binary allocator in the slightest.

It’s much more promising to use the data set from a rare case where RabbitMQ is to blame for a flow control issue to find a signature in network traffic that might identify other similar problems. Before I can do that, I need to track whether the TCP connection itself is idle (e.g. when the transmit window contracts). There must be some prior art for that, but I’ve yet to find it.
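One crude starting point: zero-window periods are the unambiguous case, where the receiver explicitly tells the sender to stop. Assuming a field export along the lines of tshark -r capture.pcap -Y 'tcp.port==33433' -T fields -e frame.time_relative -e tcp.window_size_value (field names vary between tshark versions, and the value is pre-scaling), the spans could be picked out like this:

```python
def zero_window_spans(samples):
    """Given (time, advertised_window) pairs in time order, return
    (start, end) spans during which the window was zero. end is None
    if the capture finished inside a span. A contracting-but-nonzero
    window would need a fancier test than this."""
    spans, start = [], None
    for t, window in samples:
        if window == 0 and start is None:
            start = t
        elif window != 0 and start is not None:
            spans.append((start, t))
            start = None
    if start is not None:
        spans.append((start, None))
    return spans
```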


