RabbitMQ’s new timestamp-tracking feature

By: on August 24, 2016

‘Rest assured our service uses a queue with guaranteed delivery, meaning that your message will always get through. Maybe not this year, but some time, definitely.’ Not quite sold? To help address the requirement for more practical service-level guarantees, we added a feature to RabbitMQ 3.6 called timestamp tracking. Tracking messages via their age has proven to be a simple and effective means of monitoring services end-to-end via the queuing infrastructure. Unlike commonly-monitored statistics such as queue depth, it provide a clear and unambiguous indication of downstream processing problems.

End-to-end monitoring

In my year or so looking after RabbitMQ in a bank we never had a problem with RabbitMQ itself. There were a number of problems with attached applications though, and these would usually manifest themselves as consumers getting stuck, looping or simply falling behind. The queue backlog, and other statistics such as nacking rates and the number of connected consumers, became our means of detecting problems in downstream systems.

But we found it difficult to pin down suitable thresholds: how many backlogged messages are too many? 10,000? 5? We had one application that would regularly neglect to process its one message a day, another that slowly worked its way through a batch of a million or so set up for it at the opening of business. Spotting anomalies here was more like trend analysis – inevitably any simple scheme was prone to false alarms.

What we needed to do was to take a step back and recognise that there’s really only one key performance indicator: service time. Everything else – the number of messages, connections, nacks etc. – is just circumstantial evidence. If a consumer is down for a while, or even rejects a message repeatedly, that may be fine – as long as it recovers and services the message within its time limit.

Service time tracking with head_message_timestamp

Enabling service time tracking is actually quite simple: all that’s needed is for Rabbit to track the oldest message in the queue. And since it’s normally FIFO, this is just the timestamp of the first message. We therefore decided to make this value – if present – visible as another queue statistic. It’s called head_message_timestamp and appears as the raw Unix epoch value.

One subtle feature is that, where messages are being acknowledged, merely consuming the head message doesn’t make it disappear – it has to be processed and acknowledged successfully. This ensures that problems aren’t hidden by consumers taking and then ultimately rejecting messages. (I came across this snag when building something similar using JMS’s queue browser).

Setting this value in clients is also straightforward, except maybe in C# where getting a Unix timestamp is a bit involved:

    var message = "hello";
    var props = channel.CreateBasicProperties();
    props.ContentType = "text/plain";
    props.DeliveryMode = 2;
    var time = DateTime.UtcNow;
    props.Timestamp = new AmqpTimestamp(Convert.ToInt64((time.Subtract(new DateTime(1970, 1, 1))).TotalSeconds));
    channel.BasicPublish("orders", "", props, message);
    Console.WriteLine("New Order: {0} at time {1}", order, time.ToString("f"));

Alternatively, there is a now a timestamping plugin that will automatically add the property to messages as they are published.

Alerting

The head_message_timestamp statistic provides the raw material for generating alerts via a monitoring system. A simple monitor could be a small shell script that polls the queue statistics:

#!/bin/bash
MONITORING_INTERVAL=15
MAX_MESSAGE_WAIT_TIME=120
ADMIN_EMAIL=

while :
do
    sudo rabbitmqctl -q list_queues name head_message_timestamp | while read QNAME QHEAD_MESSAGE_TIMESTAMP
    do
        if [ -z $QHEAD_MESSAGE_TIMESTAMP ]
        then
            echo "Queue $QNAME: head message has no timestamp"
        else
            echo "Queue $QNAME: head message timestamp `date -d @$QHEAD_MESSAGE_TIMESTAMP`"
            if (( `date +%s` > QHEAD_MESSAGE_TIMESTAMP + MAX_MESSAGE_WAIT_TIME )); then
                MSG="Queue $QNAME: message waiting since `date -d @$QHEAD_MESSAGE_TIMESTAMP`"
                echo "===> Alert: $MSG"
                if [ -n "$ADMIN_EMAIL" ]; then
                    mail -s "RabbitMQ Consumer Alert: $MSG" $ADMIN_EMAIL
                fi
            fi
        fi
    done
    sleep $MONITORING_INTERVAL
done

To try this out, I’ve created an exchange Orders and queues Orders.Billing and Orders.Fulfilment, then set a message using a timestamp value given by date +%s.

rmq2

rmq3

Running the script of course produces the normal output, and after a while, the alerts will appear:

$ ./monitor.sh
Queue Orders.Billing: head message timestamp Wed Aug 24 14:48:21 BST 2016
Queue Orders.Fulfilment: head message timestamp Wed Aug 24 14:48:21 BST 2016

...

Queue Orders.Billing: head message timestamp Wed Aug 24 14:48:21 BST 2016
===> Alert: Queue Orders.Billing: message waiting since Wed Aug 24 14:48:21 BST 2016
Queue Orders.Fulfilment: head message timestamp Wed Aug 24 14:48:21 BST 2016
===> Alert: Queue Orders.Fulfilment: message waiting since Wed Aug 24 14:48:21 BST 2016

Nagios + check_mk

My colleague Tom Parker created a sophisticated monitor using Clojure to feed a RabbitMQ monitoring event stream to Riemann. The distinctly lower-tech version presented here uses a check_mk script in Python for Nagios. Given that we’re monitoring the health of downstream applications as much as Rabbit itself, we need to get metrics at the per-queue level. (As far as I know, the other Nagios scripts for Rabbit only monitor a fixed set of top-level resources). As there may be a large number of queues the use of something like check_mk is essential for obtaining metrics efficiently in a single poll.

The current check_mk script is a bit skeletal but a similar one was created for the bank and has been running for a few years now. That version also loads a table of per-queue thresholds and ops. contacts, so allowing different thresholds and alert recipients to be defined.

The script is a conventional local check so should be installed where the check_mk agent expects – usually in /usr/lib/check_mk_agent/local. Running it directly should give output like the following (yes, my hostname is misshelpful):

./check_rabbitmq.py 
2 RMQ_misshelpful_Q_Orders.Billing consumers=0|msgs=1|rate_in=1.2624160518e-77|rate_out=0.0|unacked=0|wait_time=1324.23335791;30;120 CRIT - wait_time is 0:22:04.233358
2 RMQ_misshelpful_Q_Orders.Fulfilment consumers=0|msgs=1|rate_in=1.41556516763e-77|rate_out=0.0|unacked=0|wait_time=1324.23338509;30;120 CRIT - wait_time is 0:22:04.233385

These per service (i.e. per queue) reports are then digested by Nagios and alerts raised.

One advanced Nagios feature (!) is that when a queue goes back to normal the alert will be reset and no manual cancelling is needed.

A normal Nagios display has enough room in the status information for stats such as queue size. This is replaced by the warning or critical descriptions when a limit is reached.

Situation normalSituation normal
Queue warningsrmq6
Gone critical!rmq5

Conclusion

Obviously I recommend using this metric – in fact I’ve often wondered how peace-of-mind in operations teams can be achieved without it. You could argue that it would be better to closely monitor each downstream application but in my experience it’s surprisingly reliable and cost-effective to do this kind of thing via the messaging infrastructure.

There are more sophisticated tools or dashboards available for tracking business processes. However, these are somewhat detached from the infrastructure and will lack precise knowledge of what’s in a queue. Also, if this level of per-message tracking is required, there are options available directly in Rabbit using the event plugin or something like the auditing plugin we developed for a recent client. This isn’t to say that queues will always be better than a fully integrated BPM system of course.

Acknowledgements

Thanks to Michael Bridgen, Ashley Hewson, Michael Klishin and Alvaro Videla for reviewing and improving the rather small bit of RabbitMQ code involved.

Share

3 Comments

  1. Mark says:

    Getting the Unix timestamp in C# is easy.

    DateTimeOffset.Now.ToUnixTimeSeconds()
  2. Kenneth says:

    I was looking for a way to get notified of old messages in our RabbitMQ queues, and copied your approach to it. Thanks!

  3. Santiago Vasquez says:

    good stuff, thanks

Leave a Reply

Your email address will not be published.

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <s> <strike> <strong>

*