At the OpenStack Operators meetup, a question came up about monitoring issues related to RabbitMQ. Most OpenStack components use a message broker, and the one most commonly used by operators is RabbitMQ. For this post I’m going to concentrate on Nova and a couple of scenarios I’ve seen in production.
It’s important to understand the flow of messages amongst the various components and break things down into a couple of categories:
- Services which publish messages to queues (arrow pointing toward the queue in the diagram)
- Services which consume messages from queues (arrow pointing out from the queue in the diagram)
It’s also good to understand what actually happens when a message is consumed. In most cases, the consumer of the queue is writing to a database.
Take an instance reboot as an example: nova-api publishes a message to the target compute node’s queue. The nova-compute service running on that node consumes the message, passes the reboot to the virtualization layer, and updates the instance’s state to rebooting.
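To make the flow concrete, here is a rough, hypothetical sketch of a service publishing a message to a per-compute queue. Nova actually does this through oslo.messaging, so the library (kombu), vhost, queue names, credentials, and payload below are illustrative assumptions, not Nova’s real wire format.

```python
# Hypothetical sketch: publish a "reboot" message to a per-compute queue.
# The "nova" vhost, compute.<host> naming, credentials, and payload shape
# are assumptions for illustration only.
from kombu import Connection, Exchange, Queue

compute_host = "compute-01"  # assumed hostname of the target compute node
exchange = Exchange("nova", type="topic")
queue = Queue("compute.%s" % compute_host, exchange,
              routing_key="compute.%s" % compute_host)

with Connection("amqp://guest:guest@localhost:5672/nova") as conn:
    producer = conn.Producer()
    producer.publish(
        {"method": "reboot_instance", "args": {"instance_id": "abc-123"}},
        exchange=exchange,
        routing_key="compute.%s" % compute_host,
        declare=[queue],  # make sure the queue exists before publishing
    )
```

On the other side, the consumer (here, nova-compute) would pick that message off its queue, act on it, and record the resulting state change in the database.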
There are a few scenarios in which queue-related issues manifest:
- Everything’s broken - easy enough, rebuild or repair the RabbitMQ server. This post does not focus on this scenario because there is a considerable amount of material around hardening RabbitMQ in the OpenStack documentation.
- Everything is slow and getting slower - this often points to a queue being published to at a greater rate than it can be consumed. This scenario is more nuanced, and requires an operator to know a couple of things: which queues are shared among many services, and what the publish/consume rates are during normal operations.
- Some things are slow/not happening - some instance reboot requests go through, some do not. Generally speaking, these are 'last mile' operations that involve a change on the instance itself. This scenario is generally restricted to a single compute node, or possibly a cabinet of compute nodes.
Baselines for RabbitMQ queue size and consumption rate are very valuable in scenarios 2 and 3, because they give you normal operations to compare against. Without a baseline, it’s difficult to know whether the behavior is outside normal operating conditions.
There are a couple of tools that can help you out:
- Diamond RabbitMQ collector (code, docs) - Sends useful metrics from RabbitMQ to Graphite; requires the RabbitMQ management plugin.
- RabbitMQ HTTP API - This enables operators to retrieve statistics for specific queues rather than a view of the entire RabbitMQ server.
- Nagios Rabbit Compute Queues - This is a script used with Nagios to check specified compute queues, which helps determine whether operations targeting a specific compute node may be getting stuck (what I referred to earlier as scenario 3). Usually restarting the nova-compute service clears these up. The script looks for a local config file that grants access to the RabbitMQ management plugin; an example config file is in the gist. A minimal sketch of this style of check appears after this list.
- For very real time/granular insight, run the following command on the RabbitMQ server:
- watch -n 0.5 'rabbitmqctl -p nova list_queues | sort -rnk2 | head'
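Below is a minimal sketch (not the actual gist script) of a Nagios-style check that walks the management plugin’s HTTP API and alerts on deep compute queues. The host, credentials, vhost name (nova), thresholds, and the compute.* queue naming convention are all assumptions to adapt for your deployment.

```python
# Hypothetical Nagios-style check: flag compute.* queues that are backing up.
# Assumes the RabbitMQ management plugin is enabled and reachable.
import sys
import requests

RABBIT_API = "http://rabbit-host:15672/api/queues/nova"  # assumed endpoint/vhost
WARN_DEPTH = 50    # hypothetical warning threshold (messages)
CRIT_DEPTH = 200   # hypothetical critical threshold (messages)

resp = requests.get(RABBIT_API, auth=("monitoring", "secret"))
resp.raise_for_status()

status = 0  # Nagios exit codes: 0 OK, 1 WARNING, 2 CRITICAL
for q in resp.json():
    if not q["name"].startswith("compute."):
        continue
    depth = q.get("messages", 0)
    if depth >= CRIT_DEPTH:
        print("CRITICAL: %s has %d messages" % (q["name"], depth))
        status = max(status, 2)
    elif depth >= WARN_DEPTH:
        print("WARNING: %s has %d messages" % (q["name"], depth))
        status = max(status, 1)

if status == 0:
    print("OK: all compute queues below %d messages" % WARN_DEPTH)
sys.exit(status)
```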
Here is an example chart that can be produced with the RabbitMQ Diamond collector, which can be integrated into an operations dashboard:
Baseline monitoring of the RabbitMQ servers themselves isn’t enough. I recommend an approach that combines the following:
- Using the RabbitMQ management plugin (required)
- Nagios checks on specific queues (optional)
- Diamond RabbitMQ collector to send data to Graphite (a rough sketch of this flow follows below)
- A dashboard combining statistics from your RabbitMQ installations
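For illustration, here is a rough sketch of what the Diamond collector pipeline boils down to conceptually: pull per-queue counts and rates from the management API and write them to Graphite’s plaintext protocol. The hostnames, ports, credentials, and metric prefix are assumptions, not the collector’s actual metric names.

```python
# Conceptual sketch of the collector-to-Graphite flow (not Diamond itself):
# read queue depth and message rate from the management API, then push
# the values to Graphite's plaintext listener.
import socket
import time
import requests

RABBIT_API = "http://rabbit-host:15672/api/queues/nova"  # assumed endpoint/vhost
GRAPHITE = ("graphite-host", 2003)                       # plaintext protocol port
PREFIX = "rabbitmq.nova"                                 # hypothetical metric prefix

queues = requests.get(RABBIT_API, auth=("monitoring", "secret")).json()

now = int(time.time())
lines = []
for q in queues:
    name = q["name"].replace(".", "_")  # keep dots as Graphite path separators only
    lines.append("%s.%s.messages %d %d" % (PREFIX, name, q.get("messages", 0), now))
    rate = q.get("messages_details", {}).get("rate", 0.0)
    lines.append("%s.%s.rate %f %d" % (PREFIX, name, rate, now))

sock = socket.create_connection(GRAPHITE)
sock.sendall(("\n".join(lines) + "\n").encode("utf-8"))
sock.close()
```

Run something like this on an interval (Diamond handles the scheduling for you) and the resulting series give you the baselines needed to spot scenarios 2 and 3 early.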