There’s a good talk given by Gabe Westmaas at the Hong Kong OpenStack Summit.

The talk describes what Rackspace monitors in the public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used. I recommend watching it for OpenStack-specific monitoring and some context for this post.

In this post I am going to discuss how the sausage gets made - how the underlying Nagios configuration is managed.

Some background: We have 3 classes of Nagios servers.

  1. Global - monitors global control plane nodes (e.g., glance-api, nova-api, nova-cells, cell nagios)
  2. Cell - monitors cell control plane nodes, and individual clusters of data plane nodes (e.g., compute nodes/hypervisors)
  3. Mixed - smaller environments - these are a combined cell/global

Puppet determines each Nagios node’s class from its hostname, then applies the Nagios install/config Puppet module.
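
A minimal Python sketch of that hostname-to-class step might look like the following. It is not the actual Puppet logic. The cell label pattern is inferred from the nagios01.c0001.test.com example later in this post, and the MIXED_ENVIRONMENTS set and its "lab" entry are purely hypothetical.

```python
import re

# Assumption: a DNS label such as "c0001" marks a cell (inferred from the
# example hostname in this post, not from the real Puppet manifests).
CELL_LABEL = re.compile(r"^c\d+$")

# Hypothetical: environments small enough to run a combined cell/global node.
MIXED_ENVIRONMENTS = {"lab"}


def nagios_class(hostname: str) -> str:
    """Return 'mixed', 'cell', or 'global' for a Nagios node hostname."""
    # Drop the short host name and look only at the remaining DNS labels.
    labels = hostname.lower().split(".")[1:]
    if any(label in MIXED_ENVIRONMENTS for label in labels):
        return "mixed"
    if any(CELL_LABEL.match(label) for label in labels):
        return "cell"
    return "global"
```

With these assumptions, nagios_class("nagios01.c0001.test.com") returns "cell", and a hostname without a cell label falls through to "global".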

The Nagios Puppet setup is pretty simple. It performs basic installation and configuration of Nagios and pulls in a Git repository of Nagios config files. The Puppet modules/manifests change rarely, but the Nagios configuration itself has to change relatively frequently.

Types of changes to the Nagios configuration:

  1. Systems Lifecycle - normal bulk add/remove of service/host definitions. These are generated with some automation, currently a combination of Ansible and Python scripts which reach into other inventory systems.
  2. Gap Filling - as a result of RCAs or other efforts, gaps in the current monitoring configuration are identified. After the gap is identified, we need to ensure it is fully remediated in all existing datacenters and all new spin ups.
  3. Cosmetics/Tweaking - we perform analytics on our monitoring to identify and prioritize opportunities to automate remediation and/or deep dive into root causes. We have a logster parser running on each Nagios node which sends the what/when/where of alerts to StatsD/Graphite. To support the analytics effort, we sometimes make changes to give all services more machine-readable names. We also tune monitoring thresholds for services that are too chatty or not chatty enough.
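
To illustrate the alert analytics in item 3, here is a rough Python sketch that turns a Nagios SERVICE ALERT log line into a StatsD counter. It is a stand-in, not the logster parser itself. The "nagios.alerts" prefix and the metric naming scheme are assumptions, and this is also where machine-readable service names pay off.

```python
import re

# Nagios log format for service alerts:
# [epoch] SERVICE ALERT: host;service;state;state_type;attempt;plugin output
ALERT_RE = re.compile(
    r"^\[(?P<ts>\d+)\] SERVICE ALERT: "
    r"(?P<host>[^;]+);(?P<service>[^;]+);(?P<state>[^;]+);"
    r"(?P<state_type>[^;]+);(?P<attempt>\d+);(?P<output>.*)$"
)


def sanitize(part: str) -> str:
    """Make one metric path component safe for Graphite (no dots or spaces)."""
    return re.sub(r"[^A-Za-z0-9_-]+", "_", part).strip("_").lower()


def alert_to_statsd(line: str, prefix: str = "nagios.alerts"):
    """Return a StatsD counter line for a service alert, or None otherwise.

    The prefix and the where.what.state layout are hypothetical. The log
    timestamp is parsed but not sent, because StatsD records the "when"
    at arrival time.
    """
    match = ALERT_RE.match(line.strip())
    if match is None:
        return None
    name = ".".join([
        prefix,
        sanitize(match.group("host")),
        sanitize(match.group("service")),
        sanitize(match.group("state")),
    ])
    return f"{name}:1|c"
```

A service named "nova-compute" maps cleanly to a metric path, while a free-form name like "Check Nova Compute!" has to be squashed into "check_nova_compute" first.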

Change types #2 and #3 were the drivers for putting the Nagios configuration files into a single repository. Without a single repository, en masse changes were cumbersome and didn’t get made. The configuration repository is laid out like this:

  • Shared configurations are stored in a common folder, which has a subfolder for each Nagios node class.
  • Service/Host definitions are stored in folders relative to their environments
  • All datacenters/environments are stored within the environments folder
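
As a sketch of how generated definitions could land in this layout, the Python below renders a Nagios host definition and computes where it would live in the repository. This is not our actual automation. The one-file-per-host naming, the "generic-host" template name, and the sample host are assumptions for illustration.

```python
from pathlib import PurePosixPath


def render_host(host_name: str, address: str, template: str = "generic-host") -> str:
    """Render a Nagios host object definition.

    The template name is an assumption; use whichever host template
    the shared configuration actually defines.
    """
    return (
        "define host {\n"
        f"    use        {template}\n"
        f"    host_name  {host_name}\n"
        f"    address    {address}\n"
        "}\n"
    )


def definition_path(environment: str, cell: str, host_name: str) -> PurePosixPath:
    """Path of a host's cfg file relative to the repository root.

    One file per host is a hypothetical convention, not the real layout.
    """
    return PurePosixPath("environments") / environment / cell / f"{host_name}.cfg"


if __name__ == "__main__":
    # Hypothetical inventory record.
    print(definition_path("test", "c0001", "compute-01"))
    print(render_host("compute-01", "10.0.0.11"))
```

Keeping generated files inside the environment folder means a bulk add or remove shows up as an ordinary diff in the same repository as hand-made changes.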

The entire repository is cloned onto the Nagios node, and parts of it are copied and/or symlinked into /etc/nagios3/conf.d/ based on the Nagios node class and the environment.

For example:

  • nagios01.c0001.test.com: nagios class is cell (c0001 in the hostname), environment is test/c0001
  • /etc/nagios3/conf.d/ gets cfg files from the common/cell folder in the config repo
  • environments/test/c0001 is symlinked to /etc/nagios3/conf.d/c0001/
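
The example above can be sketched in Python as a small planning function. It only computes paths and touches nothing on disk. The repository checkout location (/opt/nagios-config) and the function and key names are hypothetical; the conf.d path and folder names come from the example.

```python
from pathlib import PurePosixPath

CONF_D = PurePosixPath("/etc/nagios3/conf.d")


def plan_conf_d(repo_root: str, node_class: str, environment: str) -> dict:
    """Describe which parts of the config repo feed conf.d for one node.

    repo_root is wherever the repository is cloned (an assumed location).
    node_class is 'global', 'cell', or 'mixed'; environment is a path
    such as 'test/c0001' under the environments folder.
    """
    repo = PurePosixPath(repo_root)
    env_dir = repo / "environments" / environment
    return {
        # Shared cfg files for this node class go straight into conf.d.
        "shared_source": str(repo / "common" / node_class),
        "shared_dest": str(CONF_D),
        # The environment folder is symlinked under its own name.
        "link_target": str(env_dir),
        "link_path": str(CONF_D / env_dir.name),
    }


if __name__ == "__main__":
    plan = plan_conf_d("/opt/nagios-config", "cell", "test/c0001")
    for key, value in plan.items():
        print(f"{key}: {value}")
```

Separating the plan from the filesystem changes makes it easy to check what a node would receive before anything is copied or linked.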

This setup has been working well for us in production. It enables first responders and engineers to make more meaningful changes to the monitoring stack at Rackspace, and to make them faster.