Monitoring Edge Node Network Configuration

Over the last few months I’ve done a bit of work around monitoring, Open vSwitch, and XenServer. This post lists some of the networking and Open vSwitch-specific items to monitor on hypervisors. Link Status: the Nagios SNMP Interfaces plugin works well for reporting a failed link as well as error rates and inbound/outbound bandwidth. Open vSwitch Manager and Controller Status: Transport Node Status is a quick and dirty Python script which can be used with extended SNMP to alert when OVS loses a connection to a manager/controller....

July 28, 2014 · itsahill00

On Failure

A couple of interesting research papers around failure, found in The Datacenter as a Computer. Failure Trends in a Large Disk Drive Population (2007) Out of all failed drives, over 56% of them have no count in any of the four strong SMART signals, namely scan errors, reallocation count, offline reallocation, and probational count. In other words, models based only on those signals can never predict more than half of the failed drives....

May 9, 2014 · itsahill00

On Working Remote

In late March I relocated from San Antonio, TX to Lexington, KY. Same awesome job just with a twist…REMOTE WORK! I am mainly collaborating via IRC, tmux, 1:1/M:M TeamSpeak, and 1:1/M:M video conferencing. My takeaways after the first month: Obvious, but it took me a while: When you’re talking, LOOK AT THE CAMERA - multiple monitors and a MacBook make this a little awkward Quality matters: appropriate microphones & video, particularly conference rooms, make a huge difference....

April 29, 2014 · itsahill00

Required Reading: The Datacenter as a Computer

Several Google employees published The Datacenter as a Computer. It’s freely available. The text covers a broad area, from monitoring to cost modelling for power, but it’s quite digestible. At a light ~130 pages it’s an easy read. The bibliography is worth digging into if you want deeper dives on specific topics.

April 3, 2014 · itsahill00

Upgrading Open vSwitch

Operating Open vSwitch brings a new set of challenges. One of those challenges is managing Open vSwitch itself and making sure you’re up to date with performance and stability fixes. For example, in late 2013 there were significant performance improvements with the release of 1.11 (flow wildcarding!) and in the 2.x series there are even more improvements coming. This means everyone running those old versions of OVS (I’m looking at you, <=1....

March 28, 2014 · itsahill00

StatsD and multiple metrics

Measure all the things! Graphite & StatsD are my weapons of choice. One set of metrics in particular that we wanted to measure is the various TCP stats, including TCP retransmit rate. We crafted a Python script to send all of the metrics in a single UDP packet and hit a weird scenario. The Python script was all ready to roll except that StatsD was only logging one metric. All of the metric packets were arriving at the StatsD instance, but only one was being processed....
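For reference, StatsD accepts several metrics in one UDP datagram when they are separated by newlines. The excerpt above is cut off before it gives the cause of the problem, so the following is only a minimal sketch of that documented packet format, not the script from the post. The metric names, the gauge type, and the host/port defaults are assumptions made for illustration.

```python
import socket


def build_statsd_payload(metrics):
    """Join (name, value, type) tuples into one newline-delimited payload.

    Each metric becomes "name:value|type"; StatsD splits a datagram on
    newlines, so every metric must sit on its own line.
    """
    lines = ["{}:{}|{}".format(name, value, mtype) for name, value, mtype in metrics]
    return "\n".join(lines).encode("ascii")


def send_metrics(metrics, host="127.0.0.1", port=8125):
    """Send all metrics in a single UDP datagram (host/port are assumed defaults)."""
    payload = build_statsd_payload(metrics)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(payload, (host, port))
    finally:
        sock.close()
    return payload


# Hypothetical TCP counters, sent as gauges.
example = [("tcp.retrans_segs", 12, "g"), ("tcp.out_segs", 3400, "g")]
print(build_statsd_payload(example))
```

Keeping the payload builder separate from the socket call makes it easy to check exactly what goes on the wire before pointing the script at a real StatsD instance.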

February 13, 2014 · itsahill00

Deep Dive: OpenStack Retrieving Nova Instance Console URLs with XVP and XenAPI/XenServer

This post is a deep dive into what happens in Nova (and where in the code) when a console URL is retrieved via the nova API for a Nova configuration backed by XVP and XenServer/XenAPI. Hopefully the methods used in Nova’s code will not change over time, and this guide will remain a good starting point. Example nova client call: [code]nova get-vnc-console [uuid] xvpvnc[/code] And the call returns: +--------+-------------------------------------------------------------------------------------------------------+ | Type | Url | +--------+-------------------------------------------------------------------------------------------------------+ | xvpvnc | https://URL:PORT/console?...

February 11, 2014 · itsahill00

The Host Network Stack

This post is a collection of useful articles/videos that I’ve collected about networking on XenServer and Linux. XenServer Xen Network Throughput and Performance Guide (Technical Overview) XenServer: Under the Hood < Specifically device -> PIF -> network -> VIF relationship Linux (video) Through the Ether and Back Again < discusses python and the Linux Sockets API How SKBs work Queueing in the Linux Network Stack Linux Advanced Routing & Traffic Control HOWTO Linux Device Drivers 3rd Edition < specifically chapter 17 As you can see, there are a multitude of elements to consider when looking into host networking issues for a Linux VM running on XenServer (which is Linux underneath the covers anyway)....

February 5, 2014 · itsahill00

Managing Nagios Configurations

Gabe Westmaas gave a good talk at the HK OpenStack Summit. The talk describes what Rackspace monitors in the public cloud OpenStack deployment, how responses are handled, and some of the integration points that are used. I recommend watching it for OpenStack-specific monitoring and a little context around this post. In this post I am going to discuss how the sausage gets made - how the underlying Nagios configuration is managed....

January 22, 2014 · itsahill00

Determining Enabled VLANs from SNMP with Python

Similar to this thread, I wanted to see what VLANs were allowed for a trunked port as reported by SNMP with Python. With the help of a couple of colleagues, I made some progress. [code language=“python”] vlan_value = ‘000000000020000000000000000000000000200000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000’ for key,value in enumerate(format(int(vlan_value, 16), “0100b”).rjust(len(vlan_value) * 4, ‘0’)): … if value == ‘1’: … print key … … … 42 146 [/code] Convert the string returned to Hex Convert that to Binary Right fill 0s to the appropriate length to give offset (determined by the size of the string) Loop through the resulting value and each character that is a 1 is an enabled VLAN on the port In conjunction with LLDP, I’m able to query each switch/port and interface is connected to and determine if the VLANs are set properly on the port....
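The bitmap decoding described above can also be wrapped in a small function. This is a sketch in Python 3, assuming the SNMP value has already been fetched as a hex string; the function name `enabled_vlans` is made up for illustration, and, following the excerpt, the bit index is taken directly as the VLAN number.

```python
def enabled_vlans(hex_bitmap):
    """Return the indices of the set bits in a hex-encoded bitmap.

    Bits are counted from the most significant bit of the first hex
    digit, so index 0 is the leftmost bit of the string.
    """
    # Four bits per hex digit; zero-pad so leading zeros keep their offsets.
    width = len(hex_bitmap) * 4
    bits = format(int(hex_bitmap, 16), "0{}b".format(width))
    return [index for index, bit in enumerate(bits) if bit == "1"]


# A shortened stand-in for the value in the post: '2' at hex digits 10 and 36.
sample = "0" * 10 + "2" + "0" * 25 + "2"
print(enabled_vlans(sample))
```

Building the zero-padded width into the format spec replaces the separate right-justify step, and returning a list makes the result easy to compare against the VLANs a port is expected to carry.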

December 13, 2013 · itsahill00