Switch and Network Adapter Fault Tolerance
Each of our VMware ESX hosts was equipped with dual network adapters (NICs). On a typical physical server, two NICs can provide fault tolerance. For ESX hosts, however, dual NICs are not fault tolerant. VMware ESX has three major types of traffic:
- VMkernel – used for vMotion, which allows host downtime without an interruption of service
- Service Console – initiates vMotion and serves as the primary means of managing virtual machines
- VM Traffic – each individual virtual machine’s traffic, e.g., a web server’s incoming/outgoing requests
Our previous setup had significant points of failure:
- If one NIC failed, there would be a complete service interruption.
  - If VMNIC0 failed, the interruption would be slightly more controllable, as it could be scheduled, even though the virtual machines would not be manageable.
  - If VMNIC1 failed, all virtual machine traffic would be unexpectedly interrupted.
- If the physical switch failed, there would be an unscheduled complete service interruption.
Our old VMware Networking setup, no NIC/switch redundancy
The plan was to purchase an additional quad-port NIC to remedy the NIC redundancy issue. Together with the two existing ports, the four additional ports give a host six ports, so each of the three traffic types can have two dedicated ports. This setup is tolerant of a NIC failure for each of the three traffic types. It takes the Cisco ESX Hosts with 4 NICs design (page 62) to the next level. To fix the switch redundancy problem, the two NICs for each traffic type are split evenly across two physical switches.
Networking setup demonstrating switch/NIC redundancy
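The difference between the two layouts can be checked mechanically. The following is a minimal sketch, not the actual host configuration: the port names `vmnic0` to `vmnic5` and the switch names `switch-A` and `switch-B` are hypothetical, and placing VMkernel on `vmnic0` in the old setup is an assumption. It fails each port and each switch in turn and reports which traffic types would be left without a working path.

```python
from typing import Dict, List, Tuple

# Each port maps to (traffic types it carries, physical switch it plugs into).
Layout = Dict[str, Tuple[List[str], str]]

TRAFFIC_TYPES = ["Service Console", "VMkernel", "VM Traffic"]

# Old setup: two ports on one switch.
# Assumption: VMkernel shares vmnic0 with the Service Console.
OLD_LAYOUT: Layout = {
    "vmnic0": (["Service Console", "VMkernel"], "switch-A"),
    "vmnic1": (["VM Traffic"], "switch-A"),
}

# New setup: six ports, two per traffic type, one on each switch.
# Port and switch names are hypothetical.
NEW_LAYOUT: Layout = {
    "vmnic0": (["Service Console"], "switch-A"),
    "vmnic1": (["Service Console"], "switch-B"),
    "vmnic2": (["VMkernel"], "switch-A"),
    "vmnic3": (["VMkernel"], "switch-B"),
    "vmnic4": (["VM Traffic"], "switch-A"),
    "vmnic5": (["VM Traffic"], "switch-B"),
}


def lost_traffic(layout: Layout, failed: str) -> List[str]:
    """Return the traffic types with no working port after `failed`
    (a port name or a switch name) goes down."""
    surviving = set()
    for port, (types, switch) in layout.items():
        if failed not in (port, switch):
            surviving.update(types)
    return [t for t in TRAFFIC_TYPES if t not in surviving]


def single_points_of_failure(layout: Layout) -> Dict[str, List[str]]:
    """Map each port or switch whose failure cuts off traffic to what it cuts off."""
    components = list(layout) + sorted({sw for _, sw in layout.values()})
    result: Dict[str, List[str]] = {}
    for component in components:
        lost = lost_traffic(layout, component)
        if lost:
            result[component] = lost
    return result


if __name__ == "__main__":
    print("old:", single_points_of_failure(OLD_LAYOUT))
    print("new:", single_points_of_failure(NEW_LAYOUT))
```

Under these assumptions, the old layout reports both ports and the switch as single points of failure, while the new layout reports none. The model treats every port as an independent component, which is the same level of detail as the failure list above.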
Networking Performance
With the previous setup, an average of 7.5 VMs per host were going through a single NIC. The NICs were being overused, and users were complaining of slow network performance. We implemented NIC teaming, which gave favorable results while also providing failover.
Cacti Diagram showing before and after on an ESX host
In the diagram above, utilization drops drastically after Week 32. This shows the reduced load on the NIC that had been overutilized.
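To illustrate why teaming lowers per-NIC load and still provides failover, here is a rough sketch. It assumes a simplified policy that maps each VM's virtual port number to an active uplink by modulo. The teaming policy actually used is not stated here, and the uplink names `vmnic4` and `vmnic5` and the count of eight VMs are hypothetical.

```python
from typing import Dict, List


def assign_vms(vm_ports: List[int], uplinks: List[str]) -> Dict[str, List[int]]:
    """Spread VM virtual ports over the active uplinks by port number modulo
    the uplink count (a simplified stand-in for a real teaming policy)."""
    if not uplinks:
        raise ValueError("no active uplinks: VM traffic is down")
    placement: Dict[str, List[int]] = {u: [] for u in uplinks}
    for port in vm_ports:
        placement[uplinks[port % len(uplinks)]].append(port)
    return placement


vm_ports = list(range(8))  # eight hypothetical VMs on one host

single = assign_vms(vm_ports, ["vmnic4"])            # before: one NIC carries everything
teamed = assign_vms(vm_ports, ["vmnic4", "vmnic5"])  # after: two teamed NICs
failover = assign_vms(vm_ports, ["vmnic5"])          # vmnic4 fails, vmnic5 takes over

print({u: len(p) for u, p in single.items()})    # {'vmnic4': 8}
print({u: len(p) for u, p in teamed.items()})    # {'vmnic4': 4, 'vmnic5': 4}
print({u: len(p) for u, p in failover.items()})  # {'vmnic5': 8}
```

In this model, each teamed NIC carries half of the VMs during normal operation. If one uplink fails, the surviving NIC carries all of them, so service continues at the pre-teaming load level.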