iSCSI SAN performance woes with VMware ESX 3.5

We filed support requests with IBM and VMware and went through a very lengthy process without any results.

Each of our hosts had the following iSCSI HBAs:

QLA4010
QLA4050C

A while ago we found out QLA4010 is not on the ESX 3.5 HCL even though it runs with a legacy driver.

As our virtual environment grew we noticed storage performance lagging. This was particularly evident with our Oracle 10G Database server running our staging instance of Banner Operational Data Store. We were seeing 1.1 MB/sec and slower for disk writes.

We opened a case with VMware support and later with IBM support. We provided lots of data to VMware and IBM while no one mentioned the unsupported HBA. No one at IBM mentioned it either. VMware support referred us to KB# 1006821 to test virtual machine storage I/O performance.

We ran HD Speed in a new VM mimicing the setup using RDM and using a dedicated LUN. Similar results. We ran HD Speed on the same RDM on a physical machine and got 45 MB/sec.

All of our hosts had an entry like this in the logs ( grep -i abort /var/log/vmkernel* | less)

vmkernel.36:Mon DD HH:ii:ss vmkernel: 29:02:31:16.863 cpu3:1061)LinSCSI: 3201: Abort failed for cmd with serial=541442, status=bad0001, retval=bad0001

Hundreds, if not thousands of these iSCSI aborts in the log files. We punted to IBM and they gave us the recommendation of running Host Utilities Kit. This optimizes HBA settings specific to IBM storage systems.

My recommendation ended up being two fold: Upgrade the ESX hosts because we were on an old build (95xxx) and replace the QLA4010 with a QLA4050C on each host.

Now that our ESX upgrade is complete we are seeing much better performance from our iSCSI storage.