Identifying and Resolving NetXen nx_nic (Qlogic) NIC Failures
I’ve had the worst luck with NetXen NICs, specifically the HP-branded NC375 quad-port 1GbE flavor: they repeatedly hit some sort of issue that causes them to drop traffic and reset. This post contains some information on the NIC and how to troubleshoot the issues that I’ve encountered and resolved.
Driver and Firmware Details
In vSphere, the NetXen NIC uses the “VMware ESX/ESXi 4.0 Driver CD for QLogic Intelligent Ethernet Adapters” driver suite. Here’s a link to the latest drivers (as of 8/16/2011) if you need them. The driver shows up as “nx_nic”.
To determine the driver and firmware version you’re using, SSH into the ESX host and issue this command:
ethtool -i vmnicX (where X represents the uplink number that is using a NetXen card).
The results should be similar to this:
driver: nx_nic
version: 4.0.598
firmware-version: 4.0.534
bus-info: 0000:04:00.0
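If you have a lot of hosts to check, a small script can read this output for you. Here is a minimal sketch in Python, assuming you have captured the `ethtool -i` text yourself (over SSH, for example). The function names and the choice of 4.0.598 as the version to compare against are mine for illustration and are not part of any VMware or QLogic tooling.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` output into a dict of field name -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so bus addresses such as
        # 0000:04:00.0 stay intact in the value.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def version_tuple(version):
    """Turn a dotted version such as '4.0.598' into (4, 0, 598)."""
    return tuple(int(part) for part in version.split("."))


# Sample text copied from the output shown above.
sample = """\
driver: nx_nic
version: 4.0.598
firmware-version: 4.0.534
bus-info: 0000:04:00.0
"""

info = parse_ethtool_info(sample)
is_netxen = info.get("driver") == "nx_nic"
# Hypothetical threshold: treat 4.0.598 as the minimum acceptable driver.
driver_is_current = version_tuple(info["version"]) >= version_tuple("4.0.598")
print(is_netxen, driver_is_current)
```

Comparing versions as tuples of integers avoids the string-comparison trap where "4.0.1000" would sort before "4.0.598".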
To upgrade the drivers, I’m going to assume you have VMware Update Manager (VUM). Download the ISO from the link above, extract it to a folder, and then upload the offline bundle zip file into your VUM repository through the vSphere Client. The host patch is added to your non-critical updates automagically. Scan your hosts for missing updates and then remediate.
Common Issues with NetXen Cards
I run a lot of different types of NICs, from Broadcom (bnx2 driver) to Intel (e1000e driver), and I haven’t had any issues with those cards in years. I’m not sure who’s dropping the ball on the NetXen, but these cards just never seem to work 100% of the time.
One major issue that I’ve encountered, and one that actually triggers an HA failover, is the infamous transmit timeout. A Google search turns up lots of people having this problem. Most of the recommendations are to turn off TCP Segmentation Offload (TSO). However, the March edition of the nx_nic driver doesn’t even support TSO, so that’s not the issue. To check the offload settings on your card, issue this command:
ethtool -k vmnicX (where X represents the uplink number that is using a NetXen card).
The results when using version 4.0.585 of the nx_nic drivers are:
Offload parameters for vmnicX:
Cannot get device tcp segmentation offload settings: Function not implemented
Cannot get device udp large send offload settings: Function not implemented
Cannot get device generic segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
As you can see, TSO is disabled. Additionally, the driver download page also states that TSO is not a functioning feature.
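To check the TSO state across several uplinks without eyeballing each result, the `ethtool -k` output can be reduced to a simple on/off table. This is a rough Python sketch, assuming the output has been saved as text. The `parse_offloads` name is my own, and the sample is a shortened copy of the output above.

```python
def parse_offloads(text):
    """Return {feature: True/False} from `ethtool -k` output.

    Lines whose value is not "on" or "off" (the header and the
    "Function not implemented" errors) are skipped.
    """
    settings = {}
    for line in text.splitlines():
        name, sep, state = line.partition(":")
        state = state.strip()
        if sep and state in ("on", "off"):
            settings[name.strip()] = (state == "on")
    return settings


sample = """\
Offload parameters for vmnicX:
Cannot get device tcp segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
"""

settings = parse_offloads(sample)
tso_enabled = settings["tcp segmentation offload"]
print("TSO enabled:", tso_enabled)
```

Running the same function against output from a newer driver build makes it easy to spot which offload features changed after an upgrade.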
On the affected ESX host, the issue shows up in the messages.log file as follows. In my case the issue is on a card with vmnic0, 1, 2, and 3.
vmkernel: 5:16:51:21.296 cpu9:4443)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out
vobd: 492682296716us: [vob.net.uplink.watchdog.timeout] Watchdog timeout occurred for uplink vmnic0.
vmkernel: 5:16:51:22.907 cpu17:4458)IDT: 1565: 0x29
vmkernel: 5:16:51:22.907 cpu17:4458)IDT: 1634: <vmnic0>
vmkernel: 5:16:51:22.908 cpu17:4458)IDT: 1565: 0x31
vmkernel: 5:16:51:22.908 cpu17:4458)IDT: 1634: <vmnic1>
vmkernel: 5:16:51:22.909 cpu17:4458)IDT: 1565: 0x39
vmkernel: 5:16:51:22.909 cpu17:4458)IDT: 1634: <vmnic2>
vmkernel: 5:16:51:22.910 cpu17:4458)IDT: 1565: 0x41
vmkernel: 5:16:51:22.910 cpu17:4458)IDT: 1634: <vmnic3>
vmkernel: 5:16:51:22.911 cpu17:4458)<3>nx_nic[vmnic0]: Load stored FW
vobd: 492700761357us: [esx.problem.net.vmnic.watchdog.reset] Uplink vmnic0 had recovered from a transient failure due to watchdog timeout.
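To get a feel for how long each reset actually takes, the two vobd entries can be paired up and their timestamps subtracted. The sketch below is one way to do that in Python, assuming the `us` suffix on the vobd lines means microseconds and that you run it against a copy of the log with one entry per line. The regular expressions and the `watchdog_outages` name are my own and only match the two message formats shown in the excerpt above.

```python
import re

# Patterns based solely on the two vobd messages in the log excerpt.
TIMEOUT = re.compile(
    r"(\d+)us: \[vob\.net\.uplink\.watchdog\.timeout\].*uplink (vmnic\d+)"
)
RESET = re.compile(
    r"(\d+)us: \[esx\.problem\.net\.vmnic\.watchdog\.reset\] Uplink (vmnic\d+)"
)


def watchdog_outages(lines):
    """Yield (uplink, seconds) for each timeout later followed by a reset."""
    pending = {}  # uplink name -> timestamp (us) of its last timeout
    for line in lines:
        match = TIMEOUT.search(line)
        if match:
            pending[match.group(2)] = int(match.group(1))
            continue
        match = RESET.search(line)
        if match and match.group(2) in pending:
            start = pending.pop(match.group(2))
            yield match.group(2), (int(match.group(1)) - start) / 1e6


log_lines = [
    "vobd: 492682296716us: [vob.net.uplink.watchdog.timeout] "
    "Watchdog timeout occurred for uplink vmnic0.",
    "vmkernel: 5:16:51:22.911 cpu17:4458)<3>nx_nic[vmnic0]: Load stored FW",
    "vobd: 492700761357us: [esx.problem.net.vmnic.watchdog.reset] "
    "Uplink vmnic0 had recovered from a transient failure due to watchdog timeout.",
]

outages = list(watchdog_outages(log_lines))
for uplink, seconds in outages:
    print("%s was down for about %.1f seconds" % (uplink, seconds))
```

For the excerpt above this works out to roughly 18.5 seconds between the timeout and the recovery message for vmnic0.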
From what I can tell, this basically means that the card puked somewhere at the driver level and reset. I have several hosts that use a pair of these cards (one on board, the other in a PCIe slot). I’ve had several occasions where the cards reset and HA no longer has an uplink to use for heartbeats, so it immediately goes into an isolation response – the advanced HA option of das.failuredetectiontime is rendered useless (I have my clusters set to 60 seconds). By default, HA shuts down guest VMs. The timeline is like this:
- Cards time out and reset.
- HA immediately panics and the host goes into an isolation response.
- The VM guests are shut down (default behavior).
- As the VMs are shutting down, the card is reset and the uplink is restored.
- HA heartbeats are received once again.
- VMs are never powered on at another host and remain off!
TSO in Upgraded Firmware and Drivers
As of the 4.0.598 build of the nx_nic drivers, TSO makes a comeback. As you can see from the results of my ethtool query, I do now have TSO and it is enabled. If this causes a problem in the future, I’ll disable it – however it was disabled previously (or rather, it was not available), so I don’t know how it could cause further problems to have it on.
Offload parameters for vmnic0:
Cannot get device udp large send offload settings: Function not implemented
Cannot get device generic segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
There was an announcement from HP on 8/11/2011 that a number of firmware updates are also available for the NetXen suite of hardware.
The article states clearly:
FIRMWARE UPGRADE REQUIRED to Avoid the Loss and Automatic Recovery of Ethernet Connectivity or Adapter Unresponsiveness Requiring a Server Reboot to Recover
To update your firmware on a VMware host, use the “Firmware Upgrade for the NC375i, NC375T, NC522m, and NC522SFP Adapters Under VMware” section, or get a hold of the NetXen firmware CD/ISO as I have and install over iLO.
Thank you to Ben Richter for discussing his issues with the NetXen cards and emailing me over this link!
Unless you are extremely confident in your network and NetXen cards, set your HA isolation response to “leave powered on” if you are running ESX 4.1 or higher. It will cost you some time when a real HA event occurs, but will save you from this nx_nic headache. The chances of a real HA event are much, much slimmer than the chances of the cards resetting (they seem to do it on a random host monthly). I’ve submitted a number of SRs to VMware, but they are unable to assist as it’s considered a driver issue. If you find a permanent solution, please share in the comments.