
Identifying and Resolving NetXen nx_nic (Qlogic) NIC Failures

by Chris Wahl on Aug 16th, 2011 | 48,671 views

I’ve had the worst luck with NetXen NICs, specifically the HP branded NC375 quad port 1GbE flavor, as they continuously seem to experience some sort of issue that causes them to drop traffic and reset. This post contains some information on the NIC and how to troubleshoot the issues that I’ve encountered and resolved.

Driver and Firmware Details

In vSphere, the NetXen NIC uses the “VMware ESX/ESXi 4.0 Driver CD for QLogic Intelligent Ethernet Adapters” driver suite. Here’s a link to the latest drivers (as of 8/16/2011) if you need them. The driver shows up as “nx_nic”.

To determine the driver and firmware version you’re using, SSH into the ESX host and issue this command:

ethtool -i vmnicX   (where X represents the uplink number that is using a NetXen card).

The results should be similar to this:

driver: nx_nic
version: 4.0.598
firmware-version: 4.0.534
bus-info: 0000:04:00.0
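
If you need to check many hosts, the output above is easy to parse. The following is a minimal Python sketch, assuming you have already captured the `ethtool -i` text from a host by some means; the function name `parse_ethtool_info` is hypothetical and not part of any VMware or QLogic tooling.

```python
def parse_ethtool_info(text):
    """Parse `ethtool -i` style output into a dict of key -> value."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only: bus-info values contain colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


# Sample text copied from the output shown above.
sample = """driver: nx_nic
version: 4.0.598
firmware-version: 4.0.534
bus-info: 0000:04:00.0
"""

info = parse_ethtool_info(sample)
print(info["driver"], info["version"], info["firmware-version"])
```

With the driver and firmware versions in a dict, it becomes straightforward to flag hosts that are still on an older build.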

To upgrade the drivers, I’m going to assume you have VMware Update Manager (VUM). Download the ISO from the link above, extract it to a folder, and then upload the offline bundle zip file into your VUM repository through the vSphere Client. The host patch is added to your non-critical updates automagically. Scan your hosts for missing updates and then remediate.

Common Issues with NetXen Cards

I run a lot of different types of NICs, from Broadcom (bnx2 driver) to Intel (e1000e driver). I’ve not had any issues in years with these cards. I’m not sure who’s dropping the ball on the NetXen, but these cards just never seem to work 100% of the time.

One major issue that I’ve encountered, and one that actually triggers an HA failover, is the infamous transmit timeout. A Google search turns up lots of people having this problem. Most of the recommendations are to turn off TCP Segmentation Offload (TSO). However, the March edition of the nx_nic driver doesn’t even support TSO, so that’s not the issue. To check the offload settings on your card, issue this command:

ethtool -k vmnicX   (where X represents the uplink number that is using a NetXen card).

The results when using version 4.0.585 of the nx_nic drivers are:

Offload parameters for vmnicX:
Cannot get device tcp segmentation offload settings: Function not implemented
Cannot get device udp large send offload settings: Function not implemented
Cannot get device generic segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off

As you can see, TSO is disabled. Additionally, the driver download page also states that TSO is not a functioning feature.
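
To check this across several uplinks without reading each listing by eye, the offload output can be parsed as well. This is a rough Python sketch that assumes the text format shown above; `parse_offload_settings` is a hypothetical helper name, and other ethtool versions may print different lines.

```python
def parse_offload_settings(text):
    """Split `ethtool -k` style output into on/off settings and
    features the driver reports as not implemented."""
    settings = {}
    unsupported = []
    prefix = "Cannot get device"
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(prefix):
            # "Cannot get device <feature> settings: Function not implemented"
            feature = line[len(prefix):].split(" settings:")[0].strip()
            unsupported.append(feature)
        elif line.endswith((": on", ": off")):
            name, _, state = line.rpartition(": ")
            settings[name] = (state == "on")
    return settings, unsupported


# Sample text copied from the 4.0.585 output shown above.
sample = """Offload parameters for vmnicX:
Cannot get device tcp segmentation offload settings: Function not implemented
Cannot get device udp large send offload settings: Function not implemented
Cannot get device generic segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
"""

settings, unsupported = parse_offload_settings(sample)
print("TSO enabled:", settings["tcp segmentation offload"])
print("Not implemented:", unsupported)
```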

The issue you will see in the messages.log file of the ESX host affected is as follows. In my case the issue is on a card with vmnic0, 1, 2, and 3.

vmkernel: 5:16:51:21.296 cpu9:4443)WARNING: LinNet: netdev_watchdog: NETDEV WATCHDOG: vmnic0: transmit timed out
vobd: 492682296716us: [vob.net.uplink.watchdog.timeout] Watchdog timeout occurred for uplink vmnic0.
vmkernel: 5:16:51:22.907 cpu17:4458)IDT: 1565: 0x29
vmkernel: 5:16:51:22.907 cpu17:4458)IDT: 1634: <vmnic0>
vmkernel: 5:16:51:22.908 cpu17:4458)IDT: 1565: 0x31
vmkernel: 5:16:51:22.908 cpu17:4458)IDT: 1634: <vmnic1>
vmkernel: 5:16:51:22.909 cpu17:4458)IDT: 1565: 0x39
vmkernel: 5:16:51:22.909 cpu17:4458)IDT: 1634: <vmnic2>
vmkernel: 5:16:51:22.910 cpu17:4458)IDT: 1565: 0x41
vmkernel: 5:16:51:22.910 cpu17:4458)IDT: 1634: <vmnic3>
vmkernel: 5:16:51:22.911 cpu17:4458)<3>nx_nic[vmnic0]: Load stored FW
vobd: 492700761357us: [esx.problem.net.vmnic.watchdog.reset] Uplink vmnic0 had recovered from a transient failure due to watchdog timeout.

From what I can tell, this basically means that the card puked somewhere at the driver level and reset. I have several hosts that use a pair of these cards (one on board, the other in a PCIe slot). I’ve had several occasions where the cards reset and HA no longer has an uplink to use for heartbeats, so it immediately goes into an isolation response – the advanced HA option of das.failuredetectiontime is rendered useless (I have my clusters set to 60 seconds). By default, HA shuts down guest VMs. The timeline is like this:

  1. Cards timeout and reset.
  2. HA immediately panics and the host goes into an isolation response.
  3. The VM guests are shut down (default behavior).
  4. As the VMs are shutting down, the card is reset and the uplink is restored.
  5. HA heartbeats are received once again.
  6. VMs are never powered on at another host and remain off!

To resolve this, I’ve set the isolation response to “leave powered on”. In vSphere 5, this is the default, and from what I’ve read this is a “best practice” anyway to avoid false positives.
I’ve also updated the firmware and drivers to the latest build, 4.0.598, to hopefully resolve this.
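
To get a feel for how often the cards reset, the watchdog entries shown earlier can be counted per uplink. Below is a small Python sketch that matches only the two vobd event strings from my log excerpt. The regular expressions and the function name `count_watchdog_events` are my own assumptions, and other ESX builds may word these messages differently.

```python
import re
from collections import Counter

# Patterns based on the two vobd lines in the log excerpt above.
TIMEOUT_RE = re.compile(
    r"\[vob\.net\.uplink\.watchdog\.timeout\].*uplink (vmnic\d+)")
RESET_RE = re.compile(
    r"\[esx\.problem\.net\.vmnic\.watchdog\.reset\] Uplink (vmnic\d+)")


def count_watchdog_events(lines):
    """Count watchdog timeouts and recoveries per uplink."""
    timeouts, resets = Counter(), Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            timeouts[match.group(1)] += 1
            continue
        match = RESET_RE.search(line)
        if match:
            resets[match.group(1)] += 1
    return timeouts, resets


# Sample lines copied from the log excerpt above.
sample = [
    "vobd: 492682296716us: [vob.net.uplink.watchdog.timeout] "
    "Watchdog timeout occurred for uplink vmnic0.",
    "vmkernel: 5:16:51:22.911 cpu17:4458)<3>nx_nic[vmnic0]: Load stored FW",
    "vobd: 492700761357us: [esx.problem.net.vmnic.watchdog.reset] "
    "Uplink vmnic0 had recovered from a transient failure due to watchdog timeout.",
]

timeouts, resets = count_watchdog_events(sample)
print(dict(timeouts), dict(resets))
```

In practice you would pass in the lines of the host's messages.log instead of the sample list; where that file lives depends on your ESX version, so treat any path as an assumption.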

TSO in Upgraded Firmware and Drivers

As of the 4.0.598 build of the nx_nic drivers, TSO makes a comeback. As you can see from the results of my ethtool query, I do now have TSO and it is enabled. If this causes a problem in the future, I’ll disable it – however it was disabled previously (or rather, it was not available), so I don’t know how it could cause further problems to have it on. :)

Offload parameters for vmnic0:
Cannot get device udp large send offload settings: Function not implemented
Cannot get device generic segmentation offload settings: Function not implemented
rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off

Updates (9/11/2011)

There was an announcement from HP on 8/11/2011 that a number of firmware updates are also available for the NetXen suite of hardware.

Link: http://bizsupport1.austin.hp.com/bizsupport/TechSupport/Document.jsp?lang=en&cc=us&taskId=110&prodSeriesId=4118472&prodTypeId=12169&objectID=c02964542

The article states clearly:

FIRMWARE UPGRADE REQUIRED to Avoid the Loss and Automatic Recovery of Ethernet Connectivity or Adapter Unresponsiveness Requiring a Server Reboot to Recover

To update your firmware on a VMware host, use the “Firmware Upgrade for the NC375i, NC375T, NC522m, and NC522SFP Adapters Under VMware” section, or get a hold of the NetXen firmware CD/ISO as I have and install over iLO.

Thank you to Ben Richter for discussing his issues with the NetXen cards and emailing me over this link!

Thoughts

Unless you are extremely confident in your network and NetXen cards, set your HA isolation response to “leave powered on” if you are running ESX 4.1 or higher. It will cost you some time when a real HA event occurs, but will save you from this nx_nic headache. The chances of a real HA event are much, much slimmer than those of the cards resetting (they seem to do it on a random host monthly). I’ve submitted a number of SRs to VMware, but they are unable to assist as it’s considered a driver issue. If you find a permanent solution, please share in the comments.


71 Comments
  1. Mark Hodges permalink - Aug 23rd, 2011

    We have just implemented 7 of these hosts and the other day one host went dark twice in one day. The host completely lost all connectivity on both NICs at the same time (which is unfortunate when you run all the networking and storage across them).

    I really hope the HP 930 firmware plus these latest drivers resolve the issue.

    • Chris permalink - Aug 24th, 2011

      I know that our SQL team uses DL 580s and 980s and have reported some network issues as well, which also use these cards, which reinforces my belief that the NetXen card will probably not make it into the next rev of G8 servers. :)

  2. Chris Brennan permalink - Aug 23rd, 2011

    Hi, we had the exact same isolation issue as well with HP NC375T PCI Express cards on two different servers on the weekend. Broadcom cards were unaffected. Driver versions of the NC375T cards are below.

    [root@pegesx02 log]# ethtool -i vmnic8
    driver: nx_nic
    version: 4.0.550.1-1vmw
    firmware-version: 4.0.534
    bus-info: 0000:0e:00.0

    • Chris permalink - Aug 24th, 2011

      Thanks for sharing your experience, Chris. I have not gone longer than a month without the issue randomly popping up on my DL 580 cluster, so I still have about 3 weeks left before I know if the new 4.0.598 drivers fix the issue. Using the “leave powered on” isolation response has been enough to safeguard from a false positive so far. When the NIC failure occurs, the host typically has a bang on it with an alarm stating “host error” along with HA noticing a degraded cluster.

      I’ll make sure to update the post with the outcome.

  3. Mark Hodges permalink - Aug 24th, 2011

    Our big problem is that on our hosts we are putting ALL traffic across them so when it drops…well, everything hits the dirt including the VM’s….

  4. VMLOST permalink - Sep 15th, 2011

    Chris,
    I know it’s a little late in responding to your post, however, I am having an issue (maybe) similar to what is happening above. I have a DL380 G7 series in place with two 4-port nx_nic cards. However, when trying to configure jumbo frames, I keep getting the following error: “VMkernel failed to set the MTU value 9000 on the uplink vmnic10”. I can create a vSwitch and port group and set it to 9000, however I can’t get the nx_nic cards to attach… I have Broadcom onboard NICs and they work perfectly. I have also followed the post above and it still hasn’t helped. Any ideas?

    thanks in advance

    • Chris permalink - Sep 16th, 2011

      I can’t say I’ve encountered this issue, but I also don’t use (nor encourage) the use of jumbo frames for VMware. There are a number of articles out on the web that reinforce this belief through performance tests – in some cases, jumbo frames can decrease overall performance due to the workload being virtual machines. They do not typically send large strings of data that could take advantage of a jumbo frame. As for your issue, ensure that your vSwitch is set to an MTU of 9000 before trying to do the same on the vmkernel port.

  5. Vaughn permalink - Oct 18th, 2011

    I just called HP Tech and they have a known issue with the NC375T and not being allowed to enable Jumbo frames with ESXi 4.X.

    HP’s solution is to downgrade the firmware to 4.0.544 as a workaround.

    I have not tested this!

  6. Oliver permalink - Oct 24th, 2011

    I lost the four LOMs in a DL580 G7 last week, even with current firmware and driver in ESXi 4.1. This whole nx_nic thing seems like a disaster.

  7. Chris permalink - Nov 14th, 2011

    Just another update everyone – I got a hold of some RC drivers / firmware for the cards from Qlogic and STILL had the NIC disconnect / reset on a host. I don’t have the version numbers off hand but will update when I do. Seriously disappointed with this card.

  8. VMdude permalink - Nov 25th, 2011

    Hi,

    I too am having the same issue with configuring Jumbo Frames (want one NC375T nic for iSCSI). I have updated to the latest firmware as of Nov 22 2011, which is 4.0.579
    HP advised to downgrade to 4.0.544, but there is an advisory regarding the unresponsiveness/stability issue that states that the issue is resolved in firmware 4.0.556 and later.

    So it seems to be a catch 22 … either have stability + no jumbo frames, or jumbo frames + no stability

    What a mess, I am waiting for more updates from HP. Probably time to return these cards I think.

  9. Mark Hodges permalink - Nov 25th, 2011

    We completely replaced all our HP dual port nic’s with the Intel X520-SR1 (E10G41BFSR) single port cards at a cost of $12000 (since we only ever used a single port anyhow) and since that time we have not had a single incident or issue with our esx farm dropping.

    I’d call that money well spent because I can sleep at night now.

  10. Chris permalink - Dec 4th, 2011

    We’ve decided to go the same route as Mark Hodges and simply replace the cards with Intel cards. I don’t have the exact model (I no longer work at the company where the servers are located) but the problem affected so many servers (including non-VMware DL980 G7s) that it was thought to be best to simply rip and replace.

  11. Ken permalink - Dec 28th, 2011

    We have 8 DL580 G7s.
    Each hosts has the NC375i NC375T (2 x 4 ports)

    When we started we experienced the Host Isolation issue.
    Firmware at the time was 4.0.554 and 4.0.554
    Default Driver.

    We upgraded to 4.0.579 firmware using a Windows 2008 R2 installed on a hard drive and the online firmware updater. We also loaded 4.0.594 driver.
    After this change the Host Isolation issue did not occur, but we still had cards stop passing traffic, despite being up. In this condition no VLANs shown in Network Card configuration, but the cards showed 1000/Full. This was a huge problem as VMs were not moved when they needed to be.
    At this point we were using only Link Status as the Network Failover Detection method.

    Sooo…we switched to Beacon Probing for everything. This allowed us to identify network outages and have VMs be moved to functioning network adapters. We also noticed that with firmware 4.0.579 we no longer had both cards fail at the same time.

    We also upgraded to 4.0.598 driver using the vihostupdate.pl command.
    vihostupdate.pl --server --install --bundle C:Downloadsoffline-bundle602ntx-nx_nic-4.0.602-164009-offline_bundle-509624.zip

    After loading the 4.0.598 driver (still using firmware 4.0.579) we had better overall stability. Although we still would get notices from vCenter like this:
    [VMware vCenter - Alarm Network uplink redundancy lost] Lost uplink redundancy on virtual switch “vSwitch0”. Physical NIC vmnic0 is down. Affected portgroups: “vMotion”, “Management Network”.

    Sometimes the outage was so brief that the alerts would not even show up in vCenter. VM connectivity seemed pretty good, although we did have several occasions where a host had slow network access and vMotioning its VMs to another host resolved the issue.

    Yesterday we began loading the 4.0.602 nx_nic driver from this package.
    “VMware ESX/ESXi 4.x Driver CD for Qlogic for NIC Driver for NetXen HP NC522SFP (P3) Ethernet Devices”
    It loaded successfully on all of our Hosts, although it has not been long enough yet to know if this finally fixes the remaining up/down alerts we experience.
    With the 4.0.598 driver those alerts did not start for several weeks, but then occurred every couple of days or more per host.

    We are strongly leaning toward replacing all 8 ports per host with Intel cards if this driver does not resolve the issue.

    This has been a very trying process and I loathe the person who ordered the hardware. They thought “having the same card will lead to less chance of problem due to driver conflicts”… too bad it also leads to a single point of failure when the driver is the conflict.

    I hope this helps others as we have invested far more time than we should have to considering we are using hardware on the HCL from name brand vendors.

  12. Mark Hodges permalink - Dec 28th, 2011

    We are now getting onwards of 2+ months without a single issue since I put in the X520 single port cards. Those HP cards are now all sitting in a drawer.

    Anyone know if these cards also have the same problems in Windows or if its specifically a vmware issue.

    Personally I suspect that having the dual ports is causing the cards to overheat (since one of the HP advisories recommended turning up cooling and not putting cards into specific slots) and the X520s seem to run much cooler…

    • Daniel permalink - Sep 19th, 2012

      Do you have a link to this advisory? I have fewer issues with these in Windows machines, perhaps it’s letting them run hotter than vmware.

  13. Ken permalink - Dec 28th, 2011

    Well, we have had 3 alerts since we upgraded to 4.0.602.
    This driver did not resolve the issue.
    Still using 4.0.579 firmware (the latest available that I know of).

    The NC375 cards are quad (4) ports per card.
    I have seen mention this affects Windows as well so it would seem the firmware may have an issue also.

  14. Simon Ling permalink - Jan 3rd, 2012

    We have the NC375T 4-port running on an Ubuntu server, firmware 4.0.550, and I can’t set the speed to gigabit without the whole card dropping off the network until I restart the networking. ethtool reports that it has changed the settings but afterwards there is no activity; even the connection light on the switch dies.

    • Simon Ling permalink - Jan 3rd, 2012

      Please ignore my post, as I have just discovered that the problem was in fact that the switch is not actually gigabit, although that wasn’t stated in its specification, only in the blurb! Important life lesson for me there.

  15. Christian permalink - Jan 22nd, 2012

    DL380s with NC522SFP 10G using 4.0.602 drivers. Still has random disconnects and drops network traffic.

    Anyone have thoughts on the QLE8242?

  16. Brian permalink - Jan 26th, 2012

    My shop has suffered from this issue a few times already…
    This is a serious problem with those QLogic cards!
    Below is the message log just captured from one of the DL585 G7s.
    I have both the 375i (onboard) and 2 x 375T add-on cards.
    This firmware affects all cards in the server.

    Jan 26 14:29:30 cimslp: Found 46 profiles in namespace interop
    Jan 26 14:52:38 vmkernel: 10:10:02:53.215 cpu19:4519)nx_nic[vmnic6]: Device is DOWN. Fail count[8]
    Jan 26 14:52:38 vmkernel: 10:10:02:53.215 cpu19:4519)nx_nic[vmnic6]: Firmware hang detected. Severity code=0 Peg number=0 Error code=0 Return address=0
    Jan 26 14:52:38 vmkernel: 10:10:02:53.365 cpu19:4519)IDT: 1565: 0x99
    Jan 26 14:52:38 vmkernel: 10:10:02:53.365 cpu19:4519)IDT: 1634:
    Jan 26 14:52:39 vmkernel: 10:10:02:53.367 cpu19:4519)IDT: 1565: 0xa1
    Jan 26 14:52:39 vmkernel: 10:10:02:53.367 cpu19:4519)IDT: 1634:
    Jan 26 14:52:39 vmkernel: 10:10:02:53.368 cpu19:4519)IDT: 1565: 0xa9
    Jan 26 14:52:39 vmkernel: 10:10:02:53.368 cpu19:4519)IDT: 1634:
    Jan 26 14:52:39 vmkernel: 10:10:02:53.369 cpu19:4519)IDT: 1565: 0xb1
    Jan 26 14:52:39 vmkernel: 10:10:02:53.369 cpu19:4519)IDT: 1634:
    Jan 26 14:52:39 vmkernel: 10:10:02:53.370 cpu19:4519)nx_nic[vmnic6]: Load stored FW
    Jan 26 14:52:44 vmkernel: 10:10:02:58.549 cpu19:4519)nx_nic: Loading firmware from file , version = 4.0.579
    Jan 26 14:52:44 vmkernel: 10:10:02:58.783 cpu19:4519)VMK_PCI: 746: device 000:071:00.0 capType 16 capIndex 192
    Jan 26 14:52:44 vmkernel: 10:10:02:58.783 cpu19:4519)nx_nic: Gen2 strapping detected
    Jan 26 14:52:46 vobd: Jan 26 14:52:46.182: 900180598394us: [vob.net.pg.uplink.transition.down] Uplink: vmnic6 is down. Affected portgroup: Management Network. 1 uplinks up. Failed criteria: 130.
    Jan 26 14:52:46 vobd: Jan 26 14:52:46.182: 900180598509us: [vob.net.vmnic.linkstate.down] vmnic vmnic6 linkstate down.
    Jan 26 14:52:46 vmkernel: 10:10:03:00.316 cpu21:4519)VMK_PCI: 746: device 000:071:00.0 capType 16 capIndex 192
    Jan 26 14:52:46 vmkernel: 10:10:03:00.354 cpu21:4519)nx_nic NetXen NX3031 Quad Port Gigabit Server Adapter Board S/N TI19BK0822 Chip id 0x1
    Jan 26 14:52:46 vmkernel: 10:10:03:00.400 cpu21:4519)IDT: 1036: 0x99 exclusive (entropy source), flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.400 cpu21:4519)VMK_VECTOR: 143: Added handler for vector 153, flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.400 cpu21:4519)IDT: 1133: 0x99 for vmkernel
    Jan 26 14:52:46 vmkernel: 10:10:03:00.401 cpu21:4519)VMK_VECTOR: 231: vector 153 enabled
    Jan 26 14:52:46 vmkernel: 10:10:03:00.478 cpu21:4519)nx_nic NetXen NX3031 Quad Port Gigabit Server Adapter Board S/N TI19BK0822 Chip id 0x1
    Jan 26 14:52:46 vmkernel: 10:10:03:00.595 cpu21:4519)IDT: 1036: 0xa1 exclusive (entropy source), flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.595 cpu21:4519)VMK_VECTOR: 143: Added handler for vector 161, flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.595 cpu21:4519)IDT: 1133: 0xa1 for vmkernel
    Jan 26 14:52:46 vmkernel: 10:10:03:00.595 cpu21:4519)VMK_VECTOR: 231: vector 161 enabled
    Jan 26 14:52:46 vmkernel: 10:10:03:00.673 cpu21:4519)nx_nic NetXen NX3031 Quad Port Gigabit Server Adapter Board S/N TI19BK0822 Chip id 0x1
    Jan 26 14:52:46 vmkernel: 10:10:03:00.772 cpu21:4519)IDT: 1036: 0xa9 exclusive (entropy source), flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.772 cpu21:4519)VMK_VECTOR: 143: Added handler for vector 169, flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.772 cpu21:4519)IDT: 1133: 0xa9 for vmkernel
    Jan 26 14:52:46 vmkernel: 10:10:03:00.772 cpu21:4519)VMK_VECTOR: 231: vector 169 enabled
    Jan 26 14:52:46 vmkernel: 10:10:03:00.832 cpu21:4519)nx_nic NetXen NX3031 Quad Port Gigabit Server Adapter Board S/N TI19BK0822 Chip id 0x1
    Jan 26 14:52:46 vmkernel: 10:10:03:00.878 cpu21:4519)IDT: 1036: 0xb1 exclusive (entropy source), flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.878 cpu21:4519)VMK_VECTOR: 143: Added handler for vector 177, flags 0x10
    Jan 26 14:52:46 vmkernel: 10:10:03:00.878 cpu21:4519)IDT: 1133: 0xb1 for vmkernel
    Jan 26 14:52:46 vmkernel: 10:10:03:00.878 cpu21:4519)VMK_VECTOR: 231: vector 177 enabled
    Jan 26 14:52:46 vobd: Jan 26 14:52:46.683: 900181099377us: [vob.net.vmnic.linkstate.up] vmnic vmnic6 linkstate up.
    Jan 26 14:52:47 vobd: Jan 26 14:52:47.684: 900179885774us: [esx.clear.net.vmnic.linkstate.up] Physical NIC vmnic6 linkstate is up.
    Jan 26 14:52:48 vobd: Jan 26 14:52:48.685: 900180886863us: [esx.problem.net.redundancy.lost] Lost uplink redundancy on virtual switch “vSwitch0”. Physical NIC vmnic6 is down. Affected port groups: “Management Network”.
    Jan 26 14:52:50 vmkernel: 10:10:03:04.481 cpu7:4103)nx_nic[vmnic6]: NIC Link is up

  17. Brian permalink - Jan 26th, 2012

    Actually I have two 523SFP 10Gb cards on that server too with the latest firmware 4.8.2. I am afraid that QLogic’s firmware still has problems there too.

  18. Lutz permalink - Feb 3rd, 2012

    Hi, just for clarification: the NC375 / NC375i and 522SFP 10G use NetXen-based chipsets, which are causing the issues. But the 523SFP should use a different “non-NetXen” QLogic chip, or am I wrong here?

    We are looking for 10 GbE NICs and HP doesn’t seem to have 10 GbE Intel-branded Adapters. Our current plan was to use the 523SFP (other Options are 550SFP or 552SFP with Emulex-Chips).

    Has anyone tried to get the cards replaced by HP (which are unfortunately used as onboard NICs on the DL370 G6 we currently use)?

    For statistics: we are using NC375i on Windows Server 2008 R2 with DataCore on top and see occasional iSCSI disconnects for about 5 secs about once every week (Fw. 4.0.534), however we see no packet errors in the statistics, so I suspected the problem was somewhere else.

  19. Oliver Antwerpen permalink - Feb 3rd, 2012

    Hi,
    You are right. The NC523 uses ql_nic, not nx_nic. I have several boxes with NC522/NC375, all causing issues in ESX 4 and 5 with different firmware and drivers. I also have several NC550/NC552 Emulex cards causing no problems. HP only swapped the 522s for new 522s, but that obviously did not help. We are currently planning to rip out the NC522s and replace them with NC550s.

  20. Greg permalink - Feb 16th, 2012

    We have these jokes of a card in our cluster as well. We have upgraded the firmware and drivers and are still having this issue. We are replacing all the cards with Intel NICs, since HP will only replace the cards like for like, which we’ve already done. Dell is starting to look really good given HP’s clear lack of vision and support that has gone down the tubes in the past three years.

  21. afokkema permalink - Feb 27th, 2012

    Same symptoms and troubles here with those NetXen adapters (NC375T). I am working with HP on this case. When I have some news about this case I will update it here.

    P.s. we are running the latest firmware and drivers:

    [root@esx ~]# ethtool -i vmnic4
    driver: nx_nic
    version: 4.0.602
    firmware-version: 4.0.579
    bus-info: 0000:09:00.0

    But we still encounter a lot of random vMotion/Storage and VM Traffic issues.

  22. Brian permalink - Mar 1st, 2012

    I am already tired playing with HP and i got them replace all cards to intels (365T).
    by the way, the 523SFP seems does not have problem on the latest firmware.

  23. lazyllama permalink - Mar 6th, 2012

    We’ve been having the same problem with the NC375T cards.
    Every now and then one will cause the driver to log “Firmware hang detected.” and reset itself.
    We’re already running the recommended firmware and driver so those do not fix the issue (as of 6th March 2012).
    I’ve got an open case with HP to get a fix.

    # ethtool -i vmnic7
    driver: nx_nic
    version: 4.0.602
    firmware-version: 4.0.579
    bus-info: 0000:0b:00.3

    • arjunbalachandra permalink - Jul 10th, 2012

      Hey did you get a fix for this one ??

      I have a client running into the same problems. I wanted to be certain it is a hardware-level issue.

      I have verified through the logs the message is same in regards firmware puking for no reason.

      Regards

      Arjun
      Vmware Inc.

  24. predragc permalink - Mar 8th, 2012

    Guys, can somebody help me and give me steps to update firmware on NC375i and NC375T cards which are installed in DL980.
    The operating system is ESXi 4.1 and the driver is the latest, 4.0.602, but how can I safely update the firmware on both cards?

    Thanks in advance.

  25. ollfried permalink - Mar 16th, 2012

    Just download latest SPP (2012-01) from http://www.hp.com/go/foundation and boot from that DVD.

  26. KD permalink - Mar 26th, 2012

    I have several DL585 G7s (NC375i) running vSphere 4.1 that have experienced this issue over the last few months and this entry has been a great help. So I wanted to share that HP support contacted me to say that they are replacing the SPI boards for my 585s with a newer Rev board. The criteria is machines that have the “firmware hang detected” message in the vmkernel logs and they are running the 4.0.602 driver and 4.0.579 firmware. I was told there is a backlog on the boards and to expect delivery in 3-4 weeks.

    • ollfried permalink - Mar 26th, 2012

      Can you share some information, maybe case number? I have open cases, too…

  27. cdunn permalink - Apr 3rd, 2012

    I just wanted to add that we have 10 DL380 G7s with two NC523SFP cards in each. We have two links from each server plugged into two Nexus switches. We are getting a single random dropped link on each server periodically. I’m at driver version 4.0.727 and firmware version 4.8.22. I just want to make people aware if they decide to buy this card. It could still be a configuration, switch, card, or SFP cable problem. It’s just odd that it happens on all of them at different times.

    • Mark Hodges permalink - Apr 3rd, 2012

      We were using disparate switches, cables and NICs and we were dropping completely.
      After replacing the NICs (and using the Intel SPs) we have not dropped once… since Sept.

      • ollfried permalink - Apr 3rd, 2012

        You replaced the cards with the same type?

      • Mark Hodges permalink - Apr 3rd, 2012

        nope..replaced them with the Intel x520 single port’s I believe…after that no more problems…unfortunately that was 11k worth of hardware replacements I had to do…and we are afraid to use the HP’s for anything else.
        Our main problem with the HP’s was pause frame flooding which would basically kill all traffic on the network and wouldn’t failover to the other

  28. 42 permalink - Apr 9th, 2012

    We have had “firmware hang” problems with all kinds of QLogic cards for 6 months: the DL580 G7 onboard NC375i, the quad port NC375T and the 10GbE NC523SFP. Now we are supposed to run a QLogic debugging tool in the background on each ESXi server that takes a core dump in case of a hang… Not very promising.

  29. afokkema permalink - Apr 10th, 2012

    We also replaced the NetXen adapter with HP for the Intel 365T adapters. No problems since the replacement.

  30. 42 permalink - Apr 11th, 2012

    We’re now getting new Qlogic cards from HP, including new SPI board with the onboard NIC ports. Don’t know what I should think about that. But we don’t get cards with Intel chips.

    • ollfried permalink - Apr 11th, 2012

      We also get *new selected* QLogic 10Gb Cards. Wonder what will happen then…

  31. cdunn permalink - Apr 16th, 2012

    HP is sending us new cards also, but it looks like it will be the same card. I opened a ticket with VMware and they are leaning towards driver/firmware. Meanwhile I’m going to do some more testing and see if only certain configurations are causing this in our environment. There was a discussion on the VMware communities about disabling the onboard NICs and having only the 10Gb ports be seen by ESXi. I tried that and I had the same issue. I also separated the traffic between 4 10Gb links instead of just 2, thinking it could be MTU related, and I had the same issue. I’m going to go to an explicit failover and remove iphash to see if that helps. So far I can trigger it every 24 hours or so if I do a vMotion between two hosts twice an hour.

    • Mark Hodges permalink - Apr 16th, 2012

      Right now we have 2 single port Intel 520 10G NICs and 2 onboard 1G NICs active and have not had the problem.
      With the 522 cards you are not able to disable one of the ports on the dual port card so you can’t have any of the 1g nic’s active (since 4-10g means you cannot have any 1g cards)

      We did try going with a single dual port 522 (so we were running 2-10G and 2-1G) and we still fell over with the 522s…

  32. Jonesy permalink - Apr 17th, 2012

    cdunn,

    It sounds like you have the exact same setup that we have. The randomness of it all was what drove me crazy! We replaced the 523 cards with dual port x520s back in February and have not had any issues since then.

    42.
    Did you ever get anywhere with the firmware dumps?

    If it turns out that HP is replacing “faulty” nics, I might have to call and get the ones sitting in a box replaced with ones that work.

  33. cdunn permalink - Apr 19th, 2012

    Just to keep people updated: I’ve removed iphash and the port channel. On the vSphere 4.1 cluster I’m testing on, I have had no issues so far and it’s been almost 48 hours, BUT I did make the change on our Exchange ESXi cluster and I had a host fail with the Firmware Hang Detected message, so it doesn’t look like the config changes are working. I’m going to try what others are doing and replace 13 unboxed NC523SFPs with NC552SFPs. I’m going to buy an Intel card to test with also, since you guys are having good luck with them. Also, on the ESXi hosts for our Exchange servers I didn’t disable the 4 1Gb NICs, so I’m going to try that on those servers while I’m waiting for the other cards. I noticed that the NC523SFP card that HP sent me to replace my other NC523SFP had Rev: 0C on the sticker. The cards in the servers that are failing are of the same Rev 0C, so I’m not sure it’s going to make a difference changing it. On the VMware thread I’m on, other users were mentioning having Rev 0B. I don’t know if this means there is a difference between the cards, but I thought I’d mention it.

  34. nate permalink - Apr 20th, 2012

    Just a note – the HP NC523SFP has similar issues. I have a bunch of servers each have two of the cards and the cards would regularly fail. There was urgent firmware released late last year along with a bios change to increase the cooling but it doesn’t help (didn’t do anything for me). HP and Qlogic are trying to keep it quiet because there are a lot of cards to replace. It took me almost two months to get replacement cards. There apparently was a manufacturing issue with the original sets of cards and there is a new hardware revision that fixes it. For HP the spare part# is 593715-001 (what L2 told me when I told them a big package had arrived with that number on it – my servers are remote so I didn’t know what was in the box). You can’t get this part# if you just call HP and tell them to replace your NIC, you have to go through escalation and stuff. My new NICs shipped from somewhere deep in the innards of HP direct to me, they didn’t go through the field team for our 4 hour on site support.

    I had them replace the NICs in the servers every other day and at least so far it seems to have resolved the issues. The L2/L3 support folks at HP claim to be confident that this revision of hardware resolves the outstanding hardware issues on the 523SFP. I was impressed with how well the HP techs cabled up my systems. I mean I took great pains to label and stuff, but there are 11 cables on each of my DL385G7s, and at least for the 10GbE and FC (boot from SAN) – if you plug em in to the wrong ports bad things happen. But I verified each and every 10GbE port was correct after they replaced the NICs and they were every time. Same goes for FC (with the boot from SAN if the FC cables are swapped the card won’t see the LUN and won’t boot – not hard to fix just go to the bios and re-scan again and change the LUN, but still annoying).

    It’s a widespread issue. On two different occasions I had full-on network card failure – the cards would not pass traffic. I could get them working again for a few hours by rebooting, but then they would die again. VMware would sit there for hours resetting the card over and over. 3 cards in two servers in less than a week – and in both cases, despite having 4 hour on site support, there were no spare parts in the area. In one case I had to wait 5 days for new cards (and these were not the new cards, these were just replacement old cards until the new cards arrived) because of miscommunication inside HP.

    If you see messages like this on boot up – replace your card too (if it wasn’t obvious) :)

    NIC boot code starting cmd peg times out! status 0xffffffff
    ql_update_adapt_cfg failed to init nic rc fffd for pci dev c00
    SetupOSCD failed to init nic rc fffd for pci dev c00
    Abort QLogic NIC boot process

    That first message I saw on one system several months ago, I didn’t know what it was for so I ran diagnostics at the time but didn’t get anything from them.

    Fortunately we designed our systems with two cards each and split our distributed virtual switches across them (active/passive – no load balancing), so when one card completely failed the other one took over pretty quickly. Also, all of our storage is good ol’ Fibre Channel, so that never skipped a beat. I also put all of our service consoles on a separate redundant 1GbE network. I planned for the worst when I built it – I honestly didn’t think I’d ever need it.

    This is a well known issue inside HP, so you shouldn’t have trouble getting on the list to get new cards if you can provide the log files to show the issue you’re having. They also may have you provide a firmware dump, which can be painful because the firmware dump rendered my NICs inoperable until I rebooted. Also, to get a firmware dump you have to wait until the problem recurs; you can’t just trigger it on demand. For me it took about 3 days until the NICs failed again on that particular server. I was expecting the backup NICs to take over, but it seems the firmware dump process did something to them or to the drivers: the backup NICs didn’t work either and all VMs lost connectivity. Fortunately I was still able to manage the server over the 1GbE network.

    I feel for those folks that are relying on IP storage and just have a single NIC since these problems cause both ports on the NICs to fail simultaneously.

    hopefully this can help some of you out there. I haven’t posted on this topic on my own blog yet.

    There is another known issue on the NC522SFP (I believe) that I was made aware of recently – apparently certain passive copper cables can trigger this NIC to flap its network ports (not the same behavior as the 523 issue). This is a known issue in the firmware (no fix yet) – and is hit or miss depending on what card you have (different cards with the same model# can behave differently). The only true workaround if you have this issue is to use optical cable instead of passive copper. Or maybe you can get lucky by changing the passive copper cable you’re using, as it doesn’t happen on all cables (again – I was told it was hit or miss, they haven’t found a sure workaround other than to use optical).

  35. nate permalink - Apr 20th, 2012

    also – you can’t order new systems with the new cards. My HP VAR has been trying for the same two months to get new NICs for other servers that we bought last year(but are not using yet they are being shipped to europe soon), and so far has not had much luck.

    I’m sure that given the configuration is the same and it’s for the same customer(me) that it’s just a matter of time, but still it’s taking a while

  36. 42 permalink - Apr 24th, 2012

    @Jonesy: after uploading the firmware dumps to HP (about 8 of them) HP decided to send us new NC375T cards and new SPI boards with NC375i. I’m still waiting for the cards to arrive. The NC523SFP 10GbE cards were replaced some months ago with NC550 Emulex cards.

    • Jonesy permalink - Apr 27th, 2012

      Thanks for the update! Did HP actually find anything with all those firmware dumps or are they just throwing different hardware at the problem?

      • 42 permalink - May 1st, 2012

        I don’t know what HP or Qlogic did with the dumps. Shortly after we uploaded the dumps we received the message from HP that Qlogic confirmed the problem and that we’ll get cards with a revision of the board/chip. Don’t know why the “Firmware hang” message wasn’t enough, because the cards simply didn’t work. We had 2 HA failures because the ESXi host was not reachable for > 30 seconds.

      • 42 permalink - May 2nd, 2012

        I received the new NC375T cards today. The revision number is 0G. We have at least one old card that already has Rev. 0G; the rest are 0F. So maybe we are lucky and 0G really fixes the problem.

  37. cdunn permalink - May 7th, 2012

    NC523SFP issues. I’ve been testing with X520-DA2, NC552SFP and NC550SFP cards and I’ve had zero issues with any of them. I left a host in each of my vSphere clusters with the NC523SFP cards and they would fail every few days, while the other cards would stay up. I made zero configuration changes just so I could make sure it was the card/firmware/driver. I’ve been testing this for the last couple of weeks. We are going to purchase NC552SFP cards as our solution to this issue. The 20+ NC523SFP cards will be reused in our Windows servers. There is another guy over on the VMware forums doing similar testing and he’s had similar results.

    • Mark Hodges permalink - May 7th, 2012

      So, the 522 and 523 cards have been failing in the vmware environments, but they do work properly without failure on Windows Servers?
      I would really like to use these 14x 522 cards if I was confident they wouldn’t fall over on us…
      Waiting on our HP rep to investigate replacing them….

      • cdunn permalink - May 11th, 2012

        We are testing it with our NetBackup solution, which is on a Windows platform. I hate to have the NC523SFP cards just sit around not being used, so I’m hopeful I can use them there. Based on the comments it sounds like that might not be the case. I was hoping the issue was more driver related. We are still testing, so I don’t know yet whether they will work for us in Windows or not.

  38. Ken permalink - May 7th, 2012

    I would be careful about using these cards in your Windows Servers, I have seen references on the Internet that this can affect Windows as well. People with large SQL servers were reporting the issue.

    We replaced ours with Intel cards about 3 months ago and have not seen a single network-related issue since then. I was not interested in being a beta tester for QLogic/HP.

    • cdunn permalink - May 15th, 2012

      Ken, you are right: after further testing we are having issues with the Windows servers also. I hate to have the NC523SFP cards lying around and not be able to use them, but I just don’t trust them.

  39. lazyllama permalink - May 9th, 2012

    We have had two replacement (rev.G) NC375T cards sent to us by HP and they seem to have resolved the issue.

    We had 4 pre-rev.G cards which were causing us problems. We swapped all of the faulty cards out for NC364T cards we had in stock as soon as we were able as loss of network connectivity on ESX hosts using Ethernet-connected storage was causing outages and corruption.

    Unfortunately HP will only replace the other two cards if we put them back into servers and reproduce the error with associated logs etc. I’m not prepared to put customer services at very real known risk in order to prove each card is faulty so we will have to take the hit.

    Extremely unimpressed by that latter response from HP and quite glad that we’re switching away from them in the near future.

  40. Jim permalink - May 31st, 2012

    Am running into this problem despite updating firmware on the cards.. presumably I need to install the VMware driver *as well as* the firmware update? When the card “crashes” I can see in the vmkernel the firmware version of the card (4.0.579) but am running an older driver (4.0.550.1-1vmw) …
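When comparing driver and firmware versions across hosts, as in the comment above, it can help to pull both values out of the `ethtool -i` output in one step. The following is a minimal sketch, assuming Python 3 is available on a workstation where the output has been saved or pasted; the helper names `parse_ethtool_info` and `describe_versions` are hypothetical, and the sample text is the output shown earlier in this post.

```python
# Sample `ethtool -i vmnicX` output, copied from the post above.
SAMPLE = """driver: nx_nic
version: 4.0.598
firmware-version: 4.0.534
bus-info: 0000:04:00.0
"""


def parse_ethtool_info(text):
    """Turn the 'key: value' lines printed by `ethtool -i` into a dict."""
    info = {}
    for line in text.splitlines():
        # Split on the first colon only, so values such as the PCI
        # address in bus-info keep their own colons.
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def describe_versions(text):
    """Summarise driver name, driver version and firmware version."""
    info = parse_ethtool_info(text)
    return "driver {} {} / firmware {}".format(
        info.get("driver", "?"),
        info.get("version", "?"),
        info.get("firmware-version", "?"),
    )


if __name__ == "__main__":
    print(describe_versions(SAMPLE))
```

The driver version and the firmware version are reported on separate lines, so a host can easily end up with one updated and the other left behind.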

  41. Arthur permalink - Jun 4th, 2012

    We have been experiencing issues with all QLogic NICs except the NC522SFP. Sure, they had issues early on, but they have been very stable since the last firmware release in late 2011.
    Unfortunately the NC375Ts and NC523SFPs were so unstable they have been banished to the lab and replaced with NC365Ts and NC552SFPs respectively.
    Now to the NC375i: these have been simply horrible. HP refused to accept there were issues with them early on, then QLogic released a firmware and driver update to address the precise issue that HP said didn’t exist. Since the last firmware and driver update the issue still persists. HP have not been very supportive this time around either, but are at least talking about it.
    I hear there is a new SPI riser board available which uses updated hardware; apparently this resolves the issue of these NICs pausing. They only seem to do this when pushed very hard. I see utilisation of up to 600Mb/s, then the NICs just stop transmitting and receiving. All hell breaks loose, alarms start ringing and the phone calls start.

    I’m over it, I really need a solution which does not involve me throwing money at HP to replace something we have already purchased but doesn’t do the job it’s designed to do.

    Oddly enough Dell are aware of these issues and have offered assistance. They are looking to get a foot in the door of the data centre, so to speak. They might get it yet.

    Simply Dell are offering to be part of the solution, HP seem content to be part of the problem.

  42. Arthur permalink - Jun 4th, 2012

    Just a word for HP if you are reading.
    Everything can suffer problems, what defines the better party is the method they utilise to resolve these problems.

  43. 42 permalink - Jun 13th, 2012

    After all our NC375i and NC375T were replaced by cards with a new hardware revision we had no firmware hang problems for 5 weeks…. until today. One NC375i card had a firmware hang this morning. So it seems the new revision improves the situation but does not solve it 100%.

  44. Arthur permalink - Jun 17th, 2012

    There is a new firmware/driver package available. I’ve no idea if this resolves anything and I have no confidence in QLogic’s statements about it resolving anything. Hey, the last 2 firmware/driver packages have resolved the issue... obviously NOT.
    http://h20000.www2.hp.com/bizsupport/TechSupport/Document.jsp?objectID=c02964542
    This package is for NC375i, NC375T, NC522SFP nic’s

    After applying the package on a Dev server I now see.
    driver: nx_nic
    version: 5.0.619
    firmware-version: 4.0.588

    There was a catch to this.
    After applying the package via the VMWare update manager.
    The server rebooted as expected.
    When the server was back up the VC could not connect to it.
    Further investigation showed the NC375i nic’s were not initialised or connected, even though they were before applying the package.
    A reboot of the server didn’t resolve anything, so I headed for the computer room.
    In this case I needed to power the hardware down to disconnect the iLO, as it was connected to my PC at another site which was not accessible to me from within the computer room. So I truly pulled the power on the server.
    Powered the server up again and ILO’ed to the server from my laptop within the Computer Room..
    To my surprise the NC375i nic’s were all up and working.
    It appears that in at least some situations it may be necessary to power the server off after applying this firmware/driver package.
    Dodgy, dodgy stuff, HP, or more to the point QLogic.

    I hope this rather arduous process isn’t necessary on all servers but there’s my experience so far.
    Just to repeat
    Dodgy, dodgy stuff, HP, or more to the point QLogic.

    • Oliver Antwerpen permalink - Jun 17th, 2012

      This is also true for Emulex CNA firmware upgrades, but Power -> Cold Boot via iLO does the job. You do not need physical access to pull the plugs.
      Also, the driver/firmware you are referring to is three months old and did not fix the problems.

  45. Tim P permalink - Jul 6th, 2012

    Not sure how old this forum is, but we have been experiencing issues with the NC522SFP / DL580 G7 servers for a year and a half. We finally have a resolution, so I hope this helps everybody here. It turns out that there is a manufacturing issue with the QLogic chipset that of course affects the NC522SFP 10GbE cards as well as the onboard cards on a DL580 G7 server. Running the following on an ESX host will show whether the server is affected:

    cat /var/log/vmkernel* | grep -i -e "firmware hang" -e "device is DOWN"

    What you have to do is open a case with HP and get the SPI board or the NC522 card replaced. HP replaced every single one of our cards and server SPI boards, and we have lots of them: (20) DL580 G7s running ESX and (20) NC522 cards. What a nightmare this has been, but I am glad they have a solution. As a note, I tried every driver, firmware version, etc. and nothing helped, which is not surprising since it is a hardware manufacturing issue. Knock on wood, all is good thus far, and yes, it doesn’t look good on the resume when you have hosts going down and people are pointing the finger at you. By the way, the problem has lasted since early 2011 and we are in July of 2012. Good luck!
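If the vmkernel logs have been copied off the host, the same check as the grep command above can be scripted. This is a minimal sketch, assuming Python 3 on a workstation and uncompressed log files; `find_nic_failures` and `scan_logs` are hypothetical helper names, and the two search phrases are the ones from the comment above.

```python
import glob
import re

# The same two phrases as the grep command, matched case-insensitively
# (the equivalent of grep -i -e ... -e ...).
PATTERN = re.compile(r"firmware hang|device is DOWN", re.IGNORECASE)


def find_nic_failures(lines):
    """Return the log lines that mention either failure phrase."""
    return [line.rstrip("\n") for line in lines if PATTERN.search(line)]


def scan_logs(paths):
    """Map each log file path to its matching lines, skipping clean files."""
    hits = {}
    for path in paths:
        # errors="replace" keeps the scan going if a file contains
        # bytes that are not valid text.
        with open(path, errors="replace") as handle:
            matches = find_nic_failures(handle)
        if matches:
            hits[path] = matches
    return hits


if __name__ == "__main__":
    # Assumed location; point this at wherever the logs were copied.
    for path, matches in scan_logs(sorted(glob.glob("/var/log/vmkernel*"))).items():
        print("{}: {} matching line(s)".format(path, len(matches)))
        for line in matches:
            print("  " + line)
```

Grouping the matches per file makes it easier to see how often a host has hit the problem across rotated logs, which is the kind of evidence HP asks for when opening a case.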

  46. Erik Briggs permalink - Sep 24th, 2012

    9/24/2012 – I had been having the issue about once every 6 months (DL585 G7) until about 2 weeks ago. It has happened 3 times since then. Over this past weekend, I updated to the firmware dated 9/4/12 (4.0.588), which says it specifically addresses this issue. I also updated to the latest drivers. Within less than a day, the server had already hung the NIC, before anyone had even hit the system (<1 hour of use).

    I opened a case today, and they are shipping me a replacement SPI board. The new firmware/drivers do NOTHING to help this issue, as it is a hardware problem.

  47. Mark Hodges permalink - Oct 27th, 2012

    Finally, 12 months later and after a discussion about moving to IBM, we have replacement 522SFP NICs; the rep states there is a known issue with the earlier revision cards.
    The old revision cards were 0B and the new revision is 0H. I put 2 cards into a Windows server and never had a drop, even though I hammered them with 1.3TB of data transfer…
    Will put 2 into a new ESX 5 box at some point and see if they survive…

  48. Peter C permalink - Nov 30th, 2012

    We’ve been chasing down this problem for nearly two months and narrowed it down to the NC375T network adapter as well. Most vDS connections have a redundant connection to another model of NIC, so we moved the problem uplink to standby. No combination of drivers and firmware fixes the problem. HP just sent out a replacement card which seems the same (though I did not check the hardware rev; next time I open the server, I guess). The problem only occurs under load, so time will tell. In the meantime we’re replacing one of the NC375T adapters per server with an Intel-based card instead. Painful.

Trackbacks & Pingbacks

  1. Quiet Before the Storm: HP DISCOVER in Vienna « Wahl Network
  2. HP NC375i Netzwerkadapter: Resetting the device because the device is not responding « layer9.
  3. NetXen HP NC522SFP Network Flooding | Le cloud de Piermick
