Explicit Failover Shenanigans when Upgrading to ESXi 5.x

Found an issue when upgrading my lab hosts to ESXi 5 using VUM (vSphere Update Manager) and thought I'd share in a format that's a bit longer than 140 characters.

The Scenario

It seems that when a dvSwitch portgroup uses explicit teaming failover policies, the upgrade process does not respect those policies once the host comes back up. I have not tested this on a standard vSwitch.

In my case, I have 4 uplinks in a dvSwitch for management and VM traffic. For management traffic, 2 of the uplinks are marked active and the other 2 are marked unused. For VM traffic, the opposite holds true.

  • Mgmt1 (vmnic3) / Mgmt2 (vmnic12) = management traffic and VLAN
  • VM1 (vmnic7) / VM2 (vmnic13) = VM traffic and VLANs
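
For reference, this is roughly how I sanity-check the uplink-to-vmnic mapping from the ESXi shell (a quick sketch; dvSwitch-Main is the name of my dvSwitch, yours will differ):

  # List standard and distributed switches; the DVS section shows
  # which vmnics are attached to which DVPort IDs on dvSwitch-Main
  esxcfg-vswitch -l

  # Cross-check link state and speed of the physical NICs
  esxcli network nic list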

Keep in mind that this is not a typical setup; it's only something I did in my lab to test something unrelated. I set it up this way because the physical switch ports are not trunked to let every VLAN ride every uplink, so I'm using explicit failover policies to make sure traffic uses the correct uplinks.

The Upgrade to ESXi 5.x

After letting VUM upgrade my host, I waited. And waited. Where is my host? Why is it not connected to vCenter?

I connected to it via iLO. IPs look good, switches are right, uplinks are uplinking, lights are blinking, flux capacitor is … fluxing. But I can’t ping anything. Let’s check ESXTOP.
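
For anyone following along at home, these are roughly the checks I mean, straight from the ESXi shell over iLO (the gateway IP is a placeholder):

  # Confirm the vmkernel NICs still have the IPs and netmasks I expect
  esxcfg-vmknic -l

  # Try to reach the default gateway from the vmkernel stack
  vmkping <gateway-ip>

  # Then fire up esxtop and press 'n' for the network view to see
  # which physical uplink each vmknic is actually mapped to
  esxtop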

Here’s the issue – the management (vmk0) and vMotion (vmk2) vmknics are trying to go out the wrong uplinks! vmnic7 is for VM traffic (VM1 on the dvSwitch).

I did a quick and dirty fix – removed vmnic7 from the uplinks on dvSwitch-Main.
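
For anyone who wants the shell version of that quick and dirty fix, it looks roughly like this (the DVPort ID 289 is just an example, so grab the real one for vmnic7 from the esxcfg-vswitch -l output first):

  # Find the DVPort ID vmnic7 is using on dvSwitch-Main
  esxcfg-vswitch -l

  # Unlink vmnic7 from the dvSwitch (289 is an example DVPort ID)
  esxcfg-vswitch -Q vmnic7 -V 289 dvSwitch-Main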

The quick fix worked: the vmknics chose vmnic12 (Mgmt2), and the server is now responding.

So, Now What?

Maybe this is just a glitch with the upgrade? I thought so too, so I applied a host profile from a known-good server. Guess what happened?

Once again, the vmknics hopped onto the wrong uplinks (this time VM2 / vmnic13).

Update! (November 11th, 2011)

Thanks to Brent Meadows (@bmead21) for opening a case with VMware on this issue. It has been officially recognized in KB 2008144, entitled Incorrect NIC failback occurs when an unused uplink is present.

The resolution, until this is patched, is as follows:

To work around this issue, do not have any unused NICs present in the team.

Thoughts

If you use explicit failover for your dvSwitch portgroups, be wary when upgrading. To fully fix things on my host, I had to go into the network configuration via the vSphere Client and add back the uplinks I had removed with the CLI. If you encounter this issue in the wild, please comment with any details or a resolution.
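
For completeness, the shell equivalent for adding an uplink back should look roughly like this, although I went through the vSphere Client (again, the DVPort ID is just an example):

  # Re-link vmnic7 to a free DVPort on dvSwitch-Main
  esxcfg-vswitch -P vmnic7 -V 289 dvSwitch-Main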

Also, if you are having trouble with software iSCSI, this bug can affect you as well.
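
If software iSCSI is in the mix, a quick way to see which vmkernel ports it is bound to, so you can check their uplinks the same way, is below (a sketch; adapter names vary per host, and I have not tested the iSCSI angle myself):

  # List the vmkernel ports bound to the iSCSI adapters on this host
  esxcli iscsi networkportal list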