vSphere 5 has brought a lot of really useful improvements to the table, and in my mind the new Fault Domain Manager (FDM) that replaced Legato's Automated Availability Management (AAM) is at the top of the list. The ability to determine failure state from both network and datastore heartbeats offers a lot of protection.
In this post, I'm going to simulate a network failure of a host to reveal the process that occurs when default cluster values are used. As you may recall, vSphere 5's default isolation response is "Leave powered on," whereas vSphere 4 used the "Shutdown Guest" option. I've also configured a pair of datastores for the storage heartbeat feature in vSphere HA.
To expedite the network piece, I'm using a pair of virtual (nested) ESXi 5.0u1 hosts with some small-footprint Linux guests.
The two hosts in my lab DR cluster, vESX93 and vESX94, are both virtual guests of my main cluster. They are configured with 6 NICs so as to better simulate a production host:
- vmnic 0 & 1: Management and vMotion
- vmnic 2 & 3: VM traffic
- vmnic 4 & 5: iSCSI Storage
The management vmkernel port (vmk0) on the DR-Mgmt dvSwitch uses vmnic0 and vmnic1.
The vSphere HA election process has determined that vESX94 is the master and vESX93 is the slave. Both hosts have equal access to all storage resources. If I had more hosts in the cluster, there would be multiple slaves but still one master.
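For the curious, the election itself is deterministic: the host with access to the greatest number of datastores wins, and ties are broken by the lexically highest host managed object ID (MOID). Here's a minimal Python sketch of that tie-break; the MOID strings and datastore counts are made up for illustration, not pulled from my lab:

```python
# Simplified sketch of the FDM master election tie-break:
# the host that sees the most datastores wins; ties are broken
# by the lexically highest managed object ID (MOID).
def elect_master(hosts):
    """hosts: list of (moid, hostname, datastore_count) tuples."""
    return max(hosts, key=lambda h: (h[2], h[0]))

cluster = [
    ("host-93", "vESX93", 4),  # hypothetical MOIDs and counts
    ("host-94", "vESX94", 4),
]

master = elect_master(cluster)
print(master[1])  # with equal datastore access, the higher MOID wins
```

With both hosts seeing the same four datastores, the lexically higher MOID carries the election, which lines up with vESX94 ending up as master in my lab.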
Additionally, each host is running a single VM guest.
- vESX93 is running Pineapple-01
- vESX94 is running Pineapple-02
Finally, here is a view of the storage heartbeats occurring on the two datastores named NAS1-ProdDR and NAS1-ISO. Note the heartbeat (hb) files and the poweron files, which contain a list of powered on VMs running on each host.
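Conceptually, the master treats each slave's hb file as a liveness signal: as long as the file's modification time keeps advancing, the host is alive even if network heartbeats have stopped. A hedged Python sketch of that freshness check follows; the function name and the timeout value are illustrative, not the actual FDM internals:

```python
import time

# Sketch of the master's datastore-heartbeat check: a slave is
# considered to be heartbeating if its hb file's modification time
# is recent. The 5-second threshold is an assumption for
# illustration, not the exact FDM timeout.
HEARTBEAT_TIMEOUT = 5.0  # seconds (assumed)

def is_heartbeating(hb_mtime, now=None):
    """hb_mtime: last modification time of the host's hb file."""
    now = time.time() if now is None else now
    return (now - hb_mtime) <= HEARTBEAT_TIMEOUT

now = 1_000_000.0
print(is_heartbeating(now - 2.0, now))   # file touched 2 s ago: alive
print(is_heartbeating(now - 60.0, now))  # host stopped updating it
```

The poweron files complement this: they tell the master which VMs the unreachable host still claims to be running, so it knows which VMs would need restarting elsewhere.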
Network Failure of vESX94 (Master)
To simulate a management network failure of vESX94, the vSphere HA master of the cluster, I’ll simply disable its two management uplinks.
Because the hosts are virtual machines, I can simulate the failure by simply disconnecting their management uplinks.
vSphere HA Process
The network failure is noticed almost immediately, and the cluster reports two different vSphere HA states in rapid succession.
As a reminder, the cluster first looks like this:
After the network failure, host vESX93 enters an election state because the vSphere HA agent on vESX94 (formerly the master) is not reachable. Feel free to insert some political jokes about elections here (I’m looking at you, Florida).
Now that vESX93 is the master, host vESX94 has reached a final state of “Network Isolated”.
The cluster now appears like this:
A Note on Isolation vs Partitioned States
vESX94 is isolated because it cannot reach any other hosts. Even though it can still see the storage heartbeat data, no election traffic is flowing to it from another host. A partitioned host, by contrast, can still exchange election traffic with at least one other host; it just can't reach the master's segment of the network.
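From the master's point of view, the combination of the two heartbeat channels determines what it concludes about a silent host. A simplified sketch of that decision logic is below; the real FDM protocol also involves ICMP pings and the note an isolated host leaves on a heartbeat datastore, which I collapse into a single boolean flag here:

```python
# Sketch of how a master classifies a host it can no longer reach
# over the network. "isolation_marker" stands in for the note an
# isolated host writes to a heartbeat datastore; a boolean is a
# simplification of the real FDM protocol.
def classify_host(network_heartbeat, datastore_heartbeat, isolation_marker=False):
    if network_heartbeat:
        return "alive"
    if not datastore_heartbeat:
        return "dead"          # master restarts its VMs elsewhere
    if isolation_marker:
        return "isolated"      # host sees no one, including the gateway
    return "partitioned"       # host is up but in another network segment

# vESX94 after the uplink failure: still writing datastore
# heartbeats, and it has declared itself isolated.
print(classify_host(False, True, isolation_marker=True))
```

Note the key protection the datastore heartbeat adds: without it, a host that merely lost its management network would look exactly like a dead host, and its perfectly healthy VMs would be restarted elsewhere.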
VM Not Restarted
Because the default isolation response is "Leave powered on," the VM is not touched. This is actually a good thing, as the VM is still running just fine.
As you can see here, Pineapple-02 is still reachable
If I change the cluster behavior to "Shutdown Guest," the VM is gracefully shut down and restarted on the remaining host.
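The isolation responses boil down to a simple mapping from policy to the action the isolated host takes on its VMs. A sketch, using the labels as they appear in this post (the dispatch itself is illustrative, not VMware code):

```python
# Sketch of the vSphere HA isolation responses and what the isolated
# host does with its running VMs; for the two "stop" options, the
# surviving master then restarts the VMs on another host.
def isolation_response(policy):
    actions = {
        "Leave powered on": "do nothing; VM keeps running on the isolated host",
        "Shutdown Guest": "graceful guest shutdown, then HA restart elsewhere",
        "Power off": "hard power off, then HA restart elsewhere",
    }
    return actions[policy]

print(isolation_response("Shutdown Guest"))
```

"Shutdown Guest" trades restart speed for a clean guest OS shutdown; "Power off" is faster but is the equivalent of pulling the plug.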
Here you can see that the VM is now running on vESX93 and shows a vSphere HA restart in the events log.
Hopefully this has given you a good glimpse into what occurs during a failure on a vSphere HA enabled cluster. Trying to figure this out during an outage is a rough trial by fire, so hopefully you also get a chance to play with this on VMware Workstation, in your own lab, or at a hands-on lab elsewhere. A nod to the cluster experts, Duncan Epping and Frank Denneman, for providing really solid documentation on this process in their technical deepdive book and online at the Yellow-Bricks site.