Big Improvements to vSphere HA and DRS in 6.5 Release

Two of the bread-and-butter features of VMware vSphere are High Availability (HA) and the Distributed Resource Scheduler (DRS). With them, a collection of independent vSphere hosts starts to look and act like a real cluster. If a host fails, HA makes sure the VMs that were running on it are powered on elsewhere in the cluster. If a host gets loaded to the point that VM entitlements are not met, DRS moves VMs around so they get the resources they demand. These are complex technologies that we take for granted today – but upon release they were revolutionary.

In this post, I cover several of the enhancements to HA and DRS that will be found in vSphere 6.5, along with some of my thoughts and snark on them. 🙂

HA – Admission Control

Admission Control is a feature that watches over the vSphere Cluster to ensure that ample capacity remains in the event of a failure. Specifically, it blocks power-on operations for VMs that would eat into the resources the cluster needs to hold back for failover. There are many ways to do that – slots were a popular method in the early days, and then percentages of resources, with a standby failover host being the least favorite.

In vSphere 6.5, Admission Control is being simplified and more smarts are being baked into the feature. You now select the number of host failures the cluster will tolerate, and the resources to reserve are calculated from the size of the cluster. This is long overdue and was previously solved with scripting. If desired, you can tweak a few nerd knobs, but it shouldn’t be something you do unless there is a clearly defined functional design requirement that isn’t being met with the automatic setting.

[Image: ac-settings-65]
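To make the arithmetic concrete, here is a minimal Python sketch of the idea, not VMware's actual algorithm. It assumes every host in the cluster is identical, so tolerating k failures out of n hosts means holding back k/n of the capacity. The function names and the single "capacity" number are my own simplifications for illustration.

```python
def failover_reserve_percent(num_hosts: int, host_failures_to_tolerate: int) -> float:
    """Percent of cluster capacity to hold back for failover.

    Assumption: all hosts are the same size, so each host contributes
    an equal share of the cluster's capacity.
    """
    if num_hosts < 1:
        raise ValueError("a cluster needs at least one host")
    if not 0 <= host_failures_to_tolerate < num_hosts:
        raise ValueError("failures to tolerate must be between 0 and num_hosts - 1")
    return 100.0 * host_failures_to_tolerate / num_hosts


def can_power_on(vm_demand: float, used: float, total: float, reserve_percent: float) -> bool:
    """Admit the power-on only if it leaves the failover reserve untouched."""
    usable = total * (1 - reserve_percent / 100)
    return used + vm_demand <= usable
```

With four identical hosts and one tolerated failure, a quarter of the cluster is held back. Grow the cluster to eight hosts and the same setting drops the reserve to 12.5 percent without anyone touching a script.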

HA – Orchestrated Restart

Another overdue feature is the ability to use SRM-like orchestration for HA restart operations. If you’re not familiar with SRM (Site Recovery Manager), it has a slick orchestration wizard for ordering VMs so that applications come up in a specific sequence – such as starting with a database, then middleware applications, then web servers.

[Image: orchestrated-restart]

This is controlled using VM Groups and VM Rules. Groups let you model applications, or tiers of an application, by logically grouping VMs together. You could potentially put all of your database VMs into one group, and then put the middleware / message queue / other VMs into a second group.

[Image: vm-groups]

You can then make the second group dependent on the first. I can see this being handy because it’s cluster-wide, looks easy to script and define through non-manual processes, and would be nice to have baked in automatically with third-party applications (such as when importing an OVA or group of OVAs). It further extends what can be done for distributed applications.

[Image: vm-rules]
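The group-depends-on-group model is really just a dependency graph. Here is a rough Python sketch of how a restart order could be derived from it. This is my own illustration using the standard library's `graphlib` (Python 3.9+), not how HA implements it, and the group names are made up.

```python
from graphlib import TopologicalSorter


def restart_order(depends_on: dict[str, set[str]]) -> list[list[str]]:
    """Return restart "waves": each wave lists the groups that may start together.

    depends_on maps a VM group to the groups that must be up before it.
    Raises graphlib.CycleError if the rules form a loop.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only to keep output stable
        waves.append(ready)
        sorter.done(*ready)
    return waves
```

Feeding it `{"middleware": {"database"}, "web": {"middleware"}}` yields three waves: database first, then middleware, then web. Groups with no dependency between them land in the same wave and could be restarted in parallel.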

Proactive HA

The final HA feature that I really enjoyed learning about was Proactive HA. In this case, the OEM hardware vendor can trigger alerts and have them picked up by HA. These failures may not have an immediate impact on the host – such as a single drive failure in a mirror or a single DIMM starting to flake out – but should be triaged sooner rather than later. In these cases, the VMs running on the host can be evacuated in anticipation of a failure.

This seems handy for those running a large-scale data center in which the SMART data or embedded sensors go deep enough to really understand what’s going on with the system. It helps avoid those “oops” moments where a hardware failure time bomb is ticking, no one is watching the alert, and then suddenly a bunch of VMs are down and restarting onto another host. Avoiding downtime due to hardware failure seems like a good thing for applications that aren’t built in a distributed fashion.
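The difference between reacting to a dead host and getting ahead of a sick one can be sketched in a few lines of Python. The three health states and the action names below are hypothetical labels of mine, not the states a vendor plugin actually reports.

```python
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # still running, e.g. one mirror drive lost or a flaky DIMM
    FAILED = "failed"      # host is down; this is regular HA territory


def plan_actions(health_by_host: dict[str, Health]) -> dict[str, list[str]]:
    """Sort hosts into proactive evacuations and reactive restarts."""
    plan = {"evacuate": [], "restart_elsewhere": []}
    for host, health in sorted(health_by_host.items()):
        if health is Health.DEGRADED:
            plan["evacuate"].append(host)           # move VMs off while the host still works
        elif health is Health.FAILED:
            plan["restart_elsewhere"].append(host)  # VMs already went down with the host
    return plan
```

The point of the sketch is the first branch: VMs on a degraded host get moved while they are still running, so they never take the restart hit that the second branch implies.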

DRS – New Policies

This one made me chuckle a little, because many others and I have suggested a number of these features and were noticeably named and shamed by VMware employees.

First up is the idea of distributing the VM workloads more evenly across a cluster. It used to really drive me wild to see a few hosts running at (or near) 99% utilization while a few other hosts would be nearly idle. I even wrote a few PowerShell scripts to help mitigate this because maintenance mode would take forever when it hit those super loaded hosts (and never seemed smart enough to start with the lightly loaded hosts).

[Image: drs-policies]

Additionally, you can tweak the memory settings (active vs consumed) and even account for CPU over-commitment. These are all things that I think a solid VMware administrator would do using various calculations and scripts. With vSphere 6.5, they are automated. Huzzah?

Network Aware DRS

The final DRS improvement is the ability to factor network utilization into DRS calculations. As you may already know, network and disk IO are largely ignored when balancing resources. It sort of makes sense – network IO is a bit bursty and ephemeral in nature, unless there’s some sort of elephant flow going through the network. The Network Aware enhancement means that DRS will now eyeball the host network saturation via the transmit and receive values on the host network adapter. This should help avoid an oversubscription of network bandwidth, but it will remain a lower priority than other resources such as CPU and memory.
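Here is one way to picture "network counts, but less than CPU and memory" as a Python sketch. Everything in it is an assumption for illustration: the 80 percent saturation threshold, the field names, and the choice to rank hosts by the worse of CPU and memory. Saturated hosts are skipped when possible, but network never blocks a placement outright.

```python
from dataclasses import dataclass


@dataclass
class Host:
    name: str
    cpu_pct: float
    mem_pct: float
    net_tx_mbps: float
    net_rx_mbps: float
    nic_capacity_mbps: float

    @property
    def net_pct(self) -> float:
        # Assumes a full-duplex NIC, so the busier direction sets saturation.
        return 100 * max(self.net_tx_mbps, self.net_rx_mbps) / self.nic_capacity_mbps


def pick_destination(hosts: list[Host], net_saturation_pct: float = 80.0) -> Host:
    """Pick the least loaded host, avoiding network-saturated ones when possible."""
    if not hosts:
        raise ValueError("no candidate hosts")
    unsaturated = [h for h in hosts if h.net_pct < net_saturation_pct]
    # Network is the lower priority: if every host is saturated, ignore it.
    pool = unsaturated or hosts
    return min(pool, key=lambda h: max(h.cpu_pct, h.mem_pct))
```

A host at 20 percent CPU but pushing 900 Mbps through a 1 Gbps adapter loses out to a busier host with a quiet network. If every host is saturated, the sketch falls back to plain CPU and memory ranking.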