Safely Performing Live Host Maintenance with vSphere DRS
I do a lot of maintenance-related work on VMware vSphere clusters. At its heart, this boils down to spending a fair bit of time putting hosts into maintenance mode and shifting virtual workloads around. The vSphere Distributed Resource Scheduler (DRS) makes most of these tasks trivial, as the cluster automatically selects where to place workloads without any real effort on my part.
In some scenarios it becomes necessary to perform maintenance in a way that still leaves a virtual workload on a host. I use the term “live host maintenance” to describe a host that is not in maintenance mode but is not yet ready for active, priority workloads. One example is testing various features of the host, such as running a load or network test. In this situation you cannot use maintenance mode, because the test workload has to run on the host, but you still do not want any other workloads landing there, especially if you are working on an issue. This is where your good friend Migration Threshold priority 1 comes in handy.
I never recommend disabling DRS if you have it enabled. There are many caveats to doing this, especially in environments that use resource pools or run another product that relies on resource pools, such as VMware vCloud Director. Disabling DRS will destroy your resource pools and cause plenty of headaches.
Migration Threshold Priority 1
I would imagine that there is some confusion around leveraging this strategy as two other automation levels exist: manual and partially automated. My issue with these modes is that they require acting on DRS migration suggestions. When you put a host in maintenance mode to evacuate it, you’ll have to wait for DRS to come up with a list of recommendations and then approve them. Perhaps this is what you want, but in the vast majority of my experience this is simply an annoyance.
I recommend using the fully automated level with the migration threshold set to priority 1. Let’s look over the GUI for some reasons why.
Some highlights from the DRS Automation Level page
According to the descriptions on that page, manual and partially automated only make suggestions and will not actually move running workloads. Only fully automated acts on suggestions without user intervention. When doing maintenance activities, I don’t want to add more work than necessary by having to accept migrations, especially if I’m scripting parts of the work.
Also notice that at priority 1, DRS only acts on recommendations needed to satisfy cluster-wide constraints and maintenance requests. In a typical environment, this means that virtual machines will migrate off a host that you wish to work on and not come back unless some specific DRS rule forces them to (such as a VM to Host requirement).
Quick Evacuation Tips
I recently went through this process when testing out a new networking configuration using the vSphere 5.1 LACP feature. Although everything reported as operational on both the host and network side, I wanted to try it with a test VM before committing my more vital workloads. The process is straightforward:
- Set the cluster’s vSphere DRS level to Fully Automated and Migration Threshold to priority 1.
- Select the host you wish to do work on and enter maintenance mode to force all of the VMs off. This is a shortcut to let the cluster balance the workloads over the rest of the hosts.
- Once finished, exit maintenance mode on that host.
- Migrate a test VM onto the host and perform your tests.
- When satisfied, revert the vSphere DRS level back to the original state. For my lab, that’s Migration Threshold priority 3 (default).
This is also scriptable from PowerCLI if your task requires automation, which is common for a repetitive event in a larger cluster.
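The steps above can be sketched in PowerCLI roughly as follows. This is a minimal sketch rather than a tested runbook: the vCenter, cluster, host, and test VM names are placeholders, and `Set-DrsVmotionRate` is a helper defined here for illustration, not a built-in cmdlet. It also assumes the API's `VmotionRate` value runs in the opposite direction to the client's priority slider (so 5 corresponds to priority 1, the most conservative setting); check the result in the client before relying on it.

```powershell
# Sketch only: assumes VMware PowerCLI is installed. All names below are placeholders.
$vCenter     = 'vcenter.example.local'
$clusterName = 'Lab-Cluster'
$hostName    = 'esx01.example.local'
$testVmName  = 'test-vm01'

Connect-VIServer -Server $vCenter | Out-Null

$cluster = Get-Cluster -Name $clusterName
$vmHost  = Get-VMHost  -Name $hostName

# Record the current DRS settings so they can be restored afterwards.
$origLevel = $cluster.DrsAutomationLevel
$origRate  = $cluster.ExtensionData.ConfigurationEx.DrsConfig.VmotionRate

# Hypothetical helper: Set-Cluster does not expose the migration threshold,
# so this goes through the vSphere API via ExtensionData.
function Set-DrsVmotionRate {
    param($Cluster, [int]$Rate)
    $spec = New-Object VMware.Vim.ClusterConfigSpecEx
    $spec.DrsConfig = New-Object VMware.Vim.ClusterDrsConfigInfo
    $spec.DrsConfig.VmotionRate = $Rate
    # $true = modify the existing configuration instead of replacing it
    $Cluster.ExtensionData.ReconfigureComputeResource($spec, $true)
}

# 1. Fully automated, most conservative threshold.
#    ASSUMPTION: VmotionRate 5 maps to priority 1 in the client.
Set-Cluster -Cluster $cluster -DrsAutomationLevel FullyAutomated -Confirm:$false | Out-Null
Set-DrsVmotionRate -Cluster $cluster -Rate 5

# 2. Enter maintenance mode; DRS evacuates the running VMs.
Set-VMHost -VMHost $vmHost -State Maintenance | Out-Null

# 3. Exit maintenance mode. At the conservative threshold, DRS should not
#    move workloads back just to rebalance.
Set-VMHost -VMHost $vmHost -State Connected | Out-Null

# 4. Place the test VM on the host.
Move-VM -VM (Get-VM -Name $testVmName) -Destination $vmHost | Out-Null

Read-Host 'Run your tests, then press Enter to restore the DRS settings'

# 5. Restore the original DRS settings.
Set-Cluster -Cluster $cluster -DrsAutomationLevel $origLevel -Confirm:$false | Out-Null
if ($null -ne $origRate) {
    Set-DrsVmotionRate -Cluster $cluster -Rate $origRate
}
```

Capturing the original automation level and threshold at the start means the final step restores whatever the cluster was using, rather than hard-coding the default.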
Most of the live host maintenance I need to do revolves around the network. I’d imagine that the new Network Health Check features in vSphere 5.1 will further reduce the need for it, but corner cases will always exist. Knowing how to quickly and effectively test your hosts before putting priority workloads on them is a key skill for managing a vSphere cluster. You should be familiar with the different ways to work with vSphere DRS non-invasively in order to perform maintenance activities successfully.
For more clustering details, check out this list of books on Duncan Epping’s Yellow-Bricks site.