NUMA Woes – Battling High CPU %RDY on vSphere 4

I recently started seeing a number of alerts on one of my clusters reporting that many VMs had a high CPU %RDY. This struck me as weird, as the vSphere hosts are beefy HP DL580 G7s loaded with 4 sockets of Intel X7542 processors and there aren’t many VMs running. Typical CPU usage was under 20% per host. How could the VMs be contending for CPU resources in this situation?

The Reliable ESXTOP To The Rescue

Notice the high amount of usage of the 4th socket and the large amount of %RDY.

Digging into ESXTOP revealed that there wasn’t all that much %USED going on. A few VMs were doing a lot of work, but not enough to tax the host. Then I noticed that all of the load was focused on the 4th socket (CPU3). The other 3 sockets were relatively unused. I pinged Twitter about this phenomenon and got a quick reply back from superstar Matt Liebowitz.

Matt to the rescue!

I didn’t know that much about NUMA (Non-Uniform Memory Access), beyond the fact that you should have it on to increase performance and that it was a way to keep a VM on the same socket as the memory that serves it. The hypervisor schedules a VM’s vCPUs on the same physical socket and uses memory that the socket has direct access to, for better performance. Frank Denneman has a really good writeup on this topic that you should visit. There’s also a VMware KB entitled “esxtop command showing high %RDY values for virtual machines running on NUMA enabled ESX/ESXi 4.1 hosts” that discusses seeing %RDY with NUMA enabled.

Diving Into NUMA Statistics

At any rate, it seemed that for some reason, my host was consistently picking NUMA node 3 (CPU3) for VMs, which was causing contention. I tried to vMotion the VMs to another host, but they stayed on node 3. When I vMotioned them back to the original host, again, node 3. Node 3 was being hammered with utilization, but the host kept insisting on using node 3.

To troubleshoot this issue, I manually set a CPU core affinity on a few VMs to verify that it fixed the %RDY issue (which it did), but then reverted it, as that sounds like a scary box of nastiness to use in practice. I only did it to ensure that the other sockets could in fact be used (they could). I don’t advise leaving manual settings like this on a VM, as they can be tricky to find later.

NHN = NUMA home node. N%L is the node locality percentage, which indicates how much of a VM’s memory is being served from its home node. All of the heavy hitter VMs are on NHN 3.
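If you want to spot this kind of pile-up without eyeballing the screen, the idea can be sketched in a few lines of Python. This is only a rough sketch: the rows below are made-up sample values (not my real ESXTOP output), the tuple layout and the function names are my own invention, and the 10% threshold is just an assumed cutoff.

```python
from collections import defaultdict

# Hypothetical sample rows, NOT real esxtop output: (vm_name, NHN, N%L, %RDY)
SAMPLE_ROWS = [
    ("vm-app01", 3, 100.0, 24.5),
    ("vm-app02", 3, 98.0, 21.0),
    ("vm-db01", 3, 100.0, 26.5),
    ("vm-web01", 0, 100.0, 1.0),
]


def summarize_by_home_node(rows):
    """Return {home_node: (vm_count, mean_ready_percent)}."""
    grouped = defaultdict(list)
    for _name, home_node, _locality, ready in rows:
        grouped[home_node].append(ready)
    return {
        node: (len(values), sum(values) / len(values))
        for node, values in grouped.items()
    }


def crowded_nodes(rows, ready_threshold=10.0):
    """List home nodes whose mean %RDY exceeds the (assumed) threshold."""
    summary = summarize_by_home_node(rows)
    return sorted(
        node
        for node, (_count, mean_ready) in summary.items()
        if mean_ready > ready_threshold
    )


if __name__ == "__main__":
    summary = summarize_by_home_node(SAMPLE_ROWS)
    for node, (count, mean_ready) in sorted(summary.items()):
        print(f"NHN {node}: {count} VMs, mean %RDY {mean_ready:.1f}")
    print("Crowded nodes:", crowded_nodes(SAMPLE_ROWS))
```

Grouping by NHN rather than by VM is the point here: each VM on its own just looks busy, and the pattern only shows up once you count how many of them share a home node.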

The Resolution

It turns out that this is a documented issue in vSphere 4.

To avoid performance problems, the ESX/ESXi NUMA scheduler is coded to initially place virtual machines on a node where the most memory is available. If the nodes have very little memory, the scheduler uses a round robin algorithm to place virtual machines onto NUMA nodes. This typically only occurs when powering on a virtual machine.
With vMotion, the copy of the memory happens before the worlds are completely initiated. A single node is chosen as the home node to preserve memory locality and virtual machines are all placed on the single node. Performance issues can occur if there is a light workload running on the ESX/ESXi host. Under light workloads, NUMA node migrations do not occur as there is not enough load to warrant a migration.
To work around this issue, disable NUMA on the system by enabling Node Interleaving from the BIOS of the server.
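To make the power-on placement rule in that KB text a bit more concrete, here is a toy model of it in Python. It is my own reading of the description, not VMware’s scheduler code: the function names, the “very little memory” cutoff of 1024 MB, and the sample sizes are all assumptions, and it does not model the vMotion path at all.

```python
def pick_home_node(free_mb, low_memory_mb, rr_index):
    """Pick a home node for a VM that is powering on.

    free_mb: free memory per NUMA node, in MB.
    low_memory_mb: assumed cutoff for "very little memory" on a node.
    rr_index: next round-robin position, used only when every node is low.
    Returns (node, next_rr_index).
    """
    if all(free < low_memory_mb for free in free_mb):
        # Every node is short on memory: fall back to round robin.
        return rr_index % len(free_mb), rr_index + 1
    # Otherwise prefer the node with the most free memory (first wins ties).
    best = max(range(len(free_mb)), key=lambda i: free_mb[i])
    return best, rr_index


def place_vms(vm_sizes_mb, free_mb, low_memory_mb=1024):
    """Place VMs one at a time, charging each VM's memory to its node."""
    free = list(free_mb)
    rr_index = 0
    placements = []
    for size in vm_sizes_mb:
        node, rr_index = pick_home_node(free, low_memory_mb, rr_index)
        free[node] = max(0, free[node] - size)
        placements.append(node)
    return placements


if __name__ == "__main__":
    # Plenty of memory: each power-on lands on the emptiest node.
    print(place_vms([4096] * 4, [8192] * 4))
    # Very little memory everywhere: placement falls back to round robin.
    print(place_vms([256] * 5, [512] * 4))
```

In this model VMs spread out nicely because each power-on sees the memory already claimed by the previous one. The KB’s point is that vMotion does not go through this path, so the VMs all end up on a single node.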

I was told by a VMware technical support representative that this is verified fixed in vSphere 4.1 Update 2, which has not been released at the time of this article.

Update 9/27/2011: Disabling NUMA (Enabling Node Interleaving)

To disable NUMA you must enable Node Interleaving. The HP BIOS gives a clear warning that this may/will impact performance, but I doubt it will hurt as badly as CPU %RDY values over 20% on multiple VMs.

In this HP BIOS pic, choose Advanced Options > Advanced Performance Tuning Options > Node Interleaving > Enabled.

The resulting ESXTOP screen looks much healthier for the same workload.

Much better CPU %RDY values at the cost of increased memory access times.

Update 11/02/2011: Confirmed vSphere 4.1u2 Resolves Issue

I’ve recently installed 4.1u2 on one of the hosts shown above and confirmed that the VMware ESXi hypervisor is properly balancing the VMs across different nodes.

The VMs are balanced across multiple nodes (NHN) and showing good locality (N%L)

Thoughts

I’ve disabled NUMA on my large scale-up boxes that do not have significant CPU load to get rid of the high CPU %RDY issue. With no specific ETA on the 4.1u2 fix, I figure it’s better to eat a slightly higher memory access time than to have double digit %RDY on multiple VMs all day. I have not yet tested this on vSphere 5 – perhaps the scheduler has been made more intelligent?

This case also shines a light on the power of “professional” social networking, as a number of smart folks on Twitter really came through with comments and suggestions to help me target the problem.

Of course some of the comments were just goofy and fun! Thanks to everyone on Twitter for their help. :)