Troubleshooting is a difficult task. It is often accompanied by stressful situations such as an outage in production, loss of revenue, and possible career limitations. Those who can think under pressure and troubleshoot complex issues are highly valued by organizations.
I’ve been a longtime fan of a method I call Sphere Elimination. It is a way to visualize a complex problem and distill it down to mental images in the form of spheres. Each sphere represents a component within the stack as it relates to the problem. The spheres overlap with other spheres, requiring you to test and eliminate the possible culprits until only the real root cause remains.
Identifying the Spheres
To better explain this process, let’s imagine that you’ve been asked to troubleshoot an issue. In this fictitious scenario, an administrator has noticed that some of the NFS datastores mounted to your pair of ESXi hosts are randomly lost and then, after a short period of time, come back online. You’re given full admin-level access to the environment.
Mentally, you create a sphere for each component that could be involved:
- The ESXi hosts
- The storage array
- The physical network interfaces
- The physical and virtual network switches
- The VMkernel ports
- The network subnet
- The virtual machines
- … etc.
There are quite a few components and configuration items that may be the culprit. They form a mesh of spheres.
This is a start to the troubleshooting. We have a rough idea as to what might be causing the issue and can begin eliminating spheres.
There are many different ways to tackle this problem. I would likely want to see which hosts are having the issue before diving deeper. Setting up a pair of spheres to represent the two ESXi hosts, we can see if the issue exists on just one host – which is the red dot with a 1 – or on both hosts – which is the red dot with a 2.
- If just one host has the issue, we can eliminate the yellow sphere and focus on the blue sphere.
- If both hosts have the issue, we need to move on to different spheres.
In this troubleshooting scenario, you determine that only Host A – the blue sphere – is suffering NFS disconnects. Perhaps you checked the logs or simply saw the datastore disconnect while watching the GUI. Either way, we can safely remove the yellow sphere for Host B because the problem is related to Host A. The problem is now less complex.
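The host comparison above can be sketched as set arithmetic. This is only an illustration of the reasoning, not a tool from the scenario: the component names and the `suspects` helper are hypothetical, and a component shared with a healthy host is merely deprioritized, not proven innocent.

```python
# Hypothetical model: each host "sphere" is the set of components it depends on.
spheres = {
    "host-a": {"host-a", "vmk-nfs-a", "nic-a", "switch-1", "nfs-subnet", "array"},
    "host-b": {"host-b", "vmk-nfs-b", "nic-b", "switch-1", "nfs-subnet", "array"},
}


def suspects(spheres, affected):
    """Return components inside every affected sphere and outside every healthy one."""
    if not affected:
        return set()
    affected_sets = [spheres[name] for name in affected]
    healthy_sets = [parts for name, parts in spheres.items() if name not in affected]
    common = set.intersection(*affected_sets)
    return common.difference(*healthy_sets)


# Only Host A drops its datastores: focus on what is unique to Host A.
only_a = suspects(spheres, ["host-a"])

# Both hosts drop their datastores: focus on the overlap instead.
both = suspects(spheres, ["host-a", "host-b"])
```

With only Host A affected, the shared switch, subnet, and array drop out of the suspect list because Host B uses them without trouble. With both hosts affected, the shared components are what remain.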
Further comparisons will eliminate other spheres. Examples of spheres:
- Spheres of Datastores: Do all of the NFS datastores disappear or just specific datastores? If it’s all the datastores, you’ve eliminated any spheres related to the datastore and export configuration.
- Spheres of Physical Ports: Does it matter if you move the physical network connection to a different port? If not, you’ve eliminated the possibility of a bad port. Swapping the cable as well rules out a bad cable (although that’s rare anyway).
- Spheres of Configuration: Are there any differences in the physical and logical configuration on the switch port? If not, you may want to keep looking at the ESXi configuration.
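The three comparisons above behave like a small decision table: each question has one answer that clears a sphere. Here is a minimal sketch of that idea. The question keys, sphere labels, and the `remaining_spheres` helper are names made up for illustration, and an unanswered question clears nothing.

```python
# Each entry: (question key, answer that clears the sphere, sphere that is cleared).
CHECKS = [
    ("all_datastores_drop", True, "datastore and export configuration"),
    ("moving_port_changes_behavior", False, "physical connection"),
    ("switch_port_config_differs", False, "switch port configuration"),
]


def remaining_spheres(candidates, answers):
    """Drop every sphere whose clearing answer was observed; keep the rest in order."""
    remaining = list(candidates)
    for question, clearing_answer, sphere in CHECKS:
        if answers.get(question) == clearing_answer and sphere in remaining:
            remaining.remove(sphere)
    return remaining


candidates = [
    "datastore and export configuration",
    "physical connection",
    "switch port configuration",
    "ESXi configuration",
]

# Observations gathered so far; the switch port comparison has not been done yet.
answers = {"all_datastores_drop": True, "moving_port_changes_behavior": False}
left = remaining_spheres(candidates, answers)
```

Writing the checks down this way also makes it obvious which questions are still unanswered, and therefore which spheres are still in play.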
The Root Sphere
Because we’ve eliminated the yellow sphere for Host B, all future spheres must include something to do with Host A. This reduces the “surface area” of the issue because no matter what, the problem is directly impacting one specific host.
In fact, when you look at the configuration of Host A, you end up finding an IP conflict on the VMkernel port used for NFS traffic. This could also have been found when looking at the ESXi logs and seeing a duplicate IP warning.
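If you keep an inventory of the addresses in use on the NFS subnet, a duplicate like this one can also be spotted with a few lines of code. The sketch below assumes a hand-built list of (IP address, owner) pairs. The records, the owner labels, and the `find_ip_conflicts` helper are hypothetical, and the addresses come from the 192.0.2.0/24 documentation range rather than from the scenario.

```python
from collections import defaultdict


def find_ip_conflicts(records):
    """Map each IP address claimed by more than one owner to its sorted owners."""
    owners = defaultdict(set)
    for ip, owner in records:
        owners[ip].add(owner)
    return {ip: sorted(names) for ip, names in owners.items() if len(names) > 1}


# Hypothetical inventory of the NFS subnet.
records = [
    ("192.0.2.10", "host-a vmk1"),
    ("192.0.2.11", "host-b vmk1"),
    ("192.0.2.50", "storage array"),
    ("192.0.2.10", "unknown device"),
]

conflicts = find_ip_conflicts(records)
for ip, names in conflicts.items():
    print(f"{ip} is claimed by: {', '.join(names)}")
```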
Complex problems can always be distilled into smaller segments with simpler decision points. By working through a series of spheres and eliminating any that do not overlap with the problem, you can quickly get down to the root of an issue and home in on the real problem.
I’d also advise starting with the simple, easy checks: things that you can troubleshoot with little effort or in the least amount of time.
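One simple way to apply that advice is to list the candidate checks with a rough effort estimate and work through them from cheapest to most expensive. The checks and minute values below are invented for illustration, not measurements.

```python
# Hypothetical checks with rough effort estimates in minutes.
checks = [
    ("Compare switch port configuration", 30),
    ("Watch the datastore state in the GUI", 5),
    ("Move the uplink to a different physical port", 45),
    ("Read the ESXi logs for warnings", 15),
]

# Cheapest checks first.
plan = sorted(checks, key=lambda check: check[1])

for name, minutes in plan:
    print(f"{minutes:>3} min  {name}")
```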