I’ve spent a lot of time working on complex issues in virtualized environments where the most obvious symptom has nothing to do with the actual root cause. While there are some obvious issues that exist, such as overloading a host with workloads it isn’t able to handle, there are a lot of subtle ones out there, too. Maybe Bob, an application owner, just fired off an update to a few hundred of your VDI workloads and caused a massive boot storm – as a virtualization administrator, you realistically have limited control over events like this. It can be difficult to get to the root of the problem, track down Bob, and politely ask him to not do that again in the future. 🙂
I’ve been given the opportunity to work with a preview release of Xangati’s latest advancement in cloud monitoring, cleverly labeled StormTracker, which will come bundled with both the VDI and VI dashboards. In typical Xangati fashion, this product is not about the bells and whistles surrounding the interface, but instead about giving an administrator quick response times, concise information, and clear visibility into their environment. While it can’t completely take the place of solid cross-team collaboration (you should still work towards this goal), it does give you a lot of clues into the environment by using its automated pattern-matching heuristics to identify a storm’s cause and what was impacted.
Real-Time Monitoring Advantages
From my perspective, the primary issue with storms is that they are often quick and volatile. A boot storm isn’t going to wait around for an hour and give you an easy opportunity to analyze it; it will typically last 5 to 15 minutes and then go away. You may have all sorts of latency and application performance alarms going off, but it can be extremely hard to translate those into the root cause if you can’t get your hands on the problem data as it occurred. Xangati uses its “DVR-like” technology to provide a playback method to see all of the events as they happened.
The real-time advantage is further emphasized by a quote from Alan Robin, the CEO of Xangati:
“Assuring performance in a cloud environment requires a new approach backed by next generation architecture. Only Xangati with StormTracker is tackling the critical problem of performance storms impacting the evolution to the cloud. Our solution which is based on a novel in-memory computing architecture, continuously tracks all data center objects and their cross-silo interactions combing large scale clouds for storm patterns. Competing solutions are based on inherently sluggish database driven architectures which enable transient performance storms to fly under-the-radar and lead to devastating consequences in the data center.”
In the graphic below, I’m sharing a glimpse into the cloud view of StormTracker. Each “puff” is a piece of data that has been distilled down into a color. White clouds are happy, and dark clouds are suffering some sort of issue. When a dark cloud is clicked on, it glows red and highlights all related storms in red, too. In this particular graphic, a host is experiencing a storm that is related to storms on a particular datastore and interface.
The playback bar allows you to move back and forth through time in a seamless and real-time fashion
If you’ve used Xangati’s dashboard products, StormTracker works in a very similar fashion. You can double-click on any object on the screen to instantly be taken to an analysis screen for that item, which again includes a playback bar to see how the storm developed. Storms are rated on a variety of metrics, such as the duration, severity, average severity, and max severity (a few of these are valid only for repeat storms). These values can offer a quick way to perform triage when a number of events are occurring in a similar timeframe.
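To make the triage idea concrete, here is a minimal Python sketch of ranking storms by those metrics. It is not how StormTracker works internally; the `Storm` record, its field names, the 0-to-1 severity scale, and the sample values are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Storm:
    # Hypothetical record; field names and scales are illustrative only.
    name: str
    duration_min: float   # how long the storm lasted, in minutes
    avg_severity: float   # assumed 0.0 (harmless) to 1.0 (worst)
    max_severity: float   # assumed 0.0 (harmless) to 1.0 (worst)


def triage(storms):
    """Order storms worst-first: peak severity, then average, then duration."""
    return sorted(
        storms,
        key=lambda s: (s.max_severity, s.avg_severity, s.duration_min),
        reverse=True,
    )


# Invented sample data for three storms occurring in the same timeframe.
storms = [
    Storm("host-memory", duration_min=12, avg_severity=0.6, max_severity=0.9),
    Storm("datastore-latency", duration_min=5, avg_severity=0.4, max_severity=0.7),
    Storm("interface-drops", duration_min=15, avg_severity=0.5, max_severity=0.9),
]

ordered = [s.name for s in triage(storms)]
print(ordered)
```

Sorting on peak severity first is just one reasonable choice; if long-running storms hurt more in your environment, you could put duration first in the key instead.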
You can run a report to see what contention(s) occurred and what objects were affected (victims). As shown below, this particular storm relates to memory contention.
Looking at the storm, we see that there were four “victims” of the storm. Rather than stopping at the victims, StormTracker pinpoints the root cause.
In some situations, the storm is caused by a simple truth: you are overcommitted to the point where you need more hardware. Xangati’s StormTracker can analyze the storms and give recommendations based on capacity. It will reveal to you a “correlation value” of storm to capacity, which can help make a business case for more hardware.
As an example, let’s say you receive a lot of storm alerts for memory contention. The StormTracker report may correlate the data to prove that you’ve simply provisioned too many virtual machines for the amount of available physical RAM, and offer a graphical view of how things will continue if left alone. Here’s an example image of the trending graph.
It looks like this environment is on its way to running out of memory
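If you want a feel for how a memory trend like this can be projected, here is a rough Python sketch that fits a straight line to daily memory usage and estimates how many days remain before physical RAM runs out. This is my own simple least-squares approximation, not Xangati’s method; the function name, the daily sampling, and the numbers are assumptions for illustration.

```python
def days_until_exhaustion(daily_used_gb, capacity_gb):
    """Estimate days until usage reaches capacity, using a least-squares line.

    daily_used_gb: one memory-in-use sample per day, oldest first (assumed).
    Returns None if there are too few samples or usage is not growing.
    """
    n = len(daily_used_gb)
    if n < 2:
        return None

    xs = range(n)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_used_gb) / n

    # Ordinary least-squares slope and intercept.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, daily_used_gb))
    slope = sxy / sxx
    if slope <= 0:
        return None  # flat or shrinking usage never hits the ceiling

    intercept = mean_y - slope * mean_x
    day_full = (capacity_gb - intercept) / slope

    # Measure from the most recent sample; clamp if already over capacity.
    return max(0.0, day_full - (n - 1))


# Invented example: usage grows 10 GB per day on a host with 200 GB of RAM.
days = days_until_exhaustion([100, 110, 120, 130, 140], capacity_gb=200)
print(days)
```

A straight line is the crudest possible model, and real usage tends to grow in steps as new virtual machines are provisioned, but even this is often enough to start the conversation about buying more hardware.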
Not Just a Virtualization Admin’s Tool
Using role-based access, this is something you can (and should) hand out to the other teams. Unless you’re a jack-of-all-trades, multi-hat-wearing admin (and I know you guys are out there), it will be helpful to allow other teams to look at StormTracker to help correlate their own issues to what Xangati sees. The other teams can even turn off information they aren’t interested in, and it only affects their login account. As an example, your storage administrator can disable alerts on memory and CPU, since those typically won’t affect their view of storage alerts.
While the product doesn’t fully GA until the end of September, I’m sure you’ll be able to get a glimpse of the preview at VMworld (booth 2413), which is coming up very quickly. I’ve been using Xangati in my home lab since early 2012 and have been consistently impressed with the amount of data it can collect, especially around the use of NetFlow on my distributed switches. The really big advantage of this tool, for me, has always been the ability to do a real-time replay of data to get my hands on the actual issue as it occurred. I look forward to upgrading the home lab once this product releases officially.