Nimble Storage keeps a copy of the required block metadata in the flash tier, with a backup copy on spinning disk. Once the system reaches 50% capacity, a low priority task continuously looks for heavily fragmented data stripes. When it identifies them, it attempts to write new stripes by combining a heavily fragmented stripe with new data or another heavily fragmented stripe. I happened to guess correctly that they use a circular logging system, which wears the system evenly. If for some reason the array hits 95% capacity, the system boosts the garbage collection task to a higher priority. I was told that this has only occurred in a situation where a very unrealistic write test was being performed.
This circular system avoids any need to perform re-balancing when a new shelf is added. Traditionally, a re-balance ensures that all of the newly available disks are used for performance. Due to the architecture of CASL, there is no performance difference when additional disks are added – new data writes are serialized to disk in large stripes, and data reads are either served from NVRAM, DRAM, or SSD, or result in a cache miss and retrieval from disk.
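To make the garbage collection behavior a bit more concrete, here is a minimal Python sketch of how I picture it. Only the 50% and 95% thresholds come from what I was told; the `Stripe` class, the `compact` function, and the fragmentation cutoff of 0.5 are my own assumptions for illustration, not Nimble's implementation.

```python
from dataclasses import dataclass
from typing import List

GC_START_UTILIZATION = 0.50  # from the briefing: GC starts scanning at 50% capacity
GC_BOOST_UTILIZATION = 0.95  # from the briefing: GC priority is raised at 95% capacity


def gc_priority(used_bytes: int, total_bytes: int) -> str:
    """Return a (hypothetical) priority label for the garbage collection task."""
    utilization = used_bytes / total_bytes
    if utilization >= GC_BOOST_UTILIZATION:
        return "high"
    if utilization >= GC_START_UTILIZATION:
        return "low"
    return "idle"


@dataclass
class Stripe:
    live_blocks: List[str]  # blocks that are still referenced
    capacity: int           # number of block slots in a full stripe

    @property
    def fragmentation(self) -> float:
        # Fraction of the stripe that no longer holds live data.
        return 1 - len(self.live_blocks) / self.capacity


def compact(stripes: List[Stripe], threshold: float = 0.5) -> List[Stripe]:
    """Rewrite the live blocks of heavily fragmented stripes into fresh stripes.

    The threshold is an assumed value, chosen only for this example.
    """
    keep = [s for s in stripes if s.fragmentation < threshold]
    victims = [s for s in stripes if s.fragmentation >= threshold]
    if not victims:
        return keep
    capacity = victims[0].capacity
    live = [block for s in victims for block in s.live_blocks]
    rewritten = [
        Stripe(live[i:i + capacity], capacity)
        for i in range(0, len(live), capacity)
    ]
    return keep + rewritten
```

For example, two stripes holding one and two live blocks out of four would be combined into a single new stripe holding three, which frees a full stripe for the next round of serialized writes.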
Scaling Out and Up
This is the part I found most interesting. Most systems are super slick to install and configure, but become a huge pain in the rear when an upgrade or expansion is required. I recall many weekends spent in the data center doing forklift storage upgrades (replacing the controllers) because that is the only way to scale up and out.
Let’s start with scaling up with a beefier controller or SSD. Nimble Storage controllers can be removed and replaced while in production. Swap out the passive controller for a beefier model, fail over to the new controller, and then swap out the formerly active controller. That’s really cool! The same principle holds true for individual SSDs – one can swap them out (one at a time) for larger sizes.
When scale out is required, the solution is to add shelves. Once the maximum number of shelves (3) has been reached, Nimble Storage allows for a federation of storage controllers via their Nimble Connection Manager. This requires a VIB to be installed on the vSphere hosts to handle the PSP (Path Selection Policy) that directs writes to multiple controllers. Think of it as similar to the Round Robin PSP, but with storage arrays instead of adapters on a single array.
This kernel level module also handles look-ups for data reads by way of a mapping table, which is provided by a master node (called the Group Leader). This is simply a role that one of the controllers holds, and it can be passed to another controller in a failure situation – they call this “passing Leadership.”
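As a rough mental model of the write and read paths just described, here is a toy Python sketch. The `GroupPathSelector` class and its method names are hypothetical, and I keep the mapping table inside the selector for simplicity, whereas in the real product it is provided by the Group Leader. This is not how Nimble Connection Manager is actually written.

```python
from itertools import cycle


class GroupPathSelector:
    """Toy model: round-robin writes across arrays, table-driven reads."""

    def __init__(self, arrays):
        self._next_array = cycle(arrays)  # endless round-robin over member arrays
        self.mapping_table = {}           # block address -> array that holds it

    def write(self, block_address):
        # Pick the next array in rotation and remember where the block landed.
        array = next(self._next_array)
        self.mapping_table[block_address] = array
        return array

    def read(self, block_address):
        # Look up which array holds the block.
        # Raises KeyError for an address that was never written.
        return self.mapping_table[block_address]


selector = GroupPathSelector(["array-a", "array-b"])
placements = [selector.write(address) for address in range(3)]
```

With two arrays in the group, three consecutive writes alternate between `array-a` and `array-b`, and a later read of any block goes straight to whichever array the table says holds it.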
Support and Monitoring with InfoSight Engine
The final piece I’ll touch on, and one that raised an eyebrow, is the InfoSight Engine, which I learned about in a discussion with Larry Lancaster, Chief Data Scientist at Nimble Storage. Using data gathered from all of the various Nimble arrays deployed for customers, the team has written all sorts of analytics to crunch the data and recognize issues or trends. I’ve seen other similar solutions that are used for health monitoring, such as NetApp’s AutoSupport or EMC’s ESRS, but those are typically limited to break/fix or capacity planning events. The Nimble Storage team talked about how they’ve leveraged this data warehouse to help push out code improvements and pinpoint specific ways to improve efficiency for their systems. The system owner can also log into a secure portal to view their relevant data and set up alerts on storage health or capacity concerns.
Data is sent to the InfoSight warehouse over encrypted HTTPS POSTs. Involvement with InfoSight is opt-in, and the team tells me that well over 90% of customers choose to send in data. Customers can get some very deep insight into the array’s performance, trending both capacity and performance with the forecasting engine. The ability to gather and send in data points is baked directly into the array – there is no need to create a server elsewhere to do this. By logging into the InfoSight Portal, an administrator can address any technical concerns and an executive can view dashboards to figure out what future spends are upcoming to meet array demand. InfoSight will literally show you performance trends and give you a suggestion as to what tweaks need to be done to meet demand – such as scaling up to a larger cache size, a larger SSD size, or more trays of capacity.
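I don’t know what models the forecasting engine uses, but the basic idea of capacity trending can be illustrated with a simple straight-line fit. The Python sketch below is purely my own illustration: the `days_until_full` function and the sample numbers are hypothetical, and InfoSight’s real analytics are surely more sophisticated.

```python
def days_until_full(samples, capacity_tib):
    """Estimate days from the last sample until the array is full.

    samples: list of (day, used_tib) pairs covering at least two distinct days.
    Returns None when usage is flat or shrinking.
    """
    n = len(samples)
    mean_day = sum(day for day, _ in samples) / n
    mean_used = sum(used for _, used in samples) / n
    # Ordinary least-squares slope and intercept for used = slope * day + intercept.
    sxx = sum((day - mean_day) ** 2 for day, _ in samples)
    sxy = sum((day - mean_day) * (used - mean_used) for day, used in samples)
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_used - slope * mean_day
    full_day = (capacity_tib - intercept) / slope
    return full_day - samples[-1][0]


# Made-up readings: 10 TiB used on day 0, 12 TiB on day 30, 14 TiB on day 60.
estimate = days_until_full([(0, 10.0), (30, 12.0), (60, 14.0)], capacity_tib=20.0)
```

With those made-up readings, usage grows by 2 TiB every 30 days, so a 20 TiB array would fill up roughly 90 days after the last sample. That is the kind of number that would prompt a conversation about adding another tray of capacity.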
I’ve done a decent bit to try and scratch the surface of what I learned about the Nimble Storage product. I’m still waiting to play with replication, snapshots, and clones – although I’m assured they are in the system. Overall I am impressed with the amount of thought that has gone into the Nimble array, especially around the data path for writes and the global monitoring with InfoSight. And while I’m still on shaky ground with iSCSI, at least I’ve found one more system that strives to challenge my crusty ways and try new things. 🙂
On the plus side, iSCSI offers the ability to use block storage, which is a requirement for some legacy workloads (e.g. Exchange). It lets the administrator take advantage of VMFS, use multipathing to the storage array, and build out a (potentially) familiar SAN architecture. Keep in mind, though, that a great array can’t fix a poor network. Ethernet is inherently a lossy protocol. For a unified fabric configuration, you’ll need to find a way to leverage something like priority-based flow control and class of service to protect the storage traffic with both iSCSI and NFS. Another option is to use an isolated network or a DAS model with something like a Cisco UCS Appliance Port – which is fully supported by Nimble Storage and Cisco via their SmartStack architecture.