vSphere 5.5 Improvements Part 6 – Site Recovery Manager (SRM) and vSphere Replication

It’s been really interesting to watch the evolution of vCenter Site Recovery Manager (SRM). What once was focused entirely on the management and orchestration of LUN level replication, via a vendor specific Storage Replication Adapter (SRA), has steadily shifted its momentum toward vSphere Replication technologies. This makes a ton of sense, as it’s painfully obvious that LUNs continue to lack the agility and functionality necessary to support an abstracted data center model. (To be fair, I severely dislike both LUNs and RAID, so I’m biased here.)

However, SRM has typically lacked the robust features necessary to compete with some other DR focused products out there, primarily Zerto, due to its extremely hands-off approach to pushing around data. I’ve rarely seen a client adopt vSphere Replication (VR) over LUN replication, mostly due to the stigma of VR being an inefficient approach. It’s hard to disagree with that.

VMware is aiming to change this with the introduction of a new release for SRM, including a lot of sexy new features for vSphere Replication. As you’ll soon find out, there’s a lot more skin in the game around handling actual replication traffic, instead of offloading the responsibility to the storage array or a replication appliance, such as EMC’s RecoverPoint.

Replication Without vCenter

One of the most notable improvements is that you can, with vSphere Replication, begin to create replication topologies that include sites without a vCenter Server. For those with ROBO sites, or just smaller data centers that are clustered in a metropolitan area, this is a very awesome thing indeed. By leveraging a VR Server Appliance and VR Agents, you can now push around bits to any site you wish.

In the example below, you’ll see that the Main Office DC is pushing VMDK1 to the Secondary DC. The Secondary DC is pushing VMDK2 to the Remote Office. And the Remote Office is pushing VMDK3 to the Main Office DC.


RPOs and Point in Time Instances

Although VR still supports a Recovery Point Objective (RPO) of no less than 15 minutes, which is quite sluggish compared to a “near zero” requirement but should be good enough for a good chunk of use cases, you now have the ability to support point in time instances. That sounds cool, right?

The questionable part is the way these points in time are captured: vSphere snapshots. There are definitely two camps on snapshots: those who hate them and those who don’t. The haters camp is usually made up of those with an application that is chatty or fails to quiesce, or who have been burned by a giant string of snapshots that took days to commit.

vSphere snapshots are used by a lot of different tools out there to grab the underlying disk for backups and such, so the use of snapshots (for any reason) is a necessity for many environments. I have mixed feelings on this. On the one hand, I’m not sure I really want to keep 20+ snapshots on my shadow VMs in order to support a point-in-time-ish recovery model. On the other hand, I will say that there definitely needs to be some kind of mechanism in place to support restoring from more than the last replication, because otherwise you run the risk of having a worthless VM replica that suffered corruption or a virus.

At least the snapshots aren’t kept on the protected VM! Here’s a view of the point in time instance policy setting in the new VR:


Also remember that the maximum number of snapshots supported is 24, so you can’t exceed that many point in time recovery snapshots.
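
To make that ceiling concrete, here’s a quick sanity check in Python. This is just a sketch: the helper names are mine, and I’m assuming the policy boils down to instances per day multiplied by a number of days, which is my reading of the setting rather than anything VR exposes as an API.

```python
# Limit stated in the post: no more than 24 point in time snapshots.
MAX_PIT_SNAPSHOTS = 24


def total_instances(instances_per_day: int, days: int) -> int:
    """Hypothetical helper: how many instances a policy would try to keep.

    Assumes (for illustration only) that the total is simply
    instances per day multiplied by the number of days retained.
    """
    if instances_per_day < 1 or days < 1:
        raise ValueError("instances_per_day and days must both be at least 1")
    return instances_per_day * days


def fits_snapshot_limit(instances_per_day: int, days: int) -> bool:
    """Return True when the policy stays at or under the 24 snapshot ceiling."""
    return total_instances(instances_per_day, days) <= MAX_PIT_SNAPSHOTS


if __name__ == "__main__":
    for per_day, days in [(4, 5), (4, 7), (24, 1)]:
        verdict = "ok" if fits_snapshot_limit(per_day, days) else "too many"
        print(f"{per_day} per day for {days} days -> {verdict}")
```

Under that assumption, 4 instances per day for 5 days (20 total) fits, while 4 per day for a full week (28 total) does not.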

Snapshot Instance Math

The math for figuring out when your instances expire is a bit complex. I’ll do my best to break it down 🙂

  • The RPO value dictates the maximum amount of data loss you’re willing to tolerate, measured in hours.
  • The Retention Policy dictates how many snapshots should exist within a given day, measured in a whole number value.
  • The most recent instance of a protected VM is always retained no matter how the policy is configured. This ensures you can always recover the VM.

If you set an RPO of 1 hour, your VM is going to replicate every hour to the recovery site. That’s pretty straightforward. The Retention Policy then works to clean up snapshots so that you only keep the required number of daily snapshots. If you chose 4 instances per day, for example, that means you should have one instance every 6 hours (6 hours * 4 instances = 24 hours = 1 day). Snapshots in between the 6 hour markers will be discarded.

This means that you may end up having 4 snapshots of your VM at 6 hour intervals, even though you configured a 1 hour RPO. Let’s just hope that last replicated instance isn’t corrupt, because otherwise you’re going back in time by another 6 hours.
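
If you’d rather see the math play out, here’s a small Python sketch of it. To be clear, this is my own simplified model and not how VR is actually implemented: I’m assuming replicas land exactly on every RPO boundary, that the day is split into equal retention windows, and that the newest replica in each window is the one that survives. The function name and parameters are made up for illustration.

```python
from typing import List

MINUTES_PER_DAY = 24 * 60


def retained_instances(rpo_minutes: int, instances_per_day: int,
                       elapsed_minutes: int = MINUTES_PER_DAY) -> List[int]:
    """Return the replica times (minutes since start) that survive cleanup.

    Simplified model, not VR's real algorithm:
      * a replica completes at every multiple of rpo_minutes
      * the day is divided into equal windows, one per retained instance
        (integer division, so uneven splits are only approximate)
      * only the newest replica inside each window is kept
      * the most recent replica is always kept
    """
    if rpo_minutes <= 0 or instances_per_day <= 0:
        raise ValueError("rpo_minutes and instances_per_day must be positive")

    window = MINUTES_PER_DAY // instances_per_day
    replicas = range(rpo_minutes, elapsed_minutes + 1, rpo_minutes)

    latest_per_window = {}
    for t in replicas:
        # Windows are (0, window], (window, 2 * window], and so on.
        index = (t - 1) // window
        latest_per_window[index] = t  # later replicas overwrite earlier ones

    kept = set(latest_per_window.values())
    if replicas:
        kept.add(replicas[-1])  # the most recent instance is always retained
    return sorted(kept)


if __name__ == "__main__":
    # 1 hour RPO with 4 instances per day, after one full day.
    hours = [t // 60 for t in retained_instances(60, 4)]
    print(hours)
```

With a 1 hour RPO and 4 instances per day, this model keeps the replicas from hours 6, 12, 18, and 24 and throws away the other 20.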

Here’s a view of the snapshots on a shadow VM. Fun? Ugh. 🙁


Performance Improvements

  • Snapshots aside, there are some pretty nifty improvements with VR. The first is the ability to use Storage vMotion on your protected VMs. If you have a datastore that is filling up or suffering poor performance, you can move the protected VM around without having to worry about it mucking up SRM. That’s handy.
  • The use of VSAN with VR is also supported. This seems logical, as VR just sees it as a storage source to pull data from, but it’s nice to know for sure that you can use it.
  • The vSphere Web Client will show you details on your vSphere Replication status when you click on the vCenter object. This includes target sites, outgoing replications, and incoming replications. You can also monitor or manage your vSphere Replication directly from tabs located in the web client. No more trips over to the C# client’s plugin for your SRM goodies. 🙂
  • And finally, the VR Agent is now able to use parallel I/O paths to push data around. For those replicating a higher quantity of VMs, you should notice a performance boost from the VR Agent to the VR Server at your recovery site.