Okay, so the key thing with any Disaster Recovery solution is to test it to make sure that it works. Unfortunately, most companies will not allow you to power down the production systems to confirm that Disaster Recovery will work, which means we have to come up with ways to test the environments as best we can.
The company that I work for, on the other hand, said that the only way to prove that the company could survive a disaster was to perform a real disaster recovery test, in which the virtual machines in production would be powered down (as if affected by a real disaster) and powered up in the recovery location. The recovery location would then run the virtual machines for one week before failing them back to the production site. The software chosen to provide this solution was VMware Site Recovery Manager.
Our preparation time for this disaster recovery test was short because it had to coincide with the recovery of some other systems, so we decided to run this first test by recovering roughly 60 virtual machines to our recovery site. We began the test just over a week ago, and all 60 virtual machines were recovered to the recovery site in 1 hour 40 minutes (including some IP address changes). In reality the recovery itself took only around 1 hour; the rest was a long delay whilst the storage LUNs were removed from the production environment. In a real disaster, where the protected site no longer exists, these steps would be skipped. This is a great improvement on our previous disaster recovery tests, where we were only able to recover roughly 50 virtual machines and it would take 4 to 8 hours.
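To put those figures side by side, here is a quick back-of-the-envelope calculation of the average time per virtual machine. It is only a sketch using the rounded numbers quoted above (roughly 60 VMs in 100 or 60 minutes, versus roughly 50 VMs in 4 to 8 hours); the function name is mine and nothing here comes from Site Recovery Manager itself.

```python
def minutes_per_vm(vm_count: int, total_minutes: float) -> float:
    """Average wall-clock minutes spent per recovered virtual machine."""
    return total_minutes / vm_count


# Figures are the approximate ones from this post, not measured per VM.
srm_with_lun_delay = minutes_per_vm(60, 100)    # 1 hr 40 min in total
srm_without_delay = minutes_per_vm(60, 60)      # around 1 hr without LUN removal
previous_best = minutes_per_vm(50, 4 * 60)      # previous tests, 4 hours
previous_worst = minutes_per_vm(50, 8 * 60)     # previous tests, 8 hours

print(f"SRM, including LUN removal delay: {srm_with_lun_delay:.2f} min/VM")
print(f"SRM, excluding LUN removal delay: {srm_without_delay:.2f} min/VM")
print(f"Previous tests: {previous_best:.2f} to {previous_worst:.2f} min/VM")
```

Even with the LUN removal delay included, that works out at well under 2 minutes per virtual machine, against roughly 5 to 10 minutes per machine in our earlier tests.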
Site Recovery Manager worked well for this recovery, although we did come across an unusual situation that we believe is linked to using Ephemeral port binding on a vDS with Site Recovery Manager. Seemingly at random, some virtual machines would be recovered but would not be connected to the network when they were powered up at the recovery site. VMware did not have a clear answer when we questioned them about it, but we believe that the number of available ports does not increase quickly enough to cope with the extra virtual machines being recovered, and the machines are therefore left with their network ports disconnected. We are currently testing a switch back to Static port binding (where, under v5 of ESXi, the ports will auto-expand as required) to see if this resolves the issue. I will provide an update once we have completed the testing.
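With 60 machines it is easy to miss one that came up without a network, so it helps to check them in bulk after the recovery plan finishes. The sketch below assumes you have already exported the recovered inventory into plain Python dictionaries (for example from PowerCLI or a vSphere SDK); the `name`, `nics`, `label` and `connected` keys and the sample machine names are hypothetical, chosen only for illustration, and are not a VMware data format.

```python
from typing import Dict, List


def vms_with_disconnected_nics(inventory: List[dict]) -> Dict[str, List[str]]:
    """Return {vm name: [labels of disconnected NICs]} for affected VMs only."""
    affected: Dict[str, List[str]] = {}
    for vm in inventory:
        # A NIC with no "connected" flag is treated as disconnected (assumption).
        down = [
            nic["label"]
            for nic in vm.get("nics", [])
            if not nic.get("connected", False)
        ]
        if down:
            affected[vm["name"]] = down
    return affected


# Hypothetical sample data standing in for an inventory export.
sample_inventory = [
    {"name": "app01", "nics": [{"label": "Network adapter 1", "connected": True}]},
    {"name": "db01", "nics": [{"label": "Network adapter 1", "connected": False}]},
    {"name": "web01", "nics": [
        {"label": "Network adapter 1", "connected": True},
        {"label": "Network adapter 2", "connected": False},
    ]},
]

for vm_name, nics in vms_with_disconnected_nics(sample_inventory).items():
    print(f"{vm_name}: disconnected {', '.join(nics)}")
```

Running a check like this straight after the failover, and again after the failback, gives a quick list of the machines that need their NICs reconnected by hand.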
This morning we performed the failback of the environment after using the ‘Reprotect’ functionality of Site Recovery Manager. This again worked really well. We experienced the same delay when removing the storage and the same issue with NICs on virtual machines being disconnected, but the recovery time was roughly the same as the original failover.
Site Recovery Manager is a very capable piece of software and performs really well. My only gripe is the cost of the solution. If you have the opportunity, purchase it as part of the vCloud Suite (the Enterprise Edition has Site Recovery Manager included), as this proves to be more cost effective in a high-density virtual environment. Purchasing Site Recovery Manager on its own is actually very expensive compared to other solutions, but it does work.