vSphere & Storage DR without SRM – Part 1: A Thought Process
I’m hoping to provide a possible solution to an issue that some people may come across in their environments: creating a workable Disaster Recovery solution without utilising SRM. The scenario I’m working through, however, is a little more complex than usual.
So, with a normal vSphere environment there are a number of ways to provide a Disaster Recovery solution, many of which I have used myself. With two data centres, you could utilise Site Recovery Manager (SRM) for your DR option, as long as you have two vCenter servers as well. Another option is to utilise vSphere Replication on its own to provide a level of Disaster Recovery with manual failover. Other options include tools such as Veeam Backup & Replication, which provide a replica of the virtual machines in the second site.
The scenario I’m investigating is one where array replication is already taking place between two data centres, and where the data centres are configured in vCenter as a metro cluster. In this scenario, there is no desire to introduce a second vCenter server. There is no desire to switch away from the array replication process, as this is more efficient than other solutions. Finally, there is no desire to have a number of additional virtual machines showing in the environment from other forms of replication. Please note that although the storage is replicated, the LUNs are only actively presented to a single data centre at a time, and therefore a metro cluster failover is not an option.
As a side note, when it comes to the actual commands, they will be created for an IBM XIV environment.
So my thinking is to create a process (which may utilise several scripts) that allows the environment to perform a test failover, and a failback, to confirm operation, and that could also be used in a real disaster scenario.
With this in mind, we have to think about the obvious disconnect between the storage replication and the virtual machines themselves. Knowing where the virtual machine files reside (or at least the vmx file), including the datastore name and path, is going to be a requirement for almost any solution in this type of scenario. So I would think that we should have a script running that regularly takes an extract from the vCenter environment of the virtual machines, their datastores and their paths.
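As a rough illustration of the kind of extract described above, here is a minimal Python sketch. It assumes the VM names and vmx paths have already been pulled from vCenter by some other means (PowerCLI or pyVmomi, for example) and arrive in vSphere's usual `[datastore] folder/file.vmx` form. The function names, sample VM names, datastore names and CSV columns are my own placeholders, not anything taken from a real environment.

```python
import csv
import io
import re

# vSphere reports file locations as "[datastore name] relative/path/file.vmx"
VMX_PATH = re.compile(r"^\[(?P<datastore>[^\]]+)\]\s+(?P<path>.+\.vmx)$")


def parse_vmx_path(vmx_path):
    """Split a datastore path into (datastore name, path within the datastore)."""
    match = VMX_PATH.match(vmx_path.strip())
    if match is None:
        raise ValueError(f"Not a datastore path to a .vmx file: {vmx_path!r}")
    return match.group("datastore"), match.group("path")


def build_inventory(vms):
    """Turn (vm name, vmx path) pairs into rows ready for export."""
    rows = []
    for name, vmx_path in vms:
        datastore, path = parse_vmx_path(vmx_path)
        rows.append({"vm": name, "datastore": datastore, "vmx_path": path})
    return rows


def inventory_to_csv(rows):
    """Render the inventory as CSV text so it can be saved somewhere safe."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["vm", "datastore", "vmx_path"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical sample data standing in for a real vCenter query
sample_vms = [
    ("app01", "[XIV_DS01] app01/app01.vmx"),
    ("db01", "[XIV_DS02] db01/db01.vmx"),
]
inventory = build_inventory(sample_vms)
print(inventory_to_csv(inventory))
```

Keeping the datastore name and the relative path as separate columns should make it easier to rebuild the registration path later, once the replicated LUN is mounted at the other site. The export itself would need to be stored somewhere that survives the loss of the primary data centre.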
We may also want to take a regular export of the LUN ID mappings on the storage as well. This is all data that may prove useful later on.
For a Disaster Recovery Test, my thought process says that the following steps will be required:
* Power down VMs in primary data centre
* Remove VMs from primary hosts – remove from inventory
* Unmount storage in primary data centre
* Remove storage mappings for LUNs on primary storage
* Refresh hosts in primary data centre
* Wait for the final replication to complete, then stop storage replication
* Map storage LUNs to secondary hosts from secondary storage
* Mount storage in secondary data centre
* Register VMs into secondary data centre using secondary storage
* Power on VMs on secondary hosts
* Test communications
* Reverse storage replication
* Start storage replication
The process can then be repeated for a failback.
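To show how the steps above might hang together as a single runbook, here is a minimal Python sketch. It is only a skeleton under my own assumptions: `build_steps`, `run_runbook` and `placeholder` are hypothetical names, and every step uses a stand-in action where the real vSphere or XIV command would eventually go. It does show the two properties I would want from the eventual scripts: the steps run strictly in order and stop at the first failure, and failback is the same list with the two sites swapped.

```python
def build_steps(source, target, action):
    """Build the ordered step list for moving workloads from source to target.

    The same action is attached to every step purely as a placeholder.
    """
    descriptions = [
        f"Power down VMs in {source} data centre",
        f"Remove VMs from {source} hosts (remove from inventory)",
        f"Unmount storage in {source} data centre",
        f"Remove storage mappings for LUNs on {source} storage",
        f"Refresh hosts in {source} data centre",
        "Complete final replication and stop storage replication",
        f"Map storage LUNs to {target} hosts from {target} storage",
        f"Mount storage in {target} data centre",
        f"Register VMs into {target} data centre using {target} storage",
        f"Power on VMs on {target} hosts",
        "Test communications",
        "Reverse storage replication",
        "Start storage replication",
    ]
    return [(description, action) for description in descriptions]


def run_runbook(steps, dry_run=True):
    """Run steps in order and return the descriptions of those completed.

    Each step is a (description, action) pair, where action is a callable
    returning True on success. In dry-run mode nothing is executed.
    """
    completed = []
    for description, action in steps:
        if dry_run:
            print(f"[dry run] {description}")
            completed.append(description)
            continue
        print(f"Running: {description}")
        if not action():
            print(f"FAILED: {description} (stopping here)")
            break
        completed.append(description)
    return completed


def placeholder():
    """Stand-in for a real action; always reports success."""
    return True


failover = build_steps("primary", "secondary", placeholder)
failback = build_steps("secondary", "primary", placeholder)
run_runbook(failover, dry_run=True)
```

The dry-run default is deliberate: for something this disruptive, I would want to print the plan and read it before anything is actually powered down.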
In a real disaster, the process would probably begin at the steps for the secondary location, since the primary site would not be available. This process also means that you are focusing on ALL VMs on the particular LUNs you are failing over; you can’t do a piecemeal recovery.
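Because the failover works per LUN, it helps to know up front exactly which VMs a given set of LUNs will take with them. The Python sketch below works that out from a list of (VM, datastore) pairs, such as a regular inventory extract might provide. It assumes one datastore per LUN, and `failover_scope` and the sample names are hypothetical. It also flags any VM that has files on a datastore outside the set being failed over, since I would expect such a VM to need extra attention.

```python
from collections import defaultdict


def failover_scope(inventory, datastores):
    """Work out which VMs are caught up in failing over the given datastores.

    inventory is an iterable of (vm name, datastore name) pairs, with one
    pair for every datastore a VM has files on.
    Returns (affected, split): affected lists every VM with files on a
    datastore being failed over; split lists those affected VMs that also
    have files on a datastore that is NOT being failed over.
    """
    failing = set(datastores)
    vm_datastores = defaultdict(set)
    for vm, datastore in inventory:
        vm_datastores[vm].add(datastore)

    affected = sorted(vm for vm, used in vm_datastores.items() if used & failing)
    split = sorted(vm for vm in affected if vm_datastores[vm] - failing)
    return affected, split


# Hypothetical sample inventory: db01 has disks on two datastores
sample_inventory = [
    ("app01", "XIV_DS01"),
    ("db01", "XIV_DS01"),
    ("db01", "XIV_DS02"),
    ("web01", "XIV_DS03"),
]
affected, split = failover_scope(sample_inventory, ["XIV_DS01"])
print("VMs that will move:", affected)
print("VMs spanning other datastores:", split)
```

With this sample data, failing over XIV_DS01 takes both app01 and db01 with it, and db01 is flagged because it also has files on XIV_DS02.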
Over the next few parts, I hope to develop this solution further and start to work on the relevant scripts to piece it all together.