Case Study: Troubleshooting Site Recovery

A silly mistake, a Site Recovery error, and a troubleshooting case study. Let’s look at how it happened.

I was demonstrating Site Recovery during a training session. Enabling Site Recovery is a slow task, so I run the demonstration in the background while covering other subjects.

This also doesn’t leave much room to research problems when they appear. In this post, I describe a mistake I made and how I solved it.

The silly mistake

I started the Site Recovery demonstration while the virtual machines were deallocated, or at least I believe this was the root cause.

Result: The Site Recovery jobs failed.
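A quick check before enabling replication could have caught this. The sketch below is only an illustration, not part of the original demonstration: it assumes the power state is reported as a status code such as "PowerState/running" or "PowerState/deallocated", which is how the VM instance view exposes it (for example through the `statuses` list returned by `virtual_machines.instance_view` in the azure-mgmt-compute package). The function names are hypothetical, and whether a deallocated machine really was the cause remains my assumption.

```python
from typing import Iterable, Optional

POWER_PREFIX = "PowerState/"  # assumed prefix of power state codes in the instance view


def power_state(status_codes: Iterable[str]) -> Optional[str]:
    """Return the power state (e.g. 'running') found in a list of status codes."""
    for code in status_codes:
        if code.startswith(POWER_PREFIX):
            return code[len(POWER_PREFIX):]
    return None  # no power state reported


def ready_for_replication(status_codes: Iterable[str]) -> bool:
    """Hypothetical pre-check: only treat a running VM as ready."""
    return power_state(status_codes) == "running"


if __name__ == "__main__":
    # Sample status codes, shaped like the ones a deallocated VM reports.
    sample = ["ProvisioningState/succeeded", "PowerState/deallocated"]
    print(power_state(sample), ready_for_replication(sample))
```

Running a check like this for each machine before starting the demonstration takes seconds, while a failed replication job costs much more than that to clean up.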

First Solution Attempt

The jobs were still listed, but the replication had failed. With the machines now turned on, I tried the jobs again. It’s only a matter of selecting the failed jobs and choosing Restart.

Result: Failed again.

Second Solution Attempt

Since retrying didn’t work, I decided to remove the machines and add them again. I used the “Replicated Items” menu on the Recovery Services vault and removed the failed virtual machines.

Result: When I tried to enable Site Recovery again, the virtual machines were no longer available. It was not possible to select them.

Third Solution Attempt

Something had been left behind and was preventing the virtual machines from appearing again for Site Recovery.

I checked the extensions on each virtual machine and there it was: the Site Recovery extension, present on both machines. I uninstalled it and tried again.

Result: The virtual machines were still not visible.
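Checking the extensions by hand is fine for two machines, but the same check can be scripted. The following is a minimal sketch, assuming the extension list has already been exported as dictionaries with "name" and "publisher" keys. The marker string and the sample values are assumptions for illustration only, so compare them with what your own machines actually show.

```python
from typing import Dict, List

# Assumption: leftover extensions can be recognised by this text
# in their name or publisher (compared case-insensitively).
MARKER = "siterecovery"


def find_site_recovery_extensions(extensions: List[Dict[str, str]]) -> List[str]:
    """Return the names of extensions that look related to Site Recovery."""
    leftovers = []
    for ext in extensions:
        name = ext.get("name", "")
        publisher = ext.get("publisher", "")
        if MARKER in f"{name} {publisher}".lower():
            leftovers.append(name)
    return leftovers


if __name__ == "__main__":
    # Hypothetical sample data, not copied from a real machine.
    sample = [
        {"name": "SiteRecovery-Windows",
         "publisher": "Microsoft.Azure.RecoveryServices.SiteRecovery"},
        {"name": "MonitoringAgent", "publisher": "Example.Publisher"},
    ]
    print(find_site_recovery_extensions(sample))
```

The function only reports candidates. Uninstalling them is still a separate and deliberate step, and as the result above shows, it was not enough on its own.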

Fourth Solution Attempt

I discovered that the replication process also leaves a mapping between the two virtual networks registered inside the Recovery Services vault. Even if the remote virtual network has already been deleted, the mapping remains and prevents the virtual machines from appearing.

Inside the Recovery Services vault, go to Site Recovery Infrastructure => For Azure Virtual Machines => Network Mapping to remove the link between the virtual networks.

Result: The virtual machines were still not visible.

Final Solution

After a lot of digging through troubleshooting articles, I found a PowerShell script on GitHub capable of removing all the remaining Site Recovery configuration from a virtual machine.

I saved the script to a Cloud Shell file share and executed it for each virtual machine, and everything that was left of Site Recovery was gone.

Result: Finally, the virtual machines were available for Site Recovery.

Conclusion

Removing a replication leaves a lot of garbage behind. Unfortunately, this was not the first time I saw a process like this leave garbage behind, but this time I was able to track it down.

It’s not only about the final solution: you will probably need to execute all or many of the steps here to reach this goal.