OCS Disaster Recovery, Part 2

Several disasters can strike your Office Communications Server environment, ranging from a corrupted file share all the way to a complete OCS server recovery. None of these will present a problem if you've prepared for disaster recovery beforehand. Johan explains what to do when disaster strikes.

In the previous article, we talked about what information we need to back up from Office Communications Server (OCS) to prepare for Disaster Recovery, and which tools you can use to do this.

Now it’s time to put that preparation to the test, and break our Office Communications Server test environment to demonstrate how we can fix several possible disasters. While I fully expect you to use your natural sense of caution, I must nevertheless warn you not to run these tests in a production environment; always perform them in a dedicated test environment.

Let’s first list the disasters which we’ll create and tackle:

  • Disaster 1: The administrator accidentally removes the location profile
  • Disaster 2: A database gets corrupted
  • Disaster 3: The administrator makes several changes, but then wants to restore the original settings
  • Disaster 4: Edge services won’t start anymore
  • Disaster 5: The Standard Edition Front End Server crashes
  • Disaster 6: The whole site is unavailable due to a disaster

We’ll look at all six of them separately, describing each problem together with the steps needed for recovery of the Communications Server environment.

Disaster 1: The administrator accidentally removes the location profile

As you may know, a location profile contains a lot of information, such as the normalization rules, policies, phone usage records, and routing information, and this profile is used by both the Mediation Server and Front-End Server. So removing it will break the whole voice part of your OCS environment.

The location profiles and normalization rules are stored in the Active Directory, so if someone removes them, there are a few recovery options available to you:

  1. Recreate the location profile and normalization rules manually
  2. Recreate the location profile and normalization rules by using the Enterprise Voice Route Helper to import a saved configuration
  3. Completely restore the Active Directory

So, here’s the problem: on Friday evening, during maintenance, an administrator accidentally deletes the location profile called Utrecht (Utrecht is a city in the Netherlands, for those of you wondering what the word means), even though he receives the following warning:

1128-JV1.jpg

Figure 1 – The warning against deleting your OCS location profile.

Within a few seconds, the location profile is removed and the voice part of OCS doesn’t work anymore.

Importing a Saved Configuration

We will skip recovery option one as this is, in my opinion, not really a sensible option, and I think it’s very clear how to reconfigure the location profile manually. If it’s not clear to you, please have a look at the previous series of articles. Instead, let’s start with option two, recovering the location profile and normalization rules using the Enterprise Voice Route Helper. This tool is part of the OCS 2007 Resource Kit, and can be downloaded for free.

Once you have downloaded the tool, open it and select the file > Import Routing Data option. This will open a new window where you will need to select the backup file. The backup file is created with the same tool, by choosing file > Export Routing Data instead of file > Import Routing Data.

Now, bear in mind that, although the location profile was deleted, there is still a little bit of the voice configuration left behind, specifically the phone usage data, policies, and routing. So, we’ll select the option Merge data from the import file with existing data, and remove the checkmark in front of the option Only import location profiles and policies. This will ensure that a complete restore of the voice configuration is made.

1128-JV2.jpg

Figure 2 – Importing and merging the backup information.

Once the configuration is loaded by the Enterprise Voice Route Helper, we can choose the file > Upload Changes option to complete the process. After the configuration has been uploaded, you’ll get an overview of which settings have been uploaded successfully and, in the cases where settings have been merged, a remark is made.

1128-JV3.jpg

Figure 3 – Change report after using the import functionality

This is the fastest and most painless solution for recovering your Location Profile, so now let’s take a look at the more time-consuming way.

Recovering Active Directory

Personally, I wouldn’t choose this option because it brings a lot of risks with it and, during the Active Directory restore process, users can’t log on to the domain if you have only one domain controller (which is the case in our test environment, although you’ll have more than one domain controller in most production environments).

Since we’ve previously made a backup using Windows Server Backup, we also need to use that same tool for restoring the Active Directory. One of the requirements of performing an Active Directory recovery is that you will need to restart the server in Directory Services Restore Mode (DSRM) and log on locally, so before doing this, make sure you know the DSRM password and stop all OCS services (or, even better, turn all OCS machines off).

Next, reboot the Domain Controller in Directory Services Restore Mode and start Windows Server Backup, where you’ll need to select the option to perform a recovery, at which point a wizard will be started to guide you through the process. Select the most recent backup of your domain controller and, when asked what you would like to restore, select the system state.

1128-JV4.jpg

Figure 4 – Performing an Active Directory recovery using Windows Server Backup.
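For readers who prefer the command line, the same restore can be outlined with bcdedit and wbadmin. This is only a rough sketch under stated assumptions, not a tested procedure: the version identifier is a placeholder, and the commands are meant to be run one step at a time from an elevated prompt on the domain controller.

```powershell
# Step 1 (normal mode): make the next restart boot into DSRM, then restart.
bcdedit /set safeboot dsrepair
shutdown /r /t 0

# Step 2 (after logging on locally with the DSRM account): list the available backups.
wbadmin get versions

# Step 3: restore the system state. The version identifier below is a
# placeholder; use one reported by "wbadmin get versions".
wbadmin start systemstaterecovery "-version:03/31/2010-09:00"

# Step 4: switch back to a normal boot before the final restart.
bcdedit /deletevalue safeboot
```

If the backup lives on a network share, wbadmin also accepts a -backupTarget argument pointing at that share.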

Because this is a domain controller, you will be asked whether you need to perform a non-authoritative or an authoritative restore. A non-authoritative restore just restores the object without increasing its version number. If you have multiple Domain Controllers, the result is that the others will replicate the deletion back and remove the restored object again. An authoritative restore restores the object and increases its version number. The higher version number tells the other Domain Controllers that this is the most recent copy of the object, so the object is replicated to them. Depending on your Disaster, you might have to choose an authoritative restore, for example when you have multiple Domain Controllers; but since we only have one Domain Controller, we can select the non-authoritative option. When the system state is restored and the server has restarted, you may want to start ADSI Edit to verify that the location profile has been restored successfully.

1128-JV9.jpg

Figure 5 – Confirming the location profile restoration.

You will find both the location profile and the normalization rules in the Configuration partition of the Active Directory, under the Services > RTC Service node. Finally, once you’ve confirmed that the location profile has been restored, you can start the OCS servers/services again.

Besides using your backup software to restore the objects there are a few other options:

  • Active Directory Recycle Bin: this is a new feature of Windows Server 2008 R2 and needs to be enabled manually. If you’re not using it at the moment, have a look at this article.
  • ADRestore, a package which is a part of Sysinternals
  • Object Restore for Active Directory, a package which can be downloaded for free from Quest Software

As the article would get too big if we looked at all of these tools, I’ll pick just one: Active Directory Recycle Bin.

First we want to know which items have been deleted. To find them, we will use the Get-ADObject cmdlet, which needs to be executed from the Active Directory Module for Windows PowerShell:
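A minimal sketch of such a query, assuming it is run on a machine where the Active Directory module is available (the configuration naming context is read from the directory rather than typed in):

```powershell
# Load the Active Directory module (Windows Server 2008 R2 or RSAT).
Import-Module ActiveDirectory

# Read the distinguished name of the configuration partition from the root DSE.
$configNC = (Get-ADRootDSE).configurationNamingContext

# List every deleted object in the configuration partition.
Get-ADObject -SearchBase $configNC -Filter 'isDeleted -eq $true' -IncludeDeletedObjects |
    Format-Table Name, ObjectClass, DistinguishedName -AutoSize
```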

This will search for all deleted items in the configuration partition of our Active Directory. You might want to fine-tune this, as it may return a pretty large list. In the example below just two entries are visible:

1128-JV5.jpg

Figure 5 – Overview of deleted Active Directory items.

Now that we have the overview, we can fine-tune our filter, since we have more information about the object we would like to restore. There are multiple values which we can use: DistinguishedName, ObjectClass or ObjectGUID. Besides fine-tuning the filter, we will need to pipe the output to the Restore-ADObject cmdlet, which will restore the object to its original place. In this case we only need to restore the Active Directory object which has msRTCSIP-LocationProfile as its ObjectClass:
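A sketch of what such a pipeline might look like, again reading the configuration naming context from the directory; treat it as an illustration to adapt, not a verified copy of the command used in the test environment:

```powershell
Import-Module ActiveDirectory
$configNC = (Get-ADRootDSE).configurationNamingContext

# Find deleted objects of class msRTCSIP-LocationProfile and restore each one.
# Append -WhatIf to Restore-ADObject to preview the restore without changing anything.
Get-ADObject -SearchBase $configNC -IncludeDeletedObjects `
    -Filter 'isDeleted -eq $true -and ObjectClass -eq "msRTCSIP-LocationProfile"' |
    Restore-ADObject
```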

The above command will perform a query that will search for all deleted items which have msRTCSIP-LocationProfile as ObjectClass. The output is then used by Restore-ADObject to restore the object. As you can see, this is less work than performing a restore using your backup software. Unfortunately, you will need a forest functional level of Windows Server 2008 R2 for this, which might not be the case in every environment.
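To check that requirement and enable the feature, something along these lines should work. It is a sketch that assumes you have Enterprise Admin rights; note that enabling the Recycle Bin cannot be undone, and that it only protects objects deleted after it was enabled.

```powershell
Import-Module ActiveDirectory

# Show the current forest functional level.
$forest = Get-ADForest
$forest.ForestMode

# Only attempt to enable the Recycle Bin when the forest is at the 2008 R2 level.
if ($forest.ForestMode -eq 'Windows2008R2Forest') {
    Enable-ADOptionalFeature -Identity 'Recycle Bin Feature' `
        -Scope ForestOrConfigurationSet -Target $forest.Name
}
```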

Disaster 2: The Database Gets Corrupted

The second Disaster will not happen often, but may be caused by, for example, an unsuccessful upgrade of the Backend database by one of the OCS patches. If this happens, several services, including the Front End Service, cannot be started anymore because some configuration data is stored in the now-corrupted database.

When looking at our test environment there are two databases which are really important:

  • RTC (Real-Time Communications) – used for storing persistent user data: access control list, allow/block lists, contacts, home server/pool, and scheduled conferences.
  • RTCConfig – used for storing persistent OCS 2007 R2 global-level, pool-level and computer-level settings. This database is optional, as most settings are stored in Active Directory.

Let’s assume both get corrupted due to an upgrade error. In this case, we have two options for restoring the databases, matching the two backup methods described in part 1 of this article:

  • Restore using SQL Server Management Studio
  • Restore using Windows Server Backup

The choice as to which option you’re going to use to restore the OCS databases depends on which software you’re using to create the backup of the databases in the first place. However, before performing either option, stop all OCS services.
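Stopping the services can be scripted. The sketch below assumes that the display names of the OCS services all begin with "Office Communications Server"; check the list it prints before relying on it.

```powershell
# Assumption: OCS service display names begin with "Office Communications Server".
$ocsServices = Get-Service |
    Where-Object { $_.DisplayName -like 'Office Communications Server*' }

# Review what matched before stopping anything.
$ocsServices | Format-Table Name, DisplayName, Status -AutoSize

# Stop the services that are running; -Force also stops dependent services.
$ocsServices | Where-Object { $_.Status -eq 'Running' } | Stop-Service -Force
```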

Restore using SQL Server Management Studio

Let’s start with the SQL Server Management Studio option; once the tool is running, perform the following actions:

  • Right-click on the corrupted database
  • Select the Tasks > Restore > Database option
  • Check if the correct database is selected in both the to database and from database fields
  • Select the Options > Overwrite the existing database option
  • Finally, press the OK button to start the restore process

Perform these actions for both databases (RTC and RTCConfig) and, once they are restored, start all OCS services again.
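The same restore can also be scripted with sqlcmd instead of clicking through the dialogs. This is a minimal sketch, assuming the databases live in a local SQL Server instance named RTC and that the backup files sit in a hypothetical C:\Backup folder, one .bak file per database:

```powershell
# Assumed instance name and backup folder; adjust both to your environment.
$instance = '(local)\RTC'

foreach ($db in 'rtc', 'rtcconfig') {
    # WITH REPLACE corresponds to the "Overwrite the existing database" option.
    $query = "RESTORE DATABASE [$db] FROM DISK = N'C:\Backup\${db}.bak' WITH REPLACE"
    sqlcmd -S $instance -E -Q $query
}
```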

Restore using Windows Server Backup

Next, let’s try the second option: restoring the databases using Windows Server Backup. Use the following steps to restore the databases:

  • Select the Recover…> This Server option
  • Select the Remote Shared Folder source option
  • Select the most recent backup
  • Select the Files and folders option, and select the volume where the folders are located
  • Select the Original location option, followed by the Overwrite the existing versions with the recovered versions option
  • Finally, check the summary and click on the Recover button to start the restore process

After the restore has been completed successfully, start all the OCS services.
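The wizard steps above have a command-line counterpart in wbadmin. The following is a sketch only: the share, the version identifier and the folder holding the database files are all placeholders that you would replace with your own values.

```powershell
# Hypothetical remote shared folder that holds the backups.
$target = '\\backupserver\ocsbackup'

# List the available backups and note the version identifier of the most recent one.
wbadmin get versions "-backupTarget:$target"

# Placeholder version identifier and database folder.
$version = '03/31/2010-09:00'
$items   = 'D:\OCSData'

# Restore the folder to its original location, overwriting the existing files.
wbadmin start recovery "-version:$version" "-backupTarget:$target" `
    -itemType:File "-items:$items" -recursive -overwrite:Overwrite
```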

Disaster 3: The Administrator Makes Several Changes, But Wants to Restore the Original Settings

In this Disaster, an administrator has made several changes to the global settings, but forgot to write down the original ones before the changes were applied, and now they want to restore all the settings to what they were before they started.

To restore the original settings we can use the LCSCmd command-line tool, which we discussed previously, with the /config parameter:
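A sketch of such an import, run from PowerShell. The location of LCSCmd.exe, the export file name and the pool name are assumptions for illustration; substitute your own values.

```powershell
# Assumed default location of LCSCmd.exe; adjust if your installation differs.
$lcscmd = Join-Path $env:CommonProgramFiles 'Microsoft Office Communications Server 2007 R2\LCSCmd.exe'

# Import the global and pool settings from a previously exported XML file.
# The /level value is quoted so that PowerShell does not split it at the comma.
& $lcscmd /config /action:import '/level:global,pool' `
    '/configfile:C:\Backup\OCS-GlobalPool.xml' '/poolname:ocspool01'
```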

Using the /action:import parameter, we tell LCSCmd that it has to import data from the XML file which is specified with the /configfile parameter. As we only need to restore the global and pool settings, we use the /level:global,pool parameter to tell LCSCmd to import both the pool and global settings.

1128-JV6.jpg

Figure 6 – Logging of the restoration of the original OCS settings.

The above screenshot shows part of the logging which is generated while executing the command. If you would like to see more detailed logging, you can check the log file which is created automatically in the temp folder of the user; as you can see, it’s an HTML file, just like the OCS setup log files.

Disaster 4: Edge Services Won’t Start Anymore

After installing a Windows update and rebooting, the Edge services might not start anymore, which will prevent communications with external contacts. On top of this, colleagues who are trying to connect from outside won’t be able to anymore.

As the backup procedure was not explicitly mentioned in the previous article, let’s first have a look at that step. It’s exactly the same as making a backup of the computer configuration using the LCSCmd command, for example:
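A sketch of such an export, run on the Edge Server itself. The LCSCmd.exe location, the output file and the server FQDN are placeholders for illustration.

```powershell
# Assumed default location of LCSCmd.exe; adjust if your installation differs.
$lcscmd = Join-Path $env:CommonProgramFiles 'Microsoft Office Communications Server 2007 R2\LCSCmd.exe'

# Export the machine-level settings of the Edge Server to an XML file.
& $lcscmd /config /action:export /level:machine `
    '/configfile:C:\Backup\Edge-Machine.xml' '/fqdn:edge01.domain.com'
```

Store the resulting XML file somewhere other than the Edge Server, so that it is still available when the server has to be rebuilt.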

Recovering an Edge Server is not very difficult, and only a few steps are needed to bring it back into production. However, to make this Disaster worse, you don’t have a backup of the server, only an export of the configuration settings. In this case, we’ll need to perform the following steps:

  • Reinstall the OS
  • Reinstall the OCS Edge role
  • Recover the configuration
  • Reinstall the certificates

The reinstallation of the OS, OCS Edge role and certificates should proceed exactly as normal, so I won’t cover them here. As for the configuration, the nice thing about the Edge Server is that it has the option to import a configuration during its initial setup, on the appropriately-named Configure Edge Server page. With this option we will use the backup file which was created using the LCSCmd command. This file is an XML-formatted file which contains all settings of the Edge Server.

1128-JV7.jpg

Figure 7 – Importing an Edge Server configuration.

Just select the import settings option and then use the browse button to select the backup of the configuration you want to load. After the installation is completed, reinstall the certificates, start the services, and your Edge Server is up and running again.

Disaster 5: Standard Edition Front End Server Crashes

This is a really nasty Disaster: our complete Front End Server crashes and, as it contains both the Front End Server installation and the SQL databases, we’ll need to restore both the OCS configuration files and RTC database. The restoration of our Front End Server can be divided into 5 steps:

  1. Reinstall the operating system
  2. Install OCS 2007 R2 according to the settings of your original server
  3. Restore the RTC database
  4. Restore the content of the file shares: Meeting Content and Meeting Content Metadata
  5. Restore the settings

So, to recover our Front End Server, we’ll first install the operating system and the prerequisites for OCS 2007 R2 as normal. In the previous series of articles I described how to install OCS 2007 R2 step-by-step, so we’ll skip those steps in this article.

Once we have recovered the OS and installed OCS 2007 R2, it’s time to recover the RTC database. In part 1 of this article we discussed how to make a backup using SQL Server Management Studio, so we will use the same program to restore the database. To restore the database, perform the following steps:

  • Stop all OCS 2007 R2 services
  • Copy the backup file of the database to a local volume, as you can’t restore a backup file using a UNC path.
  • Start SQL Server Management Studio
  • Right-click on the RTC database and choose the Tasks > Restore > Database option
  • When prompted for the Source for Restore, select the From device option
  • Click on the button with the three dots
  • Select the Add option, and then click OK to close the window
  • Select the backup which you would like to restore (in most scenarios this will be the most recent backup), then select the Options item on the left side of the screen
  • Select the Overwrite the existing database option
  • Click on OK to start the restore process

Now that the database has been restored, it’s time to restore the global, pool and server configurations using Lcscmd (as we did earlier). First we’ll restore the global and pool settings:
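As a sketch, with the LCSCmd.exe location, the export file and the pool name being assumptions for illustration:

```powershell
# Assumed default location of LCSCmd.exe; adjust if your installation differs.
$lcscmd = Join-Path $env:CommonProgramFiles 'Microsoft Office Communications Server 2007 R2\LCSCmd.exe'

# Import the global and pool settings exported from the original server.
# The /level value is quoted so that PowerShell does not split it at the comma.
& $lcscmd /config /action:import '/level:global,pool' `
    '/configfile:C:\Backup\OCS-GlobalPool.xml' '/poolname:ocspool01'
```

Some restore procedures for a rebuilt server also pass a /restore:true switch; it is not covered in this article, so check the LCSCmd documentation before adding it.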

Once this is done, we can restore the machine-specific settings:
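Again as a sketch, with a hypothetical export file and server FQDN:

```powershell
# Assumed default location of LCSCmd.exe; adjust if your installation differs.
$lcscmd = Join-Path $env:CommonProgramFiles 'Microsoft Office Communications Server 2007 R2\LCSCmd.exe'

# Import the machine-level settings for the rebuilt Front End Server.
& $lcscmd /config /action:import /level:machine `
    '/configfile:C:\Backup\OCS-Machine.xml' '/fqdn:ocsfe01.domain.com'
```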

Now that all settings have been restored, we can start all OCS services again. To check if everything is working correctly, we can use the internal validation tests which come bundled with OCS, and which can be found in the administration tools of OCS:

  • Open the OCS management tool and expand the Forest
  • Expand Standard Edition Servers, followed by the Pool
  • Right-click on the Front End Server, select the Validation option, and perform all of the validation tests

1128-JV8.jpg

Figure 8 – the OCS validation tests.

Once all of the validation tests have passed, you can do some additional functionality tests by logging in and using the Communicator and Live Meeting clients. When the results of all tests are successful, users can then continue to use OCS as normal.

Disaster 6: The Complete OCS Site is Unavailable Due to a Disaster

This last Disaster is the worst thing that could happen to a company in terms of an OCS disaster: a complete OCS site is lost. In this Disaster we’ll have a disaster recovery site in place, which contains a second OCS environment which can be brought online in an emergency such as this. Given that I have yet to discuss the setup of a Disaster Recovery site, I’ll run through that process now.

Deploying a DR OCS site

Let’s first discuss how you can build a disaster recovery (DR) site for OCS; the requirements for the DR OCS site are listed below:

  • The Active Directory domain of the DR site must be the same as the original site’s
  • The network configuration should be the same
  • The Pool name must be different from the one used in the production site
  • The _sipinternaltls and _sip._tcp/_sip._tls DNS records should be modified to point to the DR site
  • The CA (Certificate Authority) server needs to be available
  • The backups of the primary site will need to be accessible from the DR site

The first step is to ensure you will have the necessary hardware available, including load-balancers and voice-gateways if those are used in your production environment. After you have all hardware available, it’s time to install the servers.

We’ll start with the installation of a domain controller (DC) and ensure that it’ll contain a replica of the production environment. This can be easily done by adding the DC as an additional one and simply placing it in a separate site. Once you’ve brought the DC online, you can install the CA server for the DR site, which will only be used by servers in the DR site. Once you’ve set up the core infrastructure, you can start deploying your DR OCS site.

For this, I’d like to refer to the steps mentioned in my previous series of articles on how to install OCS, but there is one thing you will need to keep in mind; the OCS Pool name must be different from the one used in production to prevent any conflicts.

In addition to the default software, you’ll need to install SQL Server Management Studio on the Front End Server in order to restore the databases from the primary site in case of a disaster.

Failover to DR site

Now for the real work: your primary site is completely lost and you will need to bring back the OCS services ASAP. Because external DNS changes can take a long time to come into effect (24 hours), it is a good idea to make them first, and ensure that they point to the Edge Server in the DR site. The following records will need to be changed:

  • sip.domain.com: needs to point to the external IP of the Access Edge Server
  • meeting.domain.com: needs to point to the external IP of the Web Conferencing Edge Server
  • av.domain.com: needs to point to the external IP of the Audio/Video Edge Server
  • ocs.domain.com: needs to point to the external IP address of the reverse proxy Server
  • _sipfederationtls._tcp.domain.com: needs to point to sip.domain.com
  • _sip._tls.domain.com/_sip._tcp.domain.com: needs to point to sip.domain.com

Once the DNS changes have been submitted, it’s time to recover the RTC database, which can be done using the same steps as in Disaster 5 (Recovering from the crash of a Standard Edition Front End Server).

After the restoration has completed, start all OCS services again. Since the clients are configured to automatically configure their internal and external servers, we will need to modify the following records in our internal DNS zone to ensure that configuration happens correctly:

  • sip.domain.com: needs to point to the Front End Server
  • _sipinternaltls._tcp.domain.com: needs to point to sip.domain.com

Now that all these changes are made, you can check if everything is working again by running the validation tests as previously discussed. Once all of the tests have been successfully completed, it’s time to migrate the users from the old pool to the new pool. Since the old pool is not available, we will need to perform a forced migration of the users.

Migrating users can be done in two ways:

  • Via Active Directory Users and Computers
  • Via the OCS Management Console

Let’s explain both of them:

Active Directory Users and Computers

  • Open Active Directory Users and Computers
  • Select the OU where the users are located and right click on it
  • Select the option Move Communication Server Users
  • Select the DR Site OCS pool
  • Check the option force

OCS Management Console

  • Open the OCS Management Console
  • Expand the Earlier Servers node
  • Expand the pool
  • Right click the Users node and select the option Move Communication Server Users
  • Select the DR Site OCS pool
  • Check the option force

If you want to first make sure the DR site really works, then move a subset of the users, including your own account, before moving the rest of the users. Once migrated, try to sign in using the Communicator client and verify that everything works, although this may require an ipconfig /flushdns to clear the DNS cache of your client.
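A quick way to check from a client that the changed records are being picked up is shown below. It uses the placeholder domain.com from the record lists above; replace it with your own SIP domain.

```powershell
# Clear the client's cached DNS answers so the new records are picked up.
ipconfig /flushdns

# Check that the automatic-configuration SRV record resolves as expected.
nslookup -type=SRV _sipinternaltls._tcp.domain.com

# Check the host record that the SRV record points to.
nslookup sip.domain.com
```

The address returned for sip.domain.com should now be that of the Front End Server in the DR site.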

Recovering the Production site

When the production environment has been brought back online, you may have the option to reuse your servers if they’ve not been damaged by the disaster that brought your site down in the first place.

In this case, we first need to create a backup of the RTC database used in the DR site. When this has been done, you can log in to the OCS production environment to deactivate the server roles in the following order:

  • Microsoft OCS 2007 R2, Audio/Video Conferencing Server
  • Microsoft OCS 2007 R2, Web Conferencing Server
  • Microsoft OCS 2007 R2, Web Components Server

The deactivation can be done using the following steps:

  • Ensure all OCS services are stopped, as this will ensure that no one is connected to the production OCS environment
  • Expand the forest node, followed by the Standard Edition Servers node
  • Expand the pool
  • Right-click on the Front End server and select the Deactivate option, followed by the specific server role which you would like to deactivate

Repeat this process for all the server roles mentioned above, but don’t deactivate the Front End Server role just yet. The next step is to remove all of the users from the production pool, as they have been migrated to the DR pool. Before you do this, make sure you have created a successful backup of the RTC database in the DR site; after verifying that you have, right-click the Users node and choose the Delete Users option.

When all of the users have been removed from the pool, it’s time to remove the pool itself, and this can be done by finally deactivating the Front End Server using the steps mentioned earlier.

Once all of these tasks have been completed, you will need to remove the following software components using the add/remove programs or Programs and Features control panel:

  • Microsoft OCS 2007 R2, Administrative Tools
  • Microsoft OCS 2007 R2, Audio/Video Conferencing Server
  • Microsoft OCS 2007 R2, Standard Edition Server
  • Microsoft OCS 2007 R2, Web Conferencing Server
  • Microsoft OCS 2007 R2, Web Components Server

The previous steps look a little bit strange, or at least that’s what I thought when first seeing them. To confirm that these are, in fact, accurate instructions, I contacted Rick Kingslan, who is working at Microsoft as the Technical Writer for Office Communications Server. He came back with the following answer:

From what I recollect in this Disaster, the WMI data can no longer be counted on as being accurate and correct. Plus, you are – essentially – dumping the databases and recreating them.  For all intents and purposes, this leaves you in a case where all servers need to be rebuilt, roles re-applied, and server components re-built.

As a last step to clean up all of the lingering fragments of data, delete the content of the following folders:

  • Address Book Files
  • Data MCU Web\Web
  • Data MCU Web\Non-Web

All of these folders can be found in the C:\Program Files\Microsoft Office Communications Server 2007 R2\Web Components directory, assuming you installed OCS 2007 R2 in the default location. Once the removal process has been completed, we will start the reinstallation of OCS as normal, and will use the same pool name as it had before the disaster.
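Emptying those folders can be scripted as well. The sketch below assumes the default installation path and runs in preview mode: the -WhatIf switch only reports what would be deleted, and has to be removed to actually delete anything.

```powershell
# Default installation path; adjust if OCS 2007 R2 was installed elsewhere.
$webComponents = 'C:\Program Files\Microsoft Office Communications Server 2007 R2\Web Components'

foreach ($folder in 'Address Book Files', 'Data MCU Web\Web', 'Data MCU Web\Non-Web') {
    $path = Join-Path $webComponents $folder
    if (Test-Path $path) {
        # Delete the contents of the folder but keep the folder itself.
        Get-ChildItem -Path $path -Force | Remove-Item -Recurse -Force -WhatIf
    }
}
```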

After the installation has been completed successfully, stop all the OCS services (if you’ve already started them), and restore the RTC database using the same steps which were used to recover the database in the DR OCS site. After the restore has been completed, you can (re)start the services.

Now that we have completely recovered our production environment, you can perform some basic validation tests before you migrate the users back to the production site:

  • Open the OCS Management Console and select the DR OCS Site pool
  • Right-click on the users node and choose the Move Communications Server Users option without selecting the force option
  • Wait until all the users have been migrated successfully. If the migration of a user fails, perform the migration of that user again to see if it succeeds the second time. If it fails again, you might want to perform a forced migration, but keep in mind that the user’s OCS data will be lost in this case.

Once all users have been migrated, you can change the external and internal DNS records to point back to the production environment again, and you’re back in business.

Summary

Here ends the second and last part of the OCS disaster recovery articles. We’ve seen what information is essential for disaster recovery and ongoing OCS functionality, and we’ve seen the easiest ways to back that information up.

Even better, we’ve seen that, even given some fairly spectacular disaster situations, the steps needed to recover our OCS environment are all relatively simple, and the processes for different types of recovery are often relatively similar.

If you have any questions about recovering your OCS site, please feel free to comment below, and I hope you enjoyed reading these two articles.

This article was commissioned by Red Gate Software, engineers of ingeniously simple tools for optimizing your Exchange email environment.