The worst nightmare of any Exchange Server administrator is a database that will not mount, or a hardware failure that stops mail flow. Even seasoned Exchange Server admins break into a sweat when they hear that a database is corrupted, or that the server will not send or receive any email. I have been through this many times, and I believe that most Exchange administrators have as well. Even when we take all reasonable precautions and do everything by the book, there is always a chance that we will get a call in the middle of the night to say that somehow the mail server is no longer accessible.
In the days of Exchange 5.5, Exchange 2000 and Exchange 2003, the only hope of getting back online when a whole server failed was to restore from backup. This was tedious and complex, and it had a low success rate. Recovery would take hours for an average-sized database, during which time everyone from the CEO to the guy at the reception desk would be expressing veiled (and not so veiled!) disappointment with your performance. Just yesterday you were their best buddy, and now you are the target of their frustrations. So not only do we have to keep our backups up to date and follow the backup schedule religiously, but we are also always under pressure when we attempt a recovery. It is therefore very important that we are confident about the recovery process and can get it done with as little downtime as possible. Fortunately, there are now other options that allow us to do a soft or hard recovery without much fuss and with minimum downtime.
During the last few years, third-party Exchange Server high availability and disaster recovery tools have flooded the messaging world. They provide alternatives to traditional tools and to clustering. Before Exchange 2007, the only hope of high availability was to put a cluster in place, with a minimum of two nodes. Even then, the complex licensing system that Microsoft employs made it really hard to sell the idea to management and get the required finances. Besides the investment in hardware, the required investment in software and storage was tremendous. One of my colleagues said that it was better to have a day without email than to send the clustering proposal to senior management. Not that the third-party tools are cheap, but they cost considerably less and have a flat licensing system.
Limitations of Native Exchange Server Recovery Tools
Email has become the ultimate communication method in the corporate world. There have been instances of people working in adjacent cubicles emailing each other their lunch plans instead of just standing up and delivering the message verbally. How many times has the Information Store had to be brought down for maintenance, only for you to reschedule it again and again? I bet it never went on schedule, simply because nobody wants to be parted from their email for even a couple of hours. And how about unscheduled downtime? Say, for example, some firewall genius modified an access rule on the firewall and mail is no longer available to BlackBerry or other push mail devices. Naturally, mail would not be accessible via OWA either.
How many trouble tickets and panic calls did you have to handle? When this happens, people always dig up the fact that it has happened before and that you should have learnt your lesson. All those people in Administration, HR, Engineering and Sales do not want email to be down even for a few minutes, so email has to be up all the time without any break in service. If, unfortunately, the mailbox store becomes corrupt and has to be restored from backup, how much downtime are we looking at? A 100 GB store on Fibre Channel storage would take at least 6 hours on average to restore, and we all know the success rate of such an operation.
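To put that restore window in perspective, a quick back-of-the-envelope calculation helps. The Python sketch below uses only the two figures quoted above (a 100 GB store and roughly 6 hours). The function names and the 250 GB example are made up for illustration, and real restore rates depend on the backup media, the network and the storage.

```python
def restore_throughput_mb_per_s(store_gb: float, hours: float) -> float:
    """Average restore rate in MB/s implied by a store size and a restore time."""
    return (store_gb * 1024) / (hours * 3600)


def restore_hours(store_gb: float, mb_per_s: float) -> float:
    """Estimated restore time in hours for a store at a given average rate."""
    return (store_gb * 1024) / mb_per_s / 3600


if __name__ == "__main__":
    # 100 GB in 6 hours, as quoted in the text.
    rate = restore_throughput_mb_per_s(100, 6)
    print(f"Implied average restore rate: {rate:.1f} MB/s")
    # Hypothetical larger store restored at the same average rate.
    print(f"250 GB at the same rate: {restore_hours(250, rate):.1f} hours")
```

The quoted figures work out to an average of about 4.7 MB/s, so downtime grows in direct proportion to store size: at that rate, a store two and a half times larger would be offline for about 15 hours.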
Let us take a look at the native tools at our disposal for Exchange Server store or server recovery. If it is just a matter of mailbox store corruption, we can use Eseutil or IsInteg to diagnose the problem and try to fix it. Even the simple process of dumping the contents of the checkpoint file (*.chk) would require at least 20-30 minutes. Time is precious, and even if a soft recovery is possible, it would take another 45-60 minutes to get the database back online and mail services restored. So the main limitation of the native Exchange Server recovery tools and procedures is the time they take, and even then success is not guaranteed.
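As a rough illustration of the first diagnostic step, `eseutil /mh` dumps a database header, and that header includes a `State:` line saying whether the database was shut down cleanly. The Python sketch below only parses text that has already been captured from such a dump. The function names are hypothetical, and the sample text is a one-line stand-in for real output, which contains many more fields.

```python
import re
from typing import Optional


def database_state(header_dump: str) -> Optional[str]:
    """Return the value of the 'State:' line from a header dump, or None."""
    match = re.search(r"^\s*State:\s*(.+?)\s*$", header_dump, re.MULTILINE)
    return match.group(1) if match else None


def needs_recovery(header_dump: str) -> bool:
    """True when the dump reports a dirty shutdown."""
    state = database_state(header_dump)
    if state is None:
        raise ValueError("no 'State:' line found in header dump")
    return state.lower().startswith("dirty shutdown")


# Illustrative stand-in for captured output; not a complete header dump.
SAMPLE_DUMP = "            State: Dirty Shutdown\n"

if __name__ == "__main__":
    print(database_state(SAMPLE_DUMP))
    print(needs_recovery(SAMPLE_DUMP))
```

A database in a dirty shutdown state needs its transaction logs replayed (soft recovery, `eseutil /r`) before it will mount, which is where the extra recovery time comes from.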
Advantages and Disadvantages of Third Party Exchange Server DR Tools
Although third-party tools such as CA XOSoft WanSync, Acronis Exchange Server Recovery and so on give administrators much easier interfaces and environments to work with, they come with their own baggage as well.
While working on these technologies, it came as a surprise to me that although they all have the same objective, no two products work the same way. Each product has a different feature set with varying amounts of administrative overhead.
Benefits of Third Party Utilities
- Most of them will provide automatic failover similar to what is provided by an active-passive failover cluster.
- Continuous data replication to the standby server, which reduces email loss in case of disaster.
- Freedom to switch servers as and when required.
- No hardware dependency. The standby server does not need to match the hardware specifications of the active server.
- Eliminates the need for streaming backups.
- User-initiated or automatic backward replication after the master server is back online.
- Scheduled syncs: avoid synchronizing during peak hours and schedule it for off-peak hours instead.
- Site resilience by synchronizing data over the WAN.
- Near automatic mail route redirection (based on DNS, Routing Groups or SMTP connectors)
- No Client re-configuration required.
- Intuitive GUI Administration consoles.
Disadvantages of Third Party Utilities
- Not widely documented or implemented.
- Very limited resources are available for these products, so if something does not work, troubleshooting it is extremely difficult.
- These tools require a lot of nursing and close monitoring.
- Increase in LAN or WAN traffic.
- You get shuffled between Microsoft Technical Support and Vendor Technical Support. In the end you get to figure it out yourself.
About Standby Continuous Replication (SCR) and Clustered Continuous Replication (CCR)
If you have already migrated to Exchange Server 2007 and have the appropriate hardware and software available, then by all means use SCR or CCR. In particular, SCR provides most of the features of third-party DR tools at no extra cost, as it comes bundled with Exchange Server 2007 SP1. In my experience, however, about 50-60% of Exchange deployments are still based on Exchange Server 2003 SP2. CCR and SCR both have their limitations, which are listed below:
- Only Mailbox Servers can be clustered, so remaining server roles have to be recovered manually.
- CCR nodes have to be on the same subnet, so CCR is not site resilient and cannot be considered a true DR solution.
- CCR, Local Continuous Replication (LCR) and Single Copy Clusters (SCC) are high availability solutions, not DR solutions.
- Continuous replication is limited to one database per storage group.
Guidelines for Selecting DR Tool
With so many Exchange Server DR solutions available right now, hunting for the right one can be daunting. Some solutions offer simplicity but are less effective, while others require a lot of looking after but are very effective. Some are really expensive, while others are considerably cheaper. So what should you look out for when choosing a DR solution? The following recommendations should help you decide on the best possible DR tool:
- Choose a solution that offers continuous replication, not just scheduled or streaming replication.
- Must have automatic and manual switchover / failover options.
- Must be application aware; that is, if the MSExchangeMTA service is down, it should first try to restart the service instead of immediately failing over to the standby server.
- Must offer more than one way to redirect network traffic; for example, not only DNS redirection but also an option such as changing the IP address of the standby server to match that of the master server in case of a disaster.
- Must be aware of issues related to IIS, DNS and AD services, so that in case of a disaster these services can be switched over to the standby server automatically.
- Must have a near-zero Recovery Point Objective (RPO), so that little or no mail is lost, and a short recovery time, so that users do not notice any significant amount of downtime.
- Must be site resilient, that is, it should offer replication to a remote site which is not part of the same subnet as that of the Master Exchange server.
- Should offer active notifications and administrative alerts for monitoring replication hygiene.
- Should offer fire drill options, that is, automatic periodic testing of the standby server to ensure that it can take over if the master server goes down.
- Must offer master server recovery after a switchover to the standby server; that is, it should offer reverse replication so that the master server can become functional again.
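The "application aware" requirement in the list above can be sketched as a small decision routine: try to restart the failed service locally a few times, and fail over only if that does not help. The Python below is a minimal sketch of that idea, not how any particular product implements it. The function name, the callback parameters, the retry count of 3 and the `FakeService` test double are all assumptions made for illustration.

```python
from typing import Callable


def handle_service_failure(
    is_running: Callable[[], bool],
    restart: Callable[[], None],
    fail_over: Callable[[], None],
    max_restarts: int = 3,  # assumed retry budget, not a vendor default
) -> str:
    """Try local restarts first; fail over only if they do not help.

    Returns "healthy", "restarted" or "failed_over".
    """
    if is_running():
        return "healthy"
    for _ in range(max_restarts):
        restart()
        if is_running():
            return "restarted"
    fail_over()
    return "failed_over"


class FakeService:
    """Test double standing in for a service such as MSExchangeMTA."""

    def __init__(self, restarts_needed: int) -> None:
        self.restarts_needed = restarts_needed
        self.restarts = 0
        self.failed_over = False

    def is_running(self) -> bool:
        return self.restarts >= self.restarts_needed

    def restart(self) -> None:
        self.restarts += 1

    def fail_over(self) -> None:
        self.failed_over = True


if __name__ == "__main__":
    for needed in (0, 2, 10):
        svc = FakeService(needed)
        outcome = handle_service_failure(svc.is_running, svc.restart, svc.fail_over)
        print(needed, outcome, svc.restarts, svc.failed_over)
```

Passing the checks and actions in as callbacks keeps the decision logic separate from whatever mechanism actually queries or restarts the service, which also makes the logic easy to exercise in a fire drill.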
Once you have decided to deploy the DR tool of your choice, here are some recommendations to make deployment smooth, operation easy and recovery faster:
- Despite everything, never forget to take a full backup on a daily basis.
- Always have more than one internal DNS server with proper forwarders configured.
- Make sure that DNS response time is within acceptable limits.
- Periodically test LDAP query response time.
- Monitor bandwidth, both on LAN and WAN, closely.
- On Master Exchange server, have only the Default Web Site configured. Do not configure any additional Web servers on that machine.
- If connected on the same LAN, put master and replica servers on separate LAN switches.
- Always manage the DR scenario from the Replica Server.
- Configure notifications for events related to the DR scenario.
- Configure notifications for the services related to the DR tool in the Exchange Server monitoring options.
- Do not modify the scenario too often as it could lead to database corruption on the standby server.
- In all probability the documentation about the product will not be available online, so give the vendor a call and have them send the documentation to you (it worked for me).
- Again, never forget that full backup.
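The advice above about watching DNS and LDAP response times can be turned into a simple periodic check. The Python sketch below times a DNS lookup through the system resolver and flags slow samples. The host name `mail.example.com`, the 100 ms limit and the helper names are assumptions for illustration only; what counts as an acceptable limit depends on your environment.

```python
import socket
import time
from typing import Callable, List


def measure_ms(operation: Callable[[], object]) -> float:
    """Run operation once and return the elapsed wall-clock time in milliseconds."""
    start = time.perf_counter()
    operation()
    return (time.perf_counter() - start) * 1000.0


def slow_samples(samples_ms: List[float], limit_ms: float) -> List[float]:
    """Return the samples that exceeded the given limit."""
    return [s for s in samples_ms if s > limit_ms]


def dns_lookup(hostname: str) -> Callable[[], object]:
    """Build an operation that resolves hostname through the system resolver."""
    return lambda: socket.getaddrinfo(hostname, None)


if __name__ == "__main__":
    HOST = "mail.example.com"  # hypothetical host name
    LIMIT_MS = 100.0           # assumed threshold, tune for your network
    try:
        samples = [measure_ms(dns_lookup(HOST)) for _ in range(5)]
    except socket.gaierror as exc:
        print(f"DNS lookup for {HOST} failed: {exc}")
    else:
        print("samples (ms):", [round(s, 1) for s in samples])
        print("over limit:", slow_samples(samples, LIMIT_MS))
```

Repeated lookups may be answered from a local cache, so the first sample is usually the most telling. The same `measure_ms` helper can wrap an LDAP query made with whatever LDAP client you already use.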
Exchange Server administrators, like everyone else, need a normal 6-8 hours of sleep a day, and they may even want to catch a movie sometime. Life does not always have to be about finding that lost email or tracking where it went. Most of the tasks of an Exchange Server admin revolve around creating, deleting, managing and delegating. Backing up and monitoring the Exchange Server is also one of the core tasks, but it is not the only one. So we often do not pay much attention to how we are going to recover the server in case of a disaster.
I can confidently say that 8 out of 10 administrators rarely test their backups, so when a disaster strikes we are often caught unaware. A fire drill of an Exchange recovery scenario is hardly ever done, maybe once a year, but you never know when that E00 log will go missing and you will have to bring your Exchange Server back online before the CEO makes a visit to your cubicle (and believe me, that is not pretty). On average, messaging infrastructure design changes every three years; when it does, make absolutely sure that you have proper disaster recovery plans in place. Investing in third-party tools gives you the flexibility to manage your DR scenario the way you like, and it can reduce the anxiety of recovering Exchange Server from a disaster. Third-party DR tools might not be perfect, but if used carefully, they can provide you with a solid DR plan.