Planning for the Unexpected
“The chances of anything coming from Mars are a million to one,” he said.
“The chances of anything coming from Mars are a million to one”
– but still they come!
Jeff Wayne: The War Of The Worlds: The Eve Of War
Every company that relies on IT to run its business is vulnerable to disaster. Recovering from disaster effectively and rapidly requires planning, and the plan must be regularly tested and rehearsed. Planning isn’t just about defining strategies for recovering from a disaster; it also involves advising on ways of preventing disasters and identifying means of detecting them as soon as possible.
Anyone tasked with planning for data recovery needs to be a natural pessimist with a good imagination. When preparing a recovery plan for company data, it is best to assume that, in a severe disaster, there is likely to be nothing left of the current data center or servers. I have lain awake at night, thinking of how databases could be restored after a pandemic, revolution, or tsunami. I can’t even enjoy zombie films without wondering what the consequences would be for data centers.
Disaster recovery planning for data is just a part of the broader context of preparing essential plans to ensure the continuity of the enterprise. Many governments make such planning mandatory. A Business Continuity Plan (BCP) is much broader in scope than the data. Such plans deal with key personnel, facilities, crisis communication, and reputation management. The Disaster Recovery Plan (DRP) is a part of a BCP, and to drill down further, the Data Recovery Plan is a component of the DRP.
To plan adequately for the recovery of the company’s data, you have to look at the broader context. It is not enough to have the backups. Recovery implies that there is somewhere to restore them. As well as the obvious requirements for hardware and networking, there are the details, so easily overlooked, that will make all the difference.
Database applications tend to need a wide range of ancillary hardware and software in order to function properly. They have data feeds, third-party components that require license keys, and a whole range of functionality that requires encryption keys, passwords, DNS entries, firewall settings, and network configuration. Just getting everything to the right version level would require change-management records and software release documentation. We need to know the way that backup dates are recorded in order to get the right backups. There are also configuration files, firewall rules, format files, and structured data files squirrelled away somewhere. A database server may be running an old version, and the database might not run on later versions. Do you have the installation disks for the server, or for ‘heritage’ versions of the operating system that might prove to be essential? Can you afford to pull all the software versions you need off the internet, especially after a major disaster that swamps public networks? These are all questions that you must be able to answer confidently.
The Data Recovery Plan
A Data Protection and Recovery Plan needs to be as simple as possible, while taking into account the whole spectrum of possible disasters in various degrees of severity. It needs to allow for the time each recovery operation takes, as well as the manpower requirements to implement it. A Data Protection and Recovery Plan must also incorporate strategies for recovering, as a priority, the most essential systems, even if this allows the less vital systems to languish temporarily.
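To make the idea of prioritized recovery concrete, here is a minimal sketch in Python. It assumes each system has been given a priority and an estimated restore time, and that a single team restores systems one after another. The names System, priority, restore_hours, and recovery_schedule are hypothetical, chosen only for illustration; a real plan would also have to account for dependencies between systems and for available staff.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    priority: int         # 1 = most essential (hypothetical scale)
    restore_hours: float  # estimated time needed to restore this system


def recovery_schedule(systems):
    """Order systems most-essential first and compute start/finish hours.

    Assumes restores run sequentially; ties in priority are broken by
    restoring the quicker system first.
    """
    schedule = []
    clock = 0.0
    for s in sorted(systems, key=lambda s: (s.priority, s.restore_hours)):
        schedule.append((s.name, clock, clock + s.restore_hours))
        clock += s.restore_hours
    return schedule


if __name__ == "__main__":
    plan = recovery_schedule([
        System("intranet wiki", 3, 2.0),
        System("payroll", 1, 4.0),
        System("accounts", 1, 3.0),
    ])
    for name, start, finish in plan:
        print(f"{name}: hours {start} to {finish}")
```

Laying the schedule out this way shows at a glance how long the less vital systems will languish while the essential ones are restored.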
The idea of preparing written plans and instructions for restoring database systems seems like an alien and rather unattractive activity to many IT professionals. If you ask admins how likely it is that they might not be around after a disaster to give pithy commands out of the corner of their mouths, they find the idea far-fetched. Of course they’d be there, just like the unsinkable Molly Brown. In reality, even relatively common events such as blizzards or flu epidemics soon expose this over-optimism, and the reason that disaster-recovery planners insist on what seems like obsessive documentation becomes apparent. It doesn’t take an attack by flesh-eating zombies to prevent the experts from taking charge. Many recoveries have to be made by staff from other departments, reading from procedure scripts and written documentation. You may be able to direct a recovery from home, but the chances aren’t good, especially if remote access is unreliable.
Any business needs to have performed a Business-Impact Analysis (BIA) before attempting to write an intelligent recovery plan. There must be agreed Recovery Time Objectives (RTOs) and Recovery Point Objectives (RPOs) for each of the key business processes, such as payroll and accounts, so that the data professionals can plan for data recovery based on the business’s priorities. The DR planners need to refer to the BIA and the security audit to determine the security classification of the data. If data is subject to statutory regulation, that could affect how it’s backed up and the most effective way to recover it. There is a world of difference in cost between plain, ordinary offsite storage and secure offsite storage. The planners will have to meet with the technology team, application team, and network administrator(s) to make sure all views are represented.
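As a rough illustration of how agreed objectives can be checked against current practice, the following Python sketch compares a backup interval and an estimated restore time with the RPO and RTO for a business process. It rests on a simplifying assumption: the worst-case data loss is taken to be the interval between backups. The names Objective and gaps, and the figures used, are hypothetical.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Objective:
    process: str
    rto: timedelta  # longest tolerable time to restore the service
    rpo: timedelta  # longest tolerable window of lost data


def gaps(objective, backup_interval, estimated_restore):
    """Return the objectives ("RPO", "RTO") that current practice misses.

    Assumption: worst-case data loss equals the interval between backups.
    """
    problems = []
    if backup_interval > objective.rpo:
        problems.append("RPO")
    if estimated_restore > objective.rto:
        problems.append("RTO")
    return problems


if __name__ == "__main__":
    payroll = Objective("payroll", rto=timedelta(hours=4), rpo=timedelta(hours=1))
    # Nightly backups with a two-hour restore miss the one-hour RPO.
    print(gaps(payroll, timedelta(hours=24), timedelta(hours=2)))
```

A check like this is only as good as the restore-time estimates fed into it, which is one more reason to rehearse restores rather than guess.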
Locating and identifying all the data
A data inventory is essential for data recovery. To recover effectively from any disaster, the recovery planners first need to know where the company data is, how important it is, what its security implications are, and how it is currently backed up. The term ‘Data’ includes what is in relational databases, old access databases, filing cabinets, notebooks, and even inside the heads of individual employees. It could include unlikely dependencies, such as a sealed envelope of passwords locked in a steel cabinet. If it’s essential for the running of the business, then it’s within the scope of recovery planners. All important data feeds need to be documented, together with the location of the secure storage of encryption keys, IDs, and passwords.
To double-check that nothing is missing, it’s a useful exercise to identify and track down the main business processes so you know you aren’t forgetting an obscure but vital system. Having discovered where all the data lives, we need to record the current data recovery strategy for the data and the quality of the production documentation. It is useful to check whether there is a data-retention policy for this data. Sometimes, you can waste time trying to restore data after a disaster that, by rights, should not even be held by the company any longer.
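A data inventory can start life as a simple structured record. The Python sketch below is one possible shape for it, not a prescribed format: the field names, the classification labels, and the needs_attention helper are all assumptions made for illustration. It flags essential assets that have no recorded backup method or no known retention policy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DataAsset:
    name: str
    location: str
    essential: bool                # is it needed to run the business?
    classification: str            # e.g. "public", "confidential" (hypothetical labels)
    backup_method: Optional[str]   # None if no backup is recorded
    retention_days: Optional[int]  # None if no retention policy is known


def needs_attention(inventory):
    """Names of essential assets lacking a backup method or retention policy."""
    return [
        a.name
        for a in inventory
        if a.essential and (a.backup_method is None or a.retention_days is None)
    ]


if __name__ == "__main__":
    inventory = [
        DataAsset("payroll database", "server room A", True, "confidential",
                  "nightly full backup", 2555),
        DataAsset("password envelope", "steel cabinet", True, "secret", None, None),
        DataAsset("canteen menu", "shared drive", False, "public", None, None),
    ]
    print(needs_attention(inventory))
```

Even non-digital items, such as the sealed envelope of passwords, can be given an entry so that they are not forgotten when the plan is written.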
Service loss: Prevention, detection, and recovery
There are three types of DR planning: corrective, preventive, and detective. The narrowest approach concentrates purely on corrective measures after a disaster, whereas a broader approach also aims at preventive and detective measures. Strictly speaking, data recovery should take the more blinkered approach, though in practice all three are generally combined in one plan, for the simple reason that the same research work can be used for all three types of planning.
It makes sense to correct the most serious vulnerabilities you come across. Without knowing how the company data is currently held and backed up, it’s impossible to work out how to define the corrective procedures (i.e. restoring the data in a timely manner). When checking this, it makes sense to ensure that backups are being taken in a way that makes a restore as quick as possible, and that the data is being stored on a sufficiently resilient platform.
Detective measures are intended to minimize the impact of a disaster once it starts. Fire and flood are the obvious examples, but much can also be done to prevent ‘domino’ disasters, where a component failure causes knock-on failures of other components.
Preventing disaster to data
If the way that the data is currently stored is tempting fate, this must be flagged up and changed. YouTube has graphic footage of the destruction, by flood, of a data center sited comically in an old dried-up river bed in the eastern Mediterranean. I have witnessed a million pounds’ worth of servers being destroyed by fire sprinklers sited directly above them.
Are standard best practices in place for important data? Is the data held on resilient storage such as RAID, with redundant power supplies and server connections? Is it held on virtualized storage? Is the data replicated, clustered, or locally mirrored? Are security arrangements adequate, including intrusion detection and logging, virus protection, and destruction of storage media on disposal to prevent data leakage? Are there provisions for problems with power supply, such as surge protectors and an uninterruptible power supply (UPS), as well as backup generators? Are there enough fire alarms and fire extinguishers?
Ensuring that archiving is appropriate
The backup archiving needs to be checked as well as the live data. Is there the necessary security for the backups? Do the retention of the data and the encryption of backup media comply with government and industry regulations? Are there reasons to keep backups other than for disaster recovery? Are they used for auditing purposes, or maybe the occasional investigation of accounts? If so, do they need to be restorable to any point in time, or merely to the latest possible point in time?
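The difference between restoring to a point in time and restoring to the latest backup shows up in which backups have to be retrieved. The Python sketch below assumes a common scheme of periodic full backups plus log backups in between. That scheme, the (taken_at, kind) tuple layout, and the function name restore_chain are assumptions made for illustration, not a description of any particular product.

```python
from datetime import datetime


def restore_chain(backups, target):
    """Pick the backups needed to restore to the moment `target`.

    backups: list of (taken_at, kind) tuples, kind being "full" or "log".
    Returns the latest full backup taken at or before the target, followed
    by later log backups up to the first one taken at or after the target.
    If no log backup reaches the target, the chain stops at the last log.
    """
    fulls = [b for b in backups if b[1] == "full" and b[0] <= target]
    if not fulls:
        raise ValueError("no full backup at or before the target time")
    base = max(fulls)
    chain = [base]
    for log in sorted(b for b in backups if b[1] == "log" and b[0] > base[0]):
        chain.append(log)
        if log[0] >= target:
            break  # this log backup covers the target time
    return chain


if __name__ == "__main__":
    backups = [
        (datetime(2024, 1, 1, 0), "full"),
        (datetime(2024, 1, 1, 12), "log"),
        (datetime(2024, 1, 2, 0), "full"),
        (datetime(2024, 1, 2, 6), "log"),
        (datetime(2024, 1, 2, 12), "log"),
    ]
    for taken_at, kind in restore_chain(backups, datetime(2024, 1, 2, 9)):
        print(taken_at, kind)
```

If only the latest possible point is ever needed, the archive can be much simpler; if auditors may ask for any point in time, every link in such a chain has to be retained and retrievable.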
Off-site archiving and stand-by servers
There are several approaches to using off-site resources for recovery. Traditionally, backup computer centers have merely held tapes securely in the right environmental conditions. This is still done, often using hard drives. In addition, backup computer centers can play the part of disaster recovery provider, by maintaining ‘hotsites’ which maintain hot standbys via log shipping or replication. These providers increasingly use ‘cloud computing’. Whereas this is fine for internet services, it is less useful for corporate intranet applications within particular sites because this type of provision assumes that sufficient public communications infrastructure will remain intact after a disaster, which is doubtful.
Offsite archiving of backups is essential, even if in practice, the data may never be required.
The biggest problem with offsite backup comes when the data is suddenly required. For the recovery process to be timely, the data must be retrieved quickly in an emergency. One offsite backup ought, therefore, to be kept as close as possible to the servers so as to avoid delay. The fastest transportation method might be a courier on a motorbike carrying hard drives, were it not for the physical and security risks. However, if public communications are fragmented, then retrieval of a backup over the internet could become impossible. This problem has led to the adoption of ‘hotsites’ that can avoid having to immediately restore local servers by maintaining a hot standby that can, in an emergency, serve the data over the internet as a ‘cloud’ service outside the area of disruption in communications. This is more appropriate for new applications and web services.
More and more data is being transferred offsite via the internet. Electronic vaulting can be used to stream data to offsite backup as it is generated, usually by continuous data protection, log shipping, or replication. Either the backup data should be encrypted, or a secure file-transfer protocol should be used.
The location used for offsite backups should be in the appropriate legislative area for the company (e.g. EU, USA). Location is important for avoiding obvious risks, but it is difficult to predict how disaster will strike. A major bank in the UK once built an expensive data center in the depths of the country, a place unlikely to be hit by a tsunami, earthquake, terrorist act, or civil unrest. It was narrowly missed by a crashing Boeing 747.
Creating a backup computer center to the standards required for corporate data requires considerable investment. Offsite archiving facilities need to be manned and available around the clock, even during public holidays and weekends. You never know when you will suddenly need that offsite backup.
Offsite backup should have security measures commensurate with the importance of the data and the regulations that apply to it. Ideally, the premises should be inconspicuous, in an unmarked building away from city centers. Security should be organized with defense in depth, with at least two or three layers of access control using badges or biometric authentication, video surveillance, and security guards. Least-privilege access has to be enforced in all locations. All the client assets should require dual-custody access, with audit trails. All employees and contractors should have had criminal-record checks.
If data is delivered on physical storage media, effective and secure pickup and delivery procedures need to be in place with recordkeeping. Any attempt at unauthorized access to the data must be logged and reported. There should be routine precautions against data ‘leakage’, such as can happen with the careless disposal of redundant storage devices.
If your company is multi-site, then archiving data at another of its sites would seem an obvious approach. Maintaining a ‘hotsite’ in each major location is certainly the best way of recovering from many disasters, but does not entirely remove the need for an offsite backup center, particularly if there is widespread geographical disruption. Also, if the location is too obviously part of the organization, it might not avoid disasters such as terrorist attack. You can outsource the offsite backup or enter into a reciprocal arrangement with a company with a similar-sized data requirement. ‘Cloud’ storage is fine as long as it meets security requirements.
All organizations should have a Business Continuity Plan. As part of this plan, there should be a Disaster Recovery Plan that tackles data recovery from a whole range of potential disruptions and disasters. The exercise of preparing such a recovery plan will uncover data vulnerabilities that should be corrected as part of the exercise. The data-recovery plans should cover a range of emergencies in a spectrum of severity, and should be phrased as recipes, without assuming special knowledge, so that any competent, available staff can perform the recovery of data services in an emergency. These recipes must be rehearsed regularly and modified in the light of experience. Only then can you be reasonably sure of your ability to recover data in a major disaster.