It was a rough winter for flying in and out of Heathrow as the UK experienced unusually cold and snowy weather. Unfortunately, one of my colleagues and I landed at Heathrow on 10 December during one of these storms. There was just a small amount of snow on the ground, but thousands of passengers were stuck for hours in the airport and on planes as hundreds of flights were cancelled. Many arriving flights could not get to the gates, and the passengers (my colleague was one of them) were left sitting in the planes for four hours or more. The airport just wasn’t ready to deal with the weather. The situation reminded me of how important it is to be prepared for unexpected events, especially those that affect a company’s IT infrastructure, software, and data. Outages can cost the enterprise millions in lost productivity and sales.
There are many types and sizes of events that can cause outages affecting applications used by staff and customers. A small disaster might be the accidental deletion of data from a production database. A large disaster could be a hurricane that destroys an entire datacentre. Many bad things can happen, and it’s important to have a well-designed disaster recovery strategy so that disruptions are minimised.
Disaster recovery is not easy. It requires duplicating resources in a different geographic area, where they often sit idle waiting for said disaster. The IT department should also schedule practice drills to ensure that staff are prepared. Yes, this is an expensive investment, and it’s not always easy to convince management that it’s necessary, especially at small companies with equally small staff and budgets. Moving to cloud services, such as Azure or AWS, can help since redundancy is baked right in. Even so, you still need to configure it properly, understand how it works, and have a backup plan that doesn’t rely on the provider. About a year ago, Amazon experienced a four-hour AWS outage that affected thousands of internet sites, including Netflix and Pinterest. Human error was to blame. Many companies were relying completely on AWS and were helpless until the problem was corrected.
A few days after the snowstorm, the airport was back to normal. Travellers had rebooked on later flights, come up with a different way to travel, or possibly given up their plans. Unfortunately, some businesses never recover after large-scale disasters. I’ve even heard horror stories of companies going out of business because of corruption in just one database that hadn’t been backed up. Stories like these can keep DBAs and system administrators awake at night. Disasters can’t be completely avoided, but you can be prepared to act when they do happen.