To Fly, To Serve, To Fry Your Servers

So, the story goes that an Ops engineer walked into a data center with the necessary pass, a cheery wave and a ‘good morning’. Shortly afterwards, he made history. At around 8.30 a.m., British Airways’ entire communications system went down at the height of the May holiday, forcing the airline to cancel flights from the UK’s two main airports on the busiest weekend of the year. The disruption continued for days, with about 600 flights cancelled, 75,000 people stranded and an estimated cost to BA of well over £100 million.

The engineer in question, we are told by BA, was a contractor doing maintenance work at Boadicea House, one of two BA data centers near Heathrow airport. Apparently, he disconnected a UPS for about 15 minutes but then returned power to the servers in an “uncontrolled fashion”. He may have “interrupted the automatic switchover sequence” between backup and generator power supplies, causing a surge that damaged servers and shut down the entire data center.

The details of what happened are still sketchy, and the investigation so far has focused on human error. “The engineer was authorized to be on site but not to do what he did,” said Chief Executive Willie Walsh.

The corporate instinct for self-preservation has obscured the shyer instinct to reveal the entire truth, so we can only speculate. However, if you put a person in a position where they can accidentally cause such calamity, then that’s not “human error” but catastrophic process failure. Sound operational practice should make it impossible for a single human error during a power restart procedure to have such catastrophic consequences.

Countless questions have been raised about lack of investment, poor testing, and the state of the hardware at BA’s aging data center, but particularly the failure of their automatic failover and disaster recovery systems. Why did a failure at one data center have such a drastic and long-lasting impact around the world? After all, companies like Google can switch out entire data centers within seconds, as a matter of routine.

There should have been near-instantaneous failover to one of BA’s neighboring Heathrow data centers, Comet House, or to Cranebank. The power surge at Boadicea House may have corrupted some data, which was then synchronized to the secondary data center, causing subsequent failure there too. Was BA’s disaster recovery plan properly tested? Many companies test failover of only certain applications, as part of a controlled, staged process, but neglect to test a real, uncontrolled shutdown. And after the failure of their automatic failover system, why couldn’t BA perform the standard fallback procedure of a manual failover from backups to one of their remote data centers?
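The failure mode described above, where corrupted state replicates to the standby and takes it down too, is why failover logic should verify the health of the target site before promoting it. As a minimal sketch (the site names and function are hypothetical illustrations, not BA's actual systems):

```python
# Hypothetical sketch of health-check-driven failover between a
# primary/secondary pair of sites. Site names are illustrative only.
PRIMARY = "boadicea-house"
SECONDARY = "comet-house"

def choose_active_site(primary_healthy: bool, secondary_healthy: bool) -> str:
    """Return the site that should serve traffic.

    Fail over to the secondary only when the primary is down AND the
    secondary has independently passed its own health check; blindly
    promoting a standby that has synchronized corrupted data just
    spreads the outage.
    """
    if primary_healthy:
        return PRIMARY
    if secondary_healthy:
        return SECONDARY
    # Neither site is trustworthy: stop automated action and escalate.
    raise RuntimeError("No healthy site available: invoke the manual DR plan")

# Primary down, secondary verified healthy -> route traffic to secondary.
print(choose_active_site(False, True))  # comet-house
```

The key design choice is the final branch: when neither site passes its checks, the automation refuses to guess and hands control to a human-driven disaster recovery procedure, which is exactly the path that must be rehearsed in advance.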

Proper disaster recovery planning and testing involves substantial risk, cost and manpower, but it is the only way to be sure that recovery will work when you need it to. Even in the face of the most unpredictable set of circumstances, your organization is obliged to ensure that essential business processes can continue working. BA failed to do that.

Commentary Competition

Enjoyed the topic? Have a relevant anecdote? Disagree with the author? Leave your two cents on this post in the comments below, and our favourite response will win a $50 Amazon gift card. The competition closes two weeks from the date of publication, and the winner will be announced in the next Simple Talk newsletter.