I have recently attacked High Availability over on SysAdmin-Talk by illustrating seven ways that it can help you to fail in some pretty spectacular ways, and I’ll soon be enumerating a number of career pitfalls that HA will not help you to avoid. Yet the more I dig into the murky swamp that is High Availability and data management terminology, the more I find misconceptions lurking like sinkholes to consume the unwary. Thus, I offer myself up as your guide through this hazardous territory, where an honest misunderstanding could land you in rather hot water when you least want it. While I’m all for High Availability if correctly applied and understood, some of the nonsense I read about it brings me out in a rash, so I’m not done shredding the false notions of high availability yet! In this chapter of my continued brutality against HA misunderstanding, I’ll explain seven things that high availability is not.
I’ll say up front that you may sense a recurring theme in these points fairly quickly, and that has something to do with the nature of the misconceptions that abound. Nevertheless, get some tincture of iodine ready because this might sting a bit.
- High availability is not a data backup. Backups are archived sets of data extending backward on a timeline. You can rely on them to restore historical data in the case of corruption or accidental deletion, or to resolve disputes. High availability, as a concept, is concerned only with what is happening at this immediate instant. It has no recollection of what happened mere milliseconds in the past; its whole world is the infinitely slim moment of time that is the ephemeral “now”. Consider RAID for a moment. Most SysAdmins will agree that RAID is not a backup, and will swat a Jr admin with a rolled-up newspaper for not adding a new server to the backup system merely because “Hey! It’s RAIDed!” High availability is not a substitute for backups and a system of archiving those backups.
- High availability is not a Disaster Recovery Plan. Also known as an IT Service Continuity Management (ITSCM) plan if you’re feeling loquacious or are an ITIL fan, Disaster Recovery (DR) admittedly has a slightly varied meaning depending on where you get your definition from. Some say it involves the policies and measures necessary to recover a business’s IT systems from a state of disaster to the operational state prior to the disaster. Others say it consists of the policies and technology needed to merely continue operating after a disaster (with little mention of recovering to the full resiliency that existed before the disaster). Either way, High availability systems are not, in themselves, a DR plan.
For example, if the HA solution is in one geographic location and the disaster in question is one of geography (an earthquake, say), then there is no longer an HA solution. If the HA solution is geographically spread out (geo-clustered, for example) and one site is destroyed, then the HA system merely facilitates a continuation of whatever business systems are running on it. However, depending on which definition of DR you subscribe to, the business is still in a state of emergency, and will not return to a fully normal state until the HA system has a failover partner brought online and synced (you can guess which definition I prefer).
HA itself does not help the return to normalcy. In a way, it’s like an airbag deployed in a car accident; it saves you from worse consequences, but doesn’t help repair your car (or body) and get you back on the road. Furthermore, HA does not protect against all disasters. A disaster concerning data integrity is one situation on which HA clearly has no positive effect: you’ll merely have a highly available corrupted database. To recover from that, you’ll need backups (see the first point above). HA can be a part of your DR plan, but it is not the whole plan.
- High availability is not a Business Continuity Plan (BCP). If high availability does not equal disaster recovery, it is even less adequately described as a Business Continuity Plan. However, we are again entering the murky waters of shifting definitions. BCP has been used in some texts to mean what others would call a disaster recovery plan, but it can also be used to define a concept that is a parent to a DR plan. BCP, like disaster recovery, covers topics both inside and outside of IT’s realm, yet BCP is more all-encompassing than DR. As IT people, we can get a bit techno-centric, especially if we work for (or own) a small business, but technology != business, and as such, protecting it does not necessarily protect the business as a whole. High availability, even if properly implemented, plays only a small part in that larger plan. Don’t think that, just because your systems are mirrored, replicated, protected, backed up and ready for nuclear warfare, your business therefore has a BCP. Personnel management, location preparation, authority chains, resource management and a myriad of other factors need to be taken into account, documented and disseminated before you can be said to have a BCP. Systems humming along beautifully are of no use if the people that use them are flailing around in a panic with no idea of how to handle the emergency. Disaster recovery is techno-centric; Business Continuity Plans are business-process-centric.
- High availability is not a Data Protection Plan. A data protection plan, or DPP, is a complex equation that takes into account the Recovery Point Objective (RPO: how much data you’re willing to lose), the Recovery Time Objective (RTO: how long you’re willing to wait to recover), lost worker productivity and lost revenue due to downtime, among other things. A DPP also assumes a disaster that is bad enough to require data restoration. Your cluster can help prevent the need to enact your DPP in some circumstances, but it does not replace a DPP. A DPP is still necessary for inconsistent or accidentally deleted data. Remember, HA does not protect the data; it protects the service’s availability, and nothing more.
- High availability is not your Risk Management Plan (RMP). Once again, HA is certainly a part of a risk management plan, but it can mitigate only certain risks. HA does nothing to mitigate the risks of data inconsistency, a lack of security or shoddy compliance. Are you seeing the trend here? HA is about service availability, but the data manipulated by that service gains no benefit other than its high availability. Okay, there might be one exception: the data has a higher degree of protection from the inconsistency born of an abrupt shutdown. Other than that, the data is unprotected by HA.
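The recurring theme in the points above (HA keeps the service up, but only backups remember the past) can be shown with a toy model. This is only a sketch under stated assumptions: `MirroredStore` and `BackupArchive` are hypothetical classes invented for illustration, not the API of any real clustering or backup product, and the "mirror" is just two Python dictionaries kept in lockstep.

```python
import copy


class MirroredStore:
    """Hypothetical HA pair: every change is applied to both nodes at once."""

    def __init__(self):
        self.primary = {}
        self.replica = {}

    def put(self, key, value):
        self.primary[key] = value
        self.replica[key] = value

    def delete(self, key):
        # The mirror has no memory: a deletion is replicated just as
        # faithfully as a legitimate write.
        self.primary.pop(key, None)
        self.replica.pop(key, None)

    def fail_over(self):
        # The replica takes over as primary; the service stays available.
        self.primary, self.replica = self.replica, {}


class BackupArchive:
    """Hypothetical backup system: labelled point-in-time copies."""

    def __init__(self):
        self.snapshots = []

    def take(self, label, data):
        self.snapshots.append((label, copy.deepcopy(data)))

    def restore(self, label):
        for name, data in self.snapshots:
            if name == label:
                return copy.deepcopy(data)
        raise KeyError(label)


def demo():
    store = MirroredStore()
    archive = BackupArchive()

    store.put("invoice-42", "paid")
    archive.take("monday", store.primary)

    store.delete("invoice-42")  # accidental deletion, mirrored instantly
    store.fail_over()           # failing over does not bring the record back

    survived_failover = "invoice-42" in store.primary
    restored = archive.restore("monday")
    return survived_failover, restored.get("invoice-42")
```

In this sketch `demo()` reports that the record did not survive the failover but can still be read from Monday's snapshot, which is the whole point: the mirror is about "now", the archive is about "then".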
And let’s pick on some possible misconceptions of high availability systems within some of ITIL’s Service Delivery category:
- High availability is not your Service Level Management Plan. Practitioners of ITIL, straighten up. If the only Key Performance Indicator (KPI) that you’re required to monitor is “uptime”, then perhaps functioning HA can be the extent of your SLMP. All of those individuals whose only KPI is uptime, please stand up! Exactly. Service level management usually includes at least a few quality of service metrics, upon which high availability has no effect. SLMPs are comprehensive arrays of metrics of which service uptime is only one.
- High availability is not your Availability Management Plan. “Wha… what?!” some may be saying. If HA is all about service availability, then how can it not be your service availability plan? From both a practical and an ITIL standpoint, redundancy and failover are but parts of service availability. They might also be smaller parts than you first realize. Other factors that play a role in determining your service availability include recoverability (can the cluster nodes come back online quickly and accurately?), security (is the cluster vulnerable to exploits or physical harm?), and maintainability (is it as easy to administer as doing the Boogaloo on icy marbles?), to name just three. Don’t try to take shortcuts around your availability initiatives, and respect the ITIL suggestions. Thorough Availability Management Plans include more to worry about than just service uptime.
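To put a rough number on the recoverability point, here is a minimal sketch using the common steady-state formula, availability = MTBF / (MTBF + MTTR). The MTBF and MTTR terms and every figure below are assumptions brought in for illustration; none of them come from the ITIL material discussed above.

```python
HOURS_PER_YEAR = 8760.0


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability from mean time between failures (MTBF)
    and mean time to repair (MTTR), both in hours."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_hours(avail, hours_per_year=HOURS_PER_YEAR):
    """Expected hours of downtime per year for a given availability."""
    return (1.0 - avail) * hours_per_year


# Two hypothetical clusters that fail equally often (every 1000 hours on
# average) but differ in how quickly their nodes come back online.
quick_recovery = availability(mtbf_hours=1000, mttr_hours=1)
slow_recovery = availability(mtbf_hours=1000, mttr_hours=10)

if __name__ == "__main__":
    for name, a in (("quick", quick_recovery), ("slow", slow_recovery)):
        print(f"{name}: {a:.4%} available, "
              f"{annual_downtime_hours(a):.1f} h down per year")
```

Both clusters have the same redundancy and fail just as often, yet the slow-to-recover one spends roughly ten times as long offline each year. Redundancy alone does not settle the availability figure.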
Yes, I killed the high availability horse early on in this series of articles, and yes, I continued thrashing it well after it had expired. However, I’d rather be accused of over-emphasising such an important point than thought to be ignoring it. I hope I’ve put high availability in its proper perspective, and also elevated many of the surrounding considerations that need to be taken into account when making a system highly available.