High Availability or High Recoverability?

Having pierced the veil of confusion surrounding High Availability, Wesley David finds himself asking (and being asked) whether HA is worth the money it burns through. Perhaps it’s more cost-effective to have a recovery process that moves like greased lightning?

For several months I laid siege to High Availability. It took three brutal articles before I could pen the fourth, supposedly “upbeat” installment in the series, and even then I could only speak of one lonely thing that high availability is good for. After the final article, my cannonade ceased. The smoke cleared. The grisly battlefield and its aftermath came into focus. An uneasy détente existed merely by virtue of all known ammunition having been expended. And then it happened.

I collated all four articles in a blog post at my sysadmin blog, and asked for feedback from my readers. Now that the entire series was finished, it could be taken in at one sitting and any weaknesses in my thoughts would be more visible. My request was granted; a spark of feedback landed on a previously unseen powder keg, the detonation of which heralded a new skirmish: that of “High Recoverability”.

Rumblings of War

It started when Steve the Hedgehog made a pointed observation concerning his own workplace:

“It’s far cheaper for the company to pay the Technology Services team to redeploy a server and its work load than it is to run a second instance.”

The larger idea of his post was that his company simply does not use high availability systems very much, because it can weather the downtime without significant detriment to the business. Does that mean that their business isn’t important? Do they not deal with rush customers? Do they not have an important market segment? Not at all. Or rather, it’s not so much a matter of unimportant business as it is a matter of simple economics.

Further on in the thread, our spiky commenter reveals his specific scenario. In summary, it costs £300 in manpower to recover one of their servers, compared to £5,000 to £10,000 to implement an HA solution. As long as lives and mass quantities of currency don’t hang in the balance, why implement HA when rapid redeployment methods are already in place, many of which also serve even greater purposes than recoverability alone?

Count the Cost of Building your Siege Towers

1332-siege.png

I think high availability has been thoroughly pulverized in previous articles. However, a few points must be made about the cost-versus-value of HA in order to really extol the advantages of so-called high recoverability.

To start with, Steve’s comment regarding the costs of HA was supported by similar thoughts from Greg “tsykoduk” Nokes and Barry Morrison. Specifically, Greg suggested that valuing HA is a simple matter, starting with estimating the cost of a full day’s business outage. You’ll have to rely on your own byzantine formula to estimate that dollar amount, since different industries have differing seasons and patterns of cash flow. From those numbers, you can determine how much it is reasonable to spend on an HA solution. Let’s say that two to three times the cost of the outage is generally reasonable. That amount will often not be enough to completely protect your service with HA systems, unless you’re a major retailer (or, as Greg points out, if lives are in the balance – as they were where he was once employed – you must always work to make everything as highly available as possible).
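
To make Greg’s rule of thumb concrete, here is a minimal back-of-the-envelope sketch of that budgeting exercise. Every figure is an invented placeholder; plug in your own estimates of lost revenue and idle staff costs, and whatever multiplier you consider defensible.

```python
# Back-of-the-envelope HA budgeting, following the "two to three times a
# day's outage" rule of thumb. Every figure below is a hypothetical placeholder.

lost_revenue_per_day = 12_000     # sales that simply won't happen during an outage
idle_staff_cost_per_day = 3_500   # wages paid while people can't work
recovery_labour_cost = 300        # manual redeployment cost, as in the forum thread

one_day_outage_cost = lost_revenue_per_day + idle_staff_cost_per_day + recovery_labour_cost

# Two to three times the cost of the outage is generally a reasonable HA budget.
low_budget, high_budget = 2 * one_day_outage_cost, 3 * one_day_outage_cost

print(f"One day of downtime costs roughly £{one_day_outage_cost:,}")
print(f"A defensible HA budget is therefore £{low_budget:,} to £{high_budget:,}")
```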

Why focus on just a single day’s outage and not two or three days? I believe that, if a business cannot recover a system – any system – within a single 24-hour period, then there are larger problems afoot. With today’s technologies, including many free solutions, any system should be back up and running, in decent condition and with recent data, within one day.

Sure, it might take an all-nighter. Yes, it might cost some overnight shipping fees. However, it should be doable. If not, find out why not and fix those issues, because they are much more broken than any lack of clustering or failover nodes. *Ahem.* Getting back to the point… high availability systems are uni-tasking, monolithic cash bonfires.

A Buzzword is Born! “High Recoverability”

I admit it. I made that term up out of whole cloth. I’ve never heard it before, and will likely never hear it again, but it’s a facetious pseudo-buzzword that will hopefully get you to think about recoverability as a possible substitution for high availability.

The goal of “High Recoverability” is to ensure that any system can be brought back online under most disaster scenarios, with reasonably recent data, in just a few hours. We’re not talking about recovering from the proverbial meteor strike, nor are we considering data that cannot stand to have the latest few minutes or hours permanently lost. If either of those things is true, then you really do need a high availability solution, and should probably be designing it right now rather than reading this article.

Already on a War Footing

I posit that you and your organization already have the makings of a High Recoverability system. Oftentimes the very products that you can use to make recoverability faster are things that you already have, and that can multitask and be very cost-effective in a much wider context. If you don’t have many or any of those tools, then I’d offer that as proof that your basic technology infrastructure has major problems which extend well beyond any kind of disaster recovery or high availability plan.

In order to make a techno-environ that makes your various systems highly recoverable, we’ll need a handful of strategic battlements:

  • Recovery hardware (or virtualware) and/or good support contracts
  • Slick OS deployment
  • Good patch management
  • Automated software deployment
  • Application configuration backup and restoration
  • Data backup / restoration
  • Some spit and polish in the form of scripts to make it all shine

Each item is listed in roughly the order of deployment when recovering a system, though not necessarily the order of importance concerning the data involved. Also note that this list focuses on server-based systems, not appliances or other awkward hardware. If you want a highly recoverable IDS, environmental monitor or other appliance-based device, you can use the basic principles outlined here, but you will likely have fewer options to choose from. Get your war face on, dear friends, because we’re about to go into the breach.

Prime the Cannons!

1332-cannons.png

Recovery hardware is where we must begin if we’re considering a total machine failure. Determine how you are going to procure a platform for the OS and applications to be restored to. If you have some spare machinery around, then clean it up and test it out. If it passes muster, then keep it on cold or warm standby.

If you use virtualization in your environment, determine if you can spare the storage, CPU, RAM and other resources for the additional workload. You might even be able to temporarily place a formerly-physical load onto your virtualization platform.

If the above two options aren’t available, then make sure you’ve got a good warranty on the server. Perhaps you have next-day service and can have a replacement in your building in mere hours. In the absence of spares, virtualization or a warranty, emergency processing and overnight delivery will be your only recourse. Check with your accounting department to see if they can spare the many hundreds (thousands?!) of extra denarii you’ll need.

If you have none of the above options open to you, then you are officially cleared for a panic-driven meltdown. Seek some way of remediating that situation as soon as possible, because you can go no further. You must have a platform to rebuild your fallen server on.

Let Slip the Dogs of War!

1332-knights.png

OS deployment by hand is for puppies. You’re a big, bad, junkyard SysAdmin. Puppies spend their time shuffling around the server room with installation CDs – you don’t need no stinkin’ CDs! Choose an OS deployment option that is resilient, full of possibilities and, most importantly, as close to no-touch as possible. Automation is everything, and there are so many options to choose from in this field that I hardly know where to start.

First, if you already have a backup product, research its capabilities. If you don’t have a backup product, then you have permission to panic. Does your backup system do image-based or just file-based backups? Is it application aware and, if so, is it properly backing up that data? Can it perform P2V restorations? How about bare metal restores? Restores onto disparate hardware? Some of these features span many different bullet points in the list to achieve high recoverability, so there’s a lot riding on your answers to these questions. Figure out your backup solution, and see if it can help in the deployment of base OS images, or even production images, applications, data, databases, etc.

If you don’t have a backup solution that can help you very much, don’t fret! You can use image deployment tools to drop a fresh base image onto the hardware and build it up from there. Look into WDS or the larger SCCM for your Windows machines, or something like FAI for Linux machines. Some open source image management tools like CloneZilla or FOG might be of help.

One way or another, you must find a way to deploy an image onto a new piece of physical or virtual hardware. Preferably, that image will be one of the backed-up production server itself. However, you can get by with plain images (hopefully prepopulated with correct drivers and updates). In a best-case scenario, your backup tool will have the ability to protect an OS, together with applications and data, and then recover it all to new hardware. If that’s the case, you might not need to go much further in this write-up, but read on anyway for some general tips.
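
To illustrate the “no-touch” idea in its crudest form, here is a small sketch that grabs the newest raw image from a repository and streams it onto a replacement disk. The image directory and target device are invented placeholders, and a real deployment would normally go through your imaging tool of choice (WDS, SCCM, FAI, CloneZilla, FOG) rather than raw dd.

```python
#!/usr/bin/env python3
# Hypothetical sketch: restore the newest raw disk image onto a replacement drive.
# IMAGE_DIR and TARGET_DISK are placeholders; prefer your imaging tool's own
# restore mechanism if it has one.
import pathlib
import subprocess

IMAGE_DIR = pathlib.Path("/srv/images/webserver01")   # assumed repository of nightly images
TARGET_DISK = "/dev/sdb"                              # assumed blank replacement disk

# The newest image (by modification time) wins.
latest = max(IMAGE_DIR.glob("*.img"), key=lambda p: p.stat().st_mtime)
print(f"Restoring {latest} to {TARGET_DISK}")

# dd is crude but universally available; a large block size keeps throughput sane.
subprocess.run(
    ["dd", f"if={latest}", f"of={TARGET_DISK}", "bs=4M", "conv=fsync", "status=progress"],
    check=True,
)
```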

It’s Just a Flesh Wound! It’ll Patch Up!

Let’s talk about good patch management, and I don’t mean extending the life of your weekend overalls. All OSs need to be patched, and even just a few weeks of negligence typically means that the outstanding updates and hotfixes can be a bit overwhelming. One can manage the patches on each server locally or, better yet, network-wide from a central console. This latter option allows for reporting as well as a better method of selecting which patches are to be deployed and which ones should be held back. More to the point, speed is of the essence, and hand-patching is not fast.

Depending on the applications that you are trying to restore after a crash, the patch management system could actually come into play at several points on the recovery timeline. Sure, your OS needs to be patched, but so will any applications. If you already have the patch configuration information for each OS and application saved in a patch management system, you won’t need to worry about which patches to apply and which ones to avoid. It just happens like magic!

One way or another, a good network needs patch management. SCCM, PDQ Deploy, IBM Tivoli Endpoint Manager, Lumension Endpoint Management and Security Suite… whatever you prefer. Whatever weapon you’ve got, just wield it often and wield it with skill.
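
Even a tiny audit script beats hand-checking patch levels in the middle of a recovery. The sketch below simply lists the packages a freshly restored Debian or Ubuntu box still needs; it assumes apt is the package manager and that your patch management system proper takes over from there.

```python
#!/usr/bin/env python3
# Hypothetical sketch: report outstanding package updates on a Debian/Ubuntu host
# so a recovery checklist can confirm the restored server matches the patch baseline.
import subprocess

# "apt list --upgradable" prints one line per package that has an update available.
result = subprocess.run(
    ["apt", "list", "--upgradable"],
    capture_output=True, text=True, check=True,
)

pending = [line for line in result.stdout.splitlines() if "/" in line]
if pending:
    print(f"{len(pending)} packages still need patching:")
    for line in pending:
        print("  " + line)
else:
    print("No outstanding updates; this box matches the patch baseline.")
```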

Deploying the Foot Soldiers

1332-footsoldiers.png

Automated software deployment not only helps shorten disaster recovery time, but also helps free up your daily schedule from manual updates of Adobe Reader (assuming you’ve taken away admin rights from everyone). Your next recovery goal in the outline above is to get your important applications back online. After all, the front line of the corporate battle is fought using applications. Everything else is just supporting infrastructure. As you might have guessed, how you deploy the business’ applications can vary just as much as the methods used in the other processes we’ve looked at.

In some cases it might be best to simply deploy the applications by hand. In others, your backup software might be able to restore them automatically. Yet another option is an application deployment product like SCCM from Microsoft (for your Windows machines, obviously). Thankfully, deploying packages is fairly easy in Linux… provided you have a decent package in the first place.
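
On the Windows side, “no touch” usually boils down to silent installer switches driven from a list. Here is a minimal sketch of that idea; the share path and package names are invented, and a real environment would normally push these through SCCM, PDQ Deploy or Group Policy rather than a loop on one machine.

```python
#!/usr/bin/env python3
# Hypothetical sketch: reinstall a server's application stack from silent installers.
# The network share and package names are placeholders for illustration only.
import subprocess

INSTALLER_SHARE = r"\\fileserver\deploy"   # assumed share holding your installers

# Each entry: (MSI file, silent-install arguments).
PACKAGES = [
    ("LineOfBusinessApp.msi", ["/qn", "/norestart"]),
    ("ReportingTool.msi",     ["/qn", "/norestart"]),
]

for installer, args in PACKAGES:
    path = f"{INSTALLER_SHARE}\\{installer}"
    print(f"Installing {installer} ...")
    # msiexec /i <package> /qn performs an unattended MSI installation.
    subprocess.run(["msiexec", "/i", path, *args], check=True)

print("Application layer restored; configuration and data come next.")
```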

It’s worth noting that many enterprise patch management tools (like the ones mentioned above) also have application deployment features. Or rather, it’s more accurate to say that many application deployment tools also have patch management features. If you have one, you likely have the other. Products that multi-task are a good thing, especially when in meetings with the heads of the financial department (or as I affectionately refer to them, Yon Berzerkers of Currency).

Hardly a network of more than a few PCs is without some form of application deployment. Perchance you have heard of a trifling little system called “Active Directory?” My point is that, if a significant amount of your time as an admin is spent scuttling around the office installing software for people, then you have a serious problem that needs to be solved regardless of any “high recoverability” hopes.

Once again, speed is everything, and automatic deployment / centralized management is fast. The best application deployment tool is one that makes no-touch deployments possible and relatively easy. “No touch?!” you bellow. Yes, I know that’s a little tricky at times, but it’s made all the easier by our next point…

Sharpening your Swords

“Application configuration backup and restoration” – doesn’t that sound absolutely riveting? And yet, paradoxically, life without it is rather exciting. Nothing triggers 180 heartbeats per minute like uttering the phrase “Microsoft Dynamics shouldn’t be acting like this!” Merely dropping the application onto the server leaves you with a rather blunted implement. Your whetstones are application-specific configuration backups.

All of the peculiar settings and changes you need to make to the configuration of an application may be possible to reproduce in an automated fashion. Options can be recorded in your application deployment tool, or inside of configuration files that are later applied to the installation. In some sorry cases, applications just can’t have their configurations backed up easily (shame on the vendor! Make sure to call and request that feature).

If a stubborn application that cannot have its options backed up is your plight, then you should at least maintain a thorough set of documentation that records each check box and radio button, every path and environment variable, each length of baling wire and strip of duct tape. Is it tedious? Yes. Is it beneficial? Indubitably.

I like to use wiki software and make bulleted, hierarchical lists of each option chosen on each tab of each dialog box. I comment each line of each configuration file and make sure that they have a thorough introduction header. As a result of this tedium, reproducing my environment is now that little bit easier, even with inflexible applications.

Whatever you do, make application deployment, configuration included, as automated as possible. For example, write scripts that back up your various applications’ configurations daily.
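
As one illustration of that daily habit, the following sketch copies a handful of configuration files and directories into a dated archive. Every path here is an invented example; substitute whatever your own applications actually use.

```python
#!/usr/bin/env python3
# Hypothetical sketch: nightly snapshot of application configuration files.
# CONFIG_PATHS and BACKUP_ROOT are placeholders for illustration.
import datetime
import pathlib
import shutil

CONFIG_PATHS = [
    "/etc/nginx/nginx.conf",      # example: web server configuration
    "/etc/postgresql/15/main",    # example: database configuration directory
    "/opt/crm/conf",              # example: line-of-business application settings
]
BACKUP_ROOT = pathlib.Path("/srv/config-backups")

dest_root = BACKUP_ROOT / datetime.date.today().isoformat()
dest_root.mkdir(parents=True, exist_ok=True)

for source in map(pathlib.Path, CONFIG_PATHS):
    if not source.exists():
        print(f"Skipping {source}: not present on this host")
        continue
    dest = dest_root / source.name
    if source.is_dir():
        shutil.copytree(source, dest, dirs_exist_ok=True)
    else:
        shutil.copy2(source, dest)
    print(f"Backed up {source} -> {dest}")
```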

Lock and Reload!

1332-reload.png

Your application is now installed, patched and tweaked, but there is still no business data in it. You must now perform the restoration of the data that you undoubtedly backed up sometime before the server crash. Depending on your application, this step might have to take place before the specific configuration changes, or it might even need to be done during the application installation itself.

Of course, the ability to restore data hinges on the ability of your backup scheme to take proper backups. Make sure that your backups are fresh, consistent and not corrupted. One-week-old backups are not usually a happy thing for a CRM database or a product catalog. 15-minute-old databases are acceptable for many situations… unless they’re inconsistent, corrupted and otherwise stomped on. Test your backups! (But that’s a separate discussion.)

At whichever point in the process that data restoration should take place, it is the most valuable part of the equation (from a business perspective, anyway). For some applications it might be as simple as a shell script calling rsync. In other cases it could fall upon the shoulders of a more traditional backup agent to restore the files. The preferred data restoration process will be the one that contains the freshest data and is most automated (and most practiced; you do practice restoring your data, right?).
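
For the simple rsync case, the restore wrapper can be as small as the sketch below. The backup source and destination are invented placeholders; the point is that the freshest copy comes back with one repeatable command rather than a hunt through the backup console.

```python
#!/usr/bin/env python3
# Hypothetical sketch: pull application data back from a backup host with rsync.
# BACKUP_SOURCE and RESTORE_TARGET are placeholders for your own copies.
import subprocess

BACKUP_SOURCE = "backups@backuphost:/srv/backups/crm/latest/"   # assumed nightly copy
RESTORE_TARGET = "/var/lib/crm/data/"                           # assumed application data dir

# -a preserves permissions, ownership and timestamps; -v lists the files;
# --delete removes anything the restored application should no longer have.
subprocess.run(
    ["rsync", "-av", "--delete", BACKUP_SOURCE, RESTORE_TARGET],
    check=True,
)
print("Data restore complete; verify application consistency before going live.")
```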

Now that the data is back in its proper place, you’re almost ready to enter the fray.

Forward the Script Brigade! Charge for the Servers!

Python to the left of you. Perl to the right of you. PowerShell in front of you. I’ve mentioned scripts throughout this article, and they are the glue that keeps things tightly sealed – the twine that holds it all together. Most importantly, they save vast amounts of time, and ensure that you don’t forget to perform your lengthy list of checks, tests, tasks and other miscellany.

The final step is all of the port-checking, setting-analyzing, gateway-pinging, IP-address-auditing, account-confirming, site-probing gallimaufry that needs to be inspected and set aright. This can include making sure that you have database connections, ensuring that you’re visible on the network, scanning log files for certain text strings that spell early trouble, setting SNMP strings, changing paging files, turning on remote desktop or VNC servers – anything that could be considered a finishing touch on a server installation that is about to be rolled into production. Much of this could also be done in your enterprise management software, like SCCM or Puppet.
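
A fair number of those finishing checks are easy to script. The sketch below is a minimal post-recovery smoke test: it confirms a couple of TCP services answer, pings the default gateway, and scans a log file for obvious trouble. The hostnames, ports, gateway address and log path are all invented placeholders.

```python
#!/usr/bin/env python3
# Hypothetical sketch: post-recovery smoke test for a freshly rebuilt server.
# Hosts, ports, gateway address and log path are placeholders for illustration.
import socket
import subprocess

CHECKS = [("localhost", 5432), ("localhost", 443)]   # e.g. database and HTTPS
GATEWAY = "192.0.2.1"                                # placeholder gateway address
LOG_FILE = "/var/log/crm/app.log"
BAD_STRINGS = ("FATAL", "cannot connect", "out of memory")

failures = []

# Confirm the important services are accepting connections.
for host, port in CHECKS:
    try:
        with socket.create_connection((host, port), timeout=3):
            print(f"OK: {host}:{port} is accepting connections")
    except OSError:
        failures.append(f"{host}:{port} refused or timed out")

# One ping is enough to prove the default gateway is reachable.
if subprocess.run(["ping", "-c", "1", GATEWAY], capture_output=True).returncode != 0:
    failures.append(f"gateway {GATEWAY} is unreachable")

# Scan the application log for strings that spell early trouble.
try:
    with open(LOG_FILE) as log:
        for line in log:
            if any(marker in line for marker in BAD_STRINGS):
                failures.append(f"log warning: {line.strip()}")
except FileNotFoundError:
    failures.append(f"log file {LOG_FILE} is missing")

print("\n".join(failures) if failures else "All checks passed; ready for production.")
```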

However you do it, make sure that you put the finishing polish on the installation so that it’s ready for its battlefield initiation. Once all of that is done, you should have a perfectly working replacement server that is just as good as its fallen predecessor.

“In War There is no Substitute for Victory”

1332-loneguy.png

Alas, poor high availability: a valiant fight does not require victory. As you have seen, it is possible to develop a set of tools that can help you to deploy a new server, applications and all, in mere hours. With proper scripting and a bit of good fortune (in the form of script-aware applications), you could whittle the recovery time down to mere minutes.

If you noticed, none of the tools listed above are anything special. They’re not dedicated to just one system or just one feature set. Each tool can be used for virtually any PC in your building: server, laptop, desktop and even mobile devices. If you fine-tune your systems for rapid recovery, you’ve benefited many different areas beyond just disaster recovery / high recoverability.

Even if you don’t yet have a juggernaut of a tool (like IBM Tivoli or Novell Zenworks), the considerable price of deploying it can more easily be justified than the same dollar amount being spent on a dedicated clustering product. You have a far better chance of getting funding for enterprise resource management systems than single-purpose high availability systems.

Getting Right to the Point

Congratulations on making it this far – you deserve a long-service medal. But before you strap on your sabre and leave for the front, let’s take one last look at your battle plan, without the military jargon:

  • Recovery hardware (or virtualware) and/or good support contracts

    You must have a platform, physical or virtual, to rebuild your fallen server on. Find the spare hardware, provision the extra server resources… just make sure you’ve got the foundations ready when you need them.

  • Slick OS deployment

    Choose an OS deployment option that is resilient, full of possibilities and, most importantly, is as close to no-touch as possible. Automation is everything, and many existing backup solutions support OS imaging and automated deployment. Look into WDS or SCCM for your Windows machines, or something like FAI for Linux machines. Open source image management tools like CloneZilla or FOG might be of help.

  • Good patch management

    Depending on the applications that you are trying to restore after a crash, the patch management system could actually come into play at several points on the recovery timeline. If you already have the patch configuration information for each OS and application saved in a patch management system, then it just happens like magic. Investigate systems like SCCM, PDQ Deploy, IBM Tivoli Endpoint Manager, or Lumension Endpoint Management and Security Suite.

  • Automated software deployment

    In some cases it might be best to simply deploy the applications by hand. In others, your backup software might be able to restore them automatically. Yet another option is an application deployment product like SCCM from Microsoft (for your Windows machines, obviously). Thankfully, deploying packages is fairly easy in Linux… provided you have a decent package in the first place. Even Active Directory is useful in this situation. Once again, speed is everything, and automatic deployment / centralized management is fast.

  • Application configuration backup and restoration

    All of the peculiar settings and changes you need to make to the configuration of an application may be reproducible in an automated fashion. Options can be recorded in your application deployment tool, or inside of configuration files that are later applied to the installation. In some sorry cases, applications just can’t have their configurations backed up easily, and documentation is your last recourse. Whatever you do, make application deployment, configuration included, as automated as possible. For example, write scripts that back up your various applications’ configurations daily.

  • Data backup / restoration

    The ability to restore data hinges on the ability of your backup scheme to take proper backups. Make sure that your backups are fresh, consistent and not corrupted. One-week-old backups are not usually a happy thing for a CRM database or a product catalog. 15-minute-old (verified) databases are acceptable for many situations. Methodologies for restoration will vary; for some applications it might be as simple as a shell script calling rsync. In other cases it could fall upon the shoulders of a more traditional backup agent to restore the files. The preferred data restoration process will be the one that contains the freshest data and is most automated (and most practiced).

  • Some spit and polish in the form of scripts to make it all shine

    The final step is all of the port-checking, setting-analyzing, gateway-pinging, IP-address-auditing, account-confirming gallimaufry that needs to be inspected and set right if awry. This can include making sure that you have database connections, ensuring that you’re visible on the network, scanning log files for certain text strings that spell early trouble… anything that could be considered a finishing touch on a server installation that is about to be rolled into production. Much of this could also be done in your enterprise management software, like SCCM or Puppet. Once all of that is done, you should have a perfectly working replacement server that is just as good as its fallen predecessor.

If you have weighed high availability in the balances and found it wanting – or rather, found your budget wanting – then you have little excuse to not think in terms of high recoverability. Most of the tools you’ll need are ones that you likely already have on hand. It’s probable that they’re not being used to their fullest extent. If you take to the project with alacrity, you’ll speedily come thro’ the jaws of server death, and back from the mouth of downtime hell.

Note: This article has generated a response, Documentation: Shoulda, Coulda, Woulda by Barry Morrison.