Documentation: Shoulda, Coulda, Woulda

Every sysadmin knows that they should be documenting everything they can; not only is it good practice, it can be the difference between a speedy recovery and a lost job when something does go wrong. Barry Morrison delves into just what you should be covering, even if you think you can't spare the time.

It’s often preached but rarely practiced. We all wish ours was better and that we had more of it, and none of us seems to have time for it: documentation.

Wesley David wrote a series of articles on Simple-Talk discussing high availability, disaster recovery, failover and related topics. I found the series after a few articles had been posted, caught up, and followed the remainder.

As I was reading his last article, I kept asking myself, “When is he going to mention documentation? Documentation has to be in here somewhere. I know he’s going to get to documentation.” He did, but no sooner had he mentioned it than he moved on, and the subject was glossed over very quickly. This is by no means a dig at Wesley; his articles are well-written and cover a range of topics, but in my opinion he left out the most valuable piece. That is where this article will hopefully pick up.

The purpose of this article is to go over what I believe to be the documentation a sysadmin needs to make life easier. A large part of it comes from a data center move; the rest comes from experience as a sysadmin, when something bit me and I documented it only after the failure and recovery. Oh, how much easier my life would have been if I had had that documentation beforehand. Honestly, there can never be too much documentation. Much like verbose logging, it’s better to have too much than not enough.

Server Information

This seems simple, or basic, right? We all have our service tag or serial number. These numbers ensure we’re downloading the correct drivers and firmware for updates. They also provide another vital piece of information: a link to warranty and support.

The first thing that support is going to ask for is the service tag or serial number, so they know what they’re supporting. As a part of the server documentation, you should include the level of support (Silver, Gold, Platinum, *enter level here*) as well as the SLA: how soon the vendor is supposed to get parts to you, whether that is the next business day, four hours or even one hour. In my experience, higher levels of support often come with special contact numbers that provide quicker responses, shorter hold times, and higher-level support engineers. Support may also ask who you are, and who the owner (business) of the server is. If you lease, or purchased through a VAR/reseller, this may be your leasing company’s name or the VAR’s name. No matter whose name it is, it’s important that you know it, to ensure a quick resolution by support.

But it doesn’t stop here. As a part of the necessary information for the server, we should also have the MAC address(es) of the NICs, the number of sockets/CPUs/cores, the amount of RAM in the box, the number of hard drives and any RAID configuration(s).
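
One way to keep this information honest is to store it as structured data rather than free text, so that gaps are easy to spot. The following is only a minimal sketch: the `ServerRecord` class, its field names, and the `missing_fields` helper are all hypothetical, chosen to mirror the items listed above, and your own inventory will almost certainly need different fields.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ServerRecord:
    """Hypothetical per-server record; field names are illustrative only."""
    hostname: str
    service_tag: str = ""          # service tag or serial number
    support_level: str = ""        # e.g. "Silver", "Gold", "Platinum"
    sla_hours: int = 0             # parts-on-site commitment; 0 means not recorded
    owner_of_record: str = ""      # your company, the leasing company, or the VAR
    mac_addresses: list = field(default_factory=list)
    cpu_sockets: int = 0
    cpu_cores: int = 0
    ram_gb: int = 0
    drive_count: int = 0
    raid_config: str = ""          # write "none" explicitly if there is no RAID


def missing_fields(record):
    """Return the names of fields that are still empty or zero."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


server = ServerRecord(hostname="db01", service_tag="ABC1234",
                      support_level="Gold", sla_hours=4)
print(missing_fields(server))
```

Running a check like this as part of a build makes undocumented servers visible before support asks for details you don’t have.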

Physical Infrastructure

So we know everything we need to know about the server. That’s great and all, but it doesn’t help if we don’t know where the server is in our data center(s). If you have a small closet with only a few servers, this may not be as important to you as it is to someone who has hundreds or thousands of servers in a very large data center. Information regarding the physical infrastructure should include which rack the equipment is in, and which rack unit the server occupies. This will make your life a lot easier when you go hunting for a downed server and have no idea where it physically exists. If you have the luxury of managed PDUs, you should also know which PDU ports are assigned to the server. If you don’t need to go to the data center, you are likely using a KVM-over-IP solution or the server’s remote management port (DRAC, iLO, ALOM). You should know how to access it: document the IP of the KVM switch or remote-management port, as well as the credentials required to access those services.

Network Infrastructure

Depending on your network infrastructure, you should also document which network port(s) the server is connected to, and whether it is connected to a top-of-rack switch/patch panel or ‘home-run’ back to the server farms/cores. Knowing the physical connectivity is only half the battle. If there are special configurations regarding the NICs (port bonding, failover, etc.), these should be documented as well. If the server is on a specific subnet / VLAN, its IP(s) should be documented along with netmask(s) and default gateway. If there are firewalls on the host or ACLs on the VLAN, these should be documented too. A hostname / description of the host should be forwarded to the networking team so they can label the switch port and know which port(s) they’re working on. The configuration of the switch port(s) could also be valuable (speed/duplex, LACP, etc.).

If you’re lucky (or unlucky, depending on who you’re talking to) enough to have your server connected to a SAN environment, then the SAN/Fibre Channel ports and settings (WWPNs, zones, etc.) should be documented just as the TCP/IP settings are. If a server has had any customization of the HBA, these changes should be documented as well. There’s nothing like swapping out an HBA and finding that it doesn’t work, only to discover later that you had made a special config change for a reason. Been there, done that!
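
Documented network settings are only useful if they are right, and a typo in a netmask or gateway is easy to make. As a rough sketch, a documented IPv4 entry can be sanity-checked with Python’s standard `ipaddress` module. The `check_network_entry` function and the addresses used here are made up for illustration.

```python
import ipaddress


def check_network_entry(ip, netmask, gateway):
    """Return a list of problems found in a documented IPv4 entry.

    Raises ValueError if any of the three values is not a valid address.
    """
    iface = ipaddress.ip_interface(f"{ip}/{netmask}")
    gw = ipaddress.ip_address(gateway)
    problems = []
    if gw not in iface.network:
        problems.append(f"gateway {gw} is outside {iface.network}")
    if gw == iface.ip:
        problems.append("gateway and host IP are the same address")
    return problems


# Example values only; substitute the entries from your own documentation.
print(check_network_entry("10.0.5.20", "255.255.255.0", "10.0.5.1"))
print(check_network_entry("10.0.5.20", "255.255.255.0", "10.0.6.1"))
```

This does not prove the settings match the live server; it only catches entries that could never work as written.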

Operating System

Documentation regarding the operating system should include the OS version, hardware architecture, installed patches and drivers (even details as fine as version and revision level could be valuable), and, if you’re on a Windows server, customizations such as registry hacks or hotfixes.

Anything done to the core OS that isn’t a part of the out-of-box install should be documented. If you have special drivers or .ini/.conf files, anything that’s been changed should be documented, so that if you had to install a new server, you’d know exactly what needed to exist before you installed your application(s). If you require a special version of an application or driver, it would be valuable to record where those files can be found. Something that has proven valuable to me is recording drive letter(s), size, and available space. This is even more valuable if you have iSCSI / Fibre Channel LUNs connected to the server, or mount points on the server. These LUN IDs and/or paths should be included in your documentation.
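
Some of these facts can be collected by a script instead of by hand. The sketch below uses only Python’s standard `platform` and `shutil` modules to capture the OS version, architecture, and the size and free space of a list of volumes. The `os_snapshot` function is a hypothetical helper, and it deliberately does not attempt patches, drivers, or LUN details, which need platform-specific tools.

```python
import platform
import shutil


def os_snapshot(paths):
    """Collect basic OS facts plus size and free space for each given path."""
    snapshot = {
        "system": platform.system(),     # e.g. "Linux" or "Windows"
        "release": platform.release(),
        "version": platform.version(),
        "machine": platform.machine(),   # hardware architecture
        "volumes": {},
    }
    for path in paths:
        usage = shutil.disk_usage(path)
        snapshot["volumes"][path] = {
            "total_gb": round(usage.total / 1024**3, 1),
            "free_gb": round(usage.free / 1024**3, 1),
        }
    return snapshot


# "." is the current directory; on Windows you might pass ["C:\\", "D:\\"].
print(os_snapshot(["."]))
```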

Application / Services

So we know what the server is, where it’s located, and how it’s connected, but what is it serving? The core of what a server does is serve. Whether it’s a file share server, print server, or database server, they all host a service. These services have settings or configurations specific to your environment. Application documentation should include the version of the application, the patch level, and any hotfixes applied. It can also prove valuable to document any required startup/shutdown procedures for applications or services, so that anyone could bring up your service with only your documentation.

Any dependencies, service accounts, and special permissions should also be documented. If you have ODBC/DSN settings for a database, these should be documented as well, along with their software/driver version numbers and the credentials related to the connection(s).
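
Once dependencies are written down, a startup order can be derived from them rather than remembered. The following sketch assumes Python 3.9 or later (for the standard `graphlib` module) and uses invented service names. It is one possible way to turn a documented dependency list into a start sequence, with shutdown being the reverse.

```python
from graphlib import TopologicalSorter


def startup_order(dependencies):
    """Return services in an order where each starts after its dependencies.

    `dependencies` maps a service name to the services it depends on.
    Raises graphlib.CycleError if the documented dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical services, for illustration only.
documented = {
    "webapp": ["database", "cache"],
    "database": ["storage"],
    "cache": [],
}
order = startup_order(documented)
print("start:", order)
print("stop: ", list(reversed(order)))
```

A circular dependency raises an error here, which is itself worth knowing before you are in the middle of a recovery.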

Backup/Recovery

This particular piece actually came up in a discussion last week. How do we restore a server or service? If there are special requirements to back up or restore a service, they should be clearly defined. Your company hires you to back up their data; they fire you if you can’t restore it. If you have pre/post scripts for an application, or exclusions for files that should not be backed up, these should all be included in your documentation. After a server has been brought up, it’s not a bad idea to test that you can back up and restore it.
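
A restore test is more convincing when it compares content, not just file counts. As a minimal sketch, assuming the original data and the restored copy are both reachable as directories, the two trees can be compared by SHA-256 digest. The function names and paths are hypothetical, and the code reads each file fully into memory, so it suits small test sets rather than multi-gigabyte files.

```python
import hashlib
from pathlib import Path


def tree_digests(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            relative = path.relative_to(root).as_posix()
            digests[relative] = hashlib.sha256(path.read_bytes()).hexdigest()
    return digests


def restore_differences(original, restored):
    """Return sorted relative paths that are missing or differ between trees."""
    a = tree_digests(original)
    b = tree_digests(restored)
    return sorted(name for name in a.keys() | b.keys() if a.get(name) != b.get(name))


# Usage (paths are placeholders):
# print(restore_differences("/srv/app/data", "/mnt/restore-test/data"))
```

An empty list means every file in one tree has an identical counterpart in the other; anything else is a file to investigate before you trust the backup.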

Bringing it all together

With all of this, we should have documentation covering almost everything there is to know about our server. This article certainly did not cover everything that could be documented. Every environment is different, and every server is different. But if you do something to the server, it’s a safe bet to document it. As I said at the beginning, it’s better to have too much than not enough. As sysadmins we’re all overworked and underpaid, with too much to do and not enough time, but the time (and hair) lost to a lack of documentation makes it worth spending a few extra minutes to document your build / server / environment.

If you can (and you should) make documentation part of your build process, it’ll become habit, and everyone’s life will be easier in the event of a failure.

Image credits: Jeff Werner (Flickr) and Sugree (Flickr).