Once upon a time, when I was knee high to a help desk position, I witnessed a near fatality. Not of a person, but an organization. An entire business was nearly hewn down by the sickle of shoulda-woulda-coulda.
The organization that I was working for was growing rapidly and stretching the practical limits of its suite of off-the-shelf customer relationship management applications. One person was solely responsible for any application customizations using .NET and tweaking the SQL Server database that the CRM package used.
The database server infrastructure was typical for a small to medium-sized business. There were only two independent SQL Servers serving the entire organization’s RDBMS needs. The reason for having two was simply to split the workload of multiple applications, not to provide any data redundancy or failover. The servers were curated by one SysAdmin who was spread thin across everything that plugged into an electrical outlet and one network admin who handled anything that touched the internet.
The developer responsible for developing and modifying the CRM package constantly had to work with the database, documenting its idiosyncrasies and tweaking it to meet our expanding needs. That database was a maze of twisty relationships (and not just of the foreign key kind, but I’m digressing too far) and none of them were alike. It was a tedious and often frustrating job.
Of course, when working on changes to the application and its data, a test database was used. It was refreshed to contain the latest information from the production database every so often. It wasn’t important to keep test data completely bleeding edge. As long as the table structure was the same as the live database, things were great.
One tedious day, a certain DDL statement was run against the test database. The statement was required in order to shift the structure of the database for a new workflow. We could argue over whether that practice is ever a good idea, but I’m not in the mood to weep uncontrollably. That DDL statement apparently carried some poor logic with it, because it dropped or otherwise destroyed every table in the database. It was a spectacular event; a morbidly beautiful performance. Once the rush of fascination ebbed, however, it was time to reload the test database and try again.
Except, after the developer did a double-take and engaged in a few seconds of eye-rubbing, he saw that the test database was perfectly fine.
Ring! Ring!
Desk phones that ring seconds after a destructive operation has been performed are a telling sign. “Is the server down?” asked one of our chipper customer service reps.
In the course of my career, I have discovered some visual indicators that almost universally spell trouble in an IT department. The first is smoke. The second is flashing lights next to a container of halon. The third, and arguably most dangerous, is a developer in a server room. Especially if the developer is asking about backups.
Within a minute of the fateful DDL statement being executed, the developer, the Senior Systems Administrator and our Network Engineer had converged in the server room. To the credit of that IT department, no one was panicking. No one was shouting. Not one accusatory or angry word was ever heard during or after the ordeal. We had problems to solve.
Defeat Snatched from the Jaws of Victory
“Not a problem! We’ve got backups!” the Systems Administrator said. I watched over his shoulder as he pulled out the rack monitor, clicked a button on the KVM switch and logged in to the SQL Server in question. Of course we had backups! Of course they were automated! Of course things would be fine! This situation would undoubtedly be but a brief recess in the day’s scheduled activities. He investigated the SQL Server instance to determine what the recovery process would be (no, we didn’t have written, much less practiced, disaster recovery documentation).
It was then that the School Marm of Swift Kicks to the Posterior mordantly instructed the IT department in the ignobility of the simple recovery model. Yes, when we checked the settings of the database, we noticed that it was set to simple recovery. That meant there were no transaction log backups to replay on top of a restored full backup; the best we could hope for was a restore to the most recent full backup, which had been taken the night before. A morning’s worth of customer interaction was gone. We were abashed, but at least things could have been worse.
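If you want to check your own exposure right now, the statements below are a minimal sketch of what that looks like on a SQL Server instance. The database name CRM and the backup share path are made up for illustration; your own names and paths will differ.

    -- Check the recovery model of every database on the instance
    SELECT name, recovery_model_desc
    FROM sys.databases;

    -- Switch the (hypothetical) CRM database to the full recovery model
    -- so that transaction log backups become possible
    ALTER DATABASE CRM SET RECOVERY FULL;

    -- A fresh full backup is needed before log backups have anything to chain to
    BACKUP DATABASE CRM TO DISK = N'\\backupshare\CRM_full.bak' WITH CHECKSUM;
    BACKUP LOG CRM TO DISK = N'\\backupshare\CRM_log.trn' WITH CHECKSUM;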
(As a side note, understand that you’re already in a bad situation if you’re finding solace in comparing the current state of affairs with theoretically worse situations. Make special note if the only theoretical situations you can think of involve meteors, aliens and/or bubonic plague.)
It was then that to the School Marm’s injury, insult was added. Some fellow named Murphy barged into the situation and started laughing maniacally while pointing to the backup archives. Each one of the daily backups that we had on hand was inconsistent. Even worse, we only had a handful of archives to check because only a few days’ worth of daily database backups were retained as part of the backup job.
This part of the story involves some hand-waving because I don’t believe anyone ever figured out what was wrong with the backups that had been taken. The size of the backups was apparently fine, so some kind of data existed. It wasn’t the old “the-backups-are-really-just-symbolic-links!” routine that I’m familiar with. We weren’t using any fancy backup program that had gone awry. We simply used the SQL Server Agent to take nightly backups and place them on a file share. Perhaps the corruption was due to recent volume moves as we settled into our new SAN; the SAN freed up space on NAS and DAS appliances, so files and volumes were being shuffled hither and thither. Perhaps it was a bad RAID card trashing files, or a very, very unfortunately located patch of bad sectors. No one knows for sure. It was truly bizarre, and it exemplifies that backups are nothing unless they are verified through proper restoration practice.
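In hindsight, even a couple of cheap checks at backup time would likely have raised the alarm long before we needed the files. The sketch below reuses the hypothetical CRM database and share from the earlier example, and note that RESTORE VERIFYONLY only proves the backup file is readable; it is no substitute for an actual test restore.

    -- Write the nightly backup with checksums so page corruption is caught
    -- when the backup is taken, not months later when it is needed
    BACKUP DATABASE CRM
      TO DISK = N'\\backupshare\CRM_full.bak'
      WITH CHECKSUM, INIT;

    -- Sanity-check the backup file and its checksums
    RESTORE VERIFYONLY
      FROM DISK = N'\\backupshare\CRM_full.bak'
      WITH CHECKSUM;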
Each pair of pupils in the server room constricted to specks as we realized what this meant. We had no viable backups. There were no means of returning to a consistent database. The only options that remained were a painstaking, low-level reconstruction from the oddly inconsistent database backups, or immediately shutting down the server, removing the hard drives, cloning them and shipping the RAID set to a data recovery service in the hope of finding the filesystem where the pre-mashed database once resided. Neither option was anywhere near a sure path to recovery. Both were frighteningly expensive, both in the procedures themselves and in the time they would take to complete, which would have shut the business down for a week or more.
“What am I witnessing?” I asked in my mind as the drone of servers filled the still air. I occasionally noticed people peeking through the 6-inch-wide pane of clear glass that ran the full height of the server room’s locked doors. They almost appeared to have thought bubbles over their heads. “What are they doing in there? Why don’t they just reboot the server already?”
This was not a fly-by-night company. The organization made eight figures in USD revenue per year and, over its 15-year history, had gained tens of thousands of customers spread across nearly every country and continent. As mentioned before, we were growing rapidly and trying to do our level best to manage the growth. However, a “best effort” at managing the systems was meaningless if such a chain of events had been allowed to happen. The tone in the server room made a state funeral seem airy. We were the first witnesses to the killing of an organization and possibly a few careers. It was an eerie feeling to see such a stark reality being realized and to know that on the other side of the walls were a few hundred people completely unaware that they might be unemployed in a short time.
Victory is a Dish Best Served While Still Employed
“The SAN!” the SysAdmin shouted, and he flicked across the server room floor like he had been galvanized. Only months earlier we had implemented the organization’s first SAN, having recently grown to the point where DAS and NAS were impractical for several of our services. One of those services was our SQL Servers. iSCSI LUNs had been carved out and presented to the database servers to store our most precious and I/O-intensive databases.
The SAN had some fancy features that we had all been fawning over for weeks. One of those fancy features was periodic volume snapshots. Those snapshots could be explored for files. However, could a SAN snapshot provide a consistent copy of a SQL Server database? We weren’t sure.
After we searched for and extricated the most recent data files from the volume snapshot, the database was mounted. DBCC CHECKDB plowed through the data. No errors were returned. Some quick SQL queries were run to pull data that the developer knew had recently been added. The data all seemed to be there and in fine form. After a bit more exploring, we were satisfied that the snapshot had saved us from losing all but the last few hours of data. It was actually a better recovery than we would have had if one of the daily backups had been consistent and restorable. We reconnected the database, got our users back onto their normal daily work and began the humbling process of learning from our mistakes.
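For the curious, the salvage boiled down to something along these lines. The database name, file names and paths are illustrative rather than a transcript of what we actually typed.

    -- Attach the data and log files copied out of the SAN snapshot
    -- under a temporary name
    CREATE DATABASE CRM_FromSnapshot
      ON (FILENAME = N'D:\Recovery\CRM.mdf'),
         (FILENAME = N'D:\Recovery\CRM_log.ldf')
      FOR ATTACH;

    -- Verify logical and physical integrity before trusting the copy
    DBCC CHECKDB (CRM_FromSnapshot) WITH NO_INFOMSGS, ALL_ERRORMSGS;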
The Light at the End of the Tunnel Wasn’t a Train
Fortunately, the organization survived and so did everyone’s jobs. Very few people ever knew how close to ultimate disaster the organization came. The leadership was gracious, knowing how crazy the growth had been and that nearly every department was running on the fumes of energy drinks in order to keep up. They simply expected the IT department to change whatever was necessary to prevent the same mistakes from happening ever again. There was an awful lot to learn from this situation. Here are three major points to take away from this so that you can avoid a terrible fate.
First, treat production data like the precious thing that it is. It is the family jewels, the family house and the family itself. Without it, there is nothing. Protect it jealously. Separate your development environment as far as possible from your production environment. That means separate servers, separate networks, separate accounts, separate everything.
In the course of working on your test environment, perform your actions with user accounts that do not have access to production resources. In fact, explicitly deny those accounts access. Make the export of production data as safe, fresh, automated and one-way as possible so that, beyond there being no technical possibility of accidentally touching production systems, there is never a temptation to intentionally allow an untested operation “just this once.”
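One blunt way to enforce that separation on the SQL Server side is to deny the development accounts even the right to connect to the production instance. The account name below is made up, and your security model may call for something more nuanced, but the idea is the same.

    -- On the production instance: refuse the development accounts
    -- the ability to connect at all (account name is hypothetical)
    USE master;
    DENY CONNECT SQL TO [CORP\dev_team];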
Second, back up your data like you know you should. Put the effort into getting things right and understanding what types of backups you are taking. Are you taking full backups once a day? Differential or incremental? Both differential and incremental? Are database snapshots being taken and, if so, what level of protection do they offer? Are volume snapshots being taken by the storage system and, if so, do they support snapshots of your specific RDBMS’s files? Do you need CDP-level backups? Is there a cluster involved and, if so, what impact does its architecture have on the backups being taken?
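One quick, rough way to answer several of those questions on a SQL Server instance is to look at the backup history that msdb keeps. This is only a sketch and assumes the history tables haven’t been purged.

    -- What backups are actually being taken, of which type, and how recently?
    SELECT d.name,
           b.type,                    -- D = full, I = differential, L = log
           MAX(b.backup_finish_date) AS last_backup
    FROM sys.databases AS d
    LEFT JOIN msdb.dbo.backupset AS b
           ON b.database_name = d.name
    GROUP BY d.name, b.type
    ORDER BY d.name, b.type;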
Let’s pause here for a moment and discuss the confectionery bit of technology that saved the day: SAN volume snapshots. If you’re fortunate enough to have your databases on a SAN volume, consider that you may have some extra threads in your safety net, but that’s about all you can take comfort in. SAN snapshots are very proprietary things and may or may not support the consistent restoration of your RDBMS’s data files. They also might not be turned on, and if they are, you need to pay particular attention to things like what level of point-in-time recovery is allowed, how much history is preserved and whether the space that snapshots take up is dynamically allocated (thus allowing for sudden deletion if space is needed elsewhere). In other words: SAN snapshots are not proper backups. Even if they turn out to be useful, you must pay special attention to their sundry options and caveats. They are a tasty morsel that can be snacked on if the situation calls for it, but nothing more.
There are countless questions one could ask when inspecting a system’s backup procedures. However, the questions above cover a lot of ground and will get you thinking in the right direction. Just don’t stop with those questions. Make sure you’re actually taking backups! Back up early, back up often. Nevertheless, even that is insufficient, because you must…
Third, test your backups! In fact, stop referring to your backups as “backups” and refer to them as “restores.” If you haven’t actually verified and restored them, then refer to them as “alleged restores.” You must go through every step of a full restoration procedure to be certain of their viability. Certainly you can automate parts of a restoration test; in fact, that automation can come in handy when you are actually restoring from a disaster. Just make sure to get your hands on the test restoration process as well. Document it thoroughly, in minute detail. Have others review your documentation to make sure it’s cogent.
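A minimal sketch of the kind of restore drill you might automate is below. The logical file names, paths and database names are assumptions made for illustration; a real drill belongs on a separate instance and should be followed by spot-checks of data the business actually cares about.

    -- Restore the latest full backup under a test name, relocating the files
    RESTORE DATABASE CRM_RestoreTest
      FROM DISK = N'\\backupshare\CRM_full.bak'
      WITH MOVE N'CRM'     TO N'E:\RestoreTest\CRM.mdf',
           MOVE N'CRM_log' TO N'E:\RestoreTest\CRM_log.ldf',
           CHECKSUM, REPLACE, RECOVERY;

    -- Prove the restored copy is usable, not merely present
    DBCC CHECKDB (CRM_RestoreTest) WITH NO_INFOMSGS, ALL_ERRORMSGS;

    -- Spot-check recent data, then clean up
    DROP DATABASE CRM_RestoreTest;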
I recently encountered an organization that performs a full restoration procedure for their systems every six months. They even throw parties when things go well. Fortunately, their last restoration drill resulted in much pizza and confetti. The same can rarely be said for most organizations.
Does all of the above sound like a lot of work? It is a lot of work. Work for which you will be paid. Work you will be in danger of being out of if you neglect data protection. There are only two or three things with which someone in the IT field can completely ruin their career. One of them is carelessly bungling company data through improper backups. Word gets around, and the smoke of that failure clings to a person’s garments for years to come.
Consider today what you can do to safeguard your company and your career from backup-borne catastrophe. If you’re overworked and not being enabled to pursue a data protection initiative, shine up your curriculum vitae and go job hunting because you’re, in essence, being told to sit on a bomb and hope it doesn’t do you harm.
Do you have a similar experience? Can you find other lessons to be learned from the situation I’ve just shared? I know of a few that I didn’t form into words. Undoubtedly this anecdote could be turned into quite a series on safeguarding data and implementing proper IT procedures (as well as how colleagues should treat each other; sincerely, everyone involved behaved as a perfect gentleman throughout the ordeal). Are you impeded from properly backing up and testing restores? Do you seem to be unable to communicate the dangers that exist? Cry on my shoulder or shout into my ears in the comments below.