Updated December 10, 2014
Not too long ago, I was a day-to-day DBA, in the trenches and in the line of fire, overseeing 100+ SQL Servers in my company’s large infrastructure. On my first day on the job, I received a single-page Excel spreadsheet containing a list of server names, color-coded according to their availability level. I had no idea how many databases were on each server, how big these databases were, how many users they had, or even which versions and editions of SQL Server were in use. There was no dedicated monitoring solution for our SQL Servers, and little documentation. It was a daunting task for a new DBA, and the first week was a frantic discovery process in which I documented, manually, as much as I could about my SQL Server infrastructure.
“In essence, the monitoring tool is an early warning system for DBAs.”
We’ll call this Point A.
Skip forward a few years and now, instead of overseeing SQL Server instances as a DBA, I oversee a team of great DBAs who are watching over the instances. We’ve now fully documented all of our SQL Server instances, and the data is stored centrally for analysis and reporting. We’ve invested in numerous tools that monitor various aspects of our database systems, ranging in price from free (i.e. hand rolled) to tens of thousands of dollars on systems where we felt the cost was worth the benefit.
Let’s call this Point B.
In this short paper, I’ll describe how and why we got from Point A to Point B, the business value I perceive in monitoring tools, and the decisions that have driven their adoption (or otherwise) at our company.
What’s required of a monitoring tool
The basic goal of any monitoring tool is to alert the DBAs as soon as it detects any abnormal resource patterns, such as sustained CPU spikes, or a specific problem, such as a failed backup, on a SQL Server. In essence, the monitoring tool is an early warning system for DBAs. It allows us to respond to a problem quickly, before too many people are affected and the crisis escalates. Crucially, of course, we want the data that the monitoring tool provides, and the alerts that arise from it, to be meaningful, reliable and accurate, so that we have enough information contained in the alert to take immediate corrective action, and so that our waking and sleeping hours are not plagued by over-alerting or ‘false positives’.
It must also provide some form of effective reporting and trend analysis of the data it collects, so that a DBA can see the train crash coming and “re-route the tracks” to avoid disaster; for example, we can re-provision the server before we reach an I/O, memory, or disk space crisis.
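The kind of trend analysis described here can be as simple as a straight-line projection over recent usage samples. The sketch below is illustrative only (the sample data and capacity figures are made up), but it shows how a tool can turn collected disk-usage history into a "days until full" forecast:

```python
# A minimal sketch of trend-based capacity forecasting: fit a least-squares
# line to recent disk-usage samples and project when the volume fills up.
# All numbers and names here are illustrative assumptions.

def days_until_full(samples, capacity_gb):
    """samples: list of (day_number, used_gb) pairs, oldest first.
    Returns the estimated number of days from the last sample until the
    disk is full, or None if usage is flat or shrinking."""
    n = len(samples)
    mean_x = sum(d for d, _ in samples) / n
    mean_y = sum(u for _, u in samples) / n
    # Least-squares slope: GB of growth per day
    num = sum((d - mean_x) * (u - mean_y) for d, u in samples)
    den = sum((d - mean_x) ** 2 for d, _ in samples)
    slope = num / den
    if slope <= 0:
        return None  # no growth trend; nothing to forecast
    last_day, last_used = samples[-1]
    return (capacity_gb - last_used) / slope
```

A forecast like this is what lets you re-provision months ahead of the crisis instead of reacting to a 2 a.m. disk-full alert.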
While the ability for the tool itself to take corrective measures, on a limited scale, is a feature worth having, it is not a requirement. For example, we don’t want the system to have the ability to decide to kill a blocked process or recycle SQL services. Restarting a backup job, on the other hand, would be advantageous, but even then would require some level of human oversight. Also, any good backup solution will include these retry mechanisms anyway.
Server-level monitoring is not enough
Back in the early days, somewhere near Point A, we had a hybrid collection of system monitoring tools, many of which I had no insight into because these were under the control of the server or network admins. I was able, over time, to convince the other teams to expand their physical and environmental alerts to include basic SQL Server monitoring, above and beyond the scope of “SQL Service Up or Down”…and as if by magic, I was inducted into the wonderful world of endless Blackberry alerts, very late nights and frustrating false positives.
The latter often arise from network availability issues: if the monitoring tool cannot connect to a SQL instance, it assumes the service is down, when that may not be the case. At least once a quarter, we get a deluge of alerts that turn out to be erroneous. However, it is not always safe to assume an alert storm is false, as it may be a legitimate global outage. Someone trips over a power cord in a data center at least once every three years.
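One common way to dampen this class of false positive is to require several consecutive failed probes, spaced out, before declaring an instance down. The sketch below is a simplified illustration, not any vendor's mechanism; the probe function stands in for whatever check (TCP connect, login, heartbeat query) the tool actually performs:

```python
# A hedged sketch of connection-retry logic for reducing false "server down"
# alerts: a single failed probe is treated as a possible network blip, and
# the alert is raised only after every retry fails.

import time

def instance_is_down(probe, retries=3, delay_seconds=30):
    """Return True only if `probe()` fails `retries` times in a row."""
    for attempt in range(retries):
        if probe():
            return False  # instance answered; no alert needed
        if attempt < retries - 1:
            time.sleep(delay_seconds)  # wait out a transient network issue
    return True  # every probe failed; raise the alert
```

The trade-off is a delay of `retries × delay_seconds` before a genuine outage is reported, which is usually a price worth paying to keep the on-call DBA's trust in the alerts.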
Among our alerts, we were also subject to a constant barrage of space issues resulting from misconfigured database settings, application timeout issues, and some very odd process blocking.
The only way out of this situation was to acquire deeper knowledge of the configuration, settings and activity on all of our SQL Server instances; in other words to set up some SQL Server-specific documentation and monitoring solutions.
The good and bad of “roll your own” monitoring
I started building a documentation solution, via SSIS, to automate the manual work I had been doing to gather information about my SQL Servers, and to centralize all of that knowledge in its own database.
I call it a “documentation” solution because this is largely what it is; it gathers, collates and reports on basic aspects of the SQL Server architecture and helps us enforce standards. For example, it reports on installed versions and editions of SQL Server, data and log files sizes and growth, backups (which ones are being taken and were they successful), and so on.
It is not a performance monitoring solution, although it does gather real-time statistics on the hardware side, as well as collecting some data on SQL Server wait stats and memory utilization. This data at least offers some limited insight into activity and resource usage on our servers, via monitoring of, for example, long-running agent jobs and the number of server connections per day; typically, we refresh the data only every 4 hours.
On the plus side, the solution proved useful to, and popular with, other DBAs in the team, and they adopted its use. On the downside, its popularity put a lot of pressure on me to maintain and improve it over time, in amongst all my other duties. It was also missing one key component that I had to have in any SQL documentation solution: historical trending.
Evaluating SQL Server Monitoring tools
Over recent years, I, or others in my team, have tried out several commercial monitoring tools and solutions, and with each one, the decision on whether to move forward with a full-scale rollout to all servers boiled down to a few key factors.
“…if you turn up in your manager’s office with the news that “server x requires 2 TB of extra disk space, immediately, due to unexpected growth” expect his or her first question to be: ‘Why was it not expected?‘”
What is the cost per server and is there a site license?
The cost of the various solutions we’ve tested varies widely, as do the features of each tool, and the adage of “you get what you pay for” does not always apply to software. A site license for one monitoring system, which we deemed worthy of our attention, approached the 120K dollar mark. However, this system provided monitoring for every single server, not just every SQL Server. The SQL Server monitoring features were extensible, and there were more than a handful of useful SQL-centric alerts.
Other monitoring systems we’ve evaluated were licensed per server, with the cost generally in the range of 1K per server. This forced a decision to use these tools, if we were to acquire them, only on production systems. Why would you monitor a test or development server anyway, right? Well, they suffer more than their fair share of ailments at the hands of developers whose queries are pounding away at all hours of the night.
What does it monitor and how does it alert on problems?
This is an important question for us because we have introduced so many ways over the years to monitor our SQL instances that it is easy to get into a situation of “alert overload”, caused by a combination of failsafe alerts and duplicate alerts from different tools.
For example, Red Gate’s SQL Backup sends email alerts for failed backups. As a failsafe, we also have the SQL Agent job that runs the Red Gate code send a mail on failure. This means we get an email if a backup failure occurs, as well as an alert for the SQL Agent job failure. These emails can also auto-generate a Help Desk ticket and route it to the DBA group.
We get alerts for disk space not just for our data and log files, and TempDB, but also for our backup shares, where we store the backups. We get blocked query and long-running job alerts from another system, and information about actual query performance from yet another system.
On the plus side, these alerts mean that we’re able to respond quickly, around the clock, to ensure there are no interruptions to availability. On the downside, if multiple monitoring tools are needed to cover every required feature, it becomes quite a task in itself to “monitor” the monitoring tools to ensure that all the monitoring mechanisms are working appropriately, and it can also lead to duplicate alerts.
The bottom line for us is that the tool should include almost all of what we need, that it alerts us via email, has the ability to send multiple alerts on a threshold/escalation mechanism and that it resets itself afterward.
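The threshold/escalation/reset behaviour described here can be pictured as a small state machine: fire on a breach, escalate after repeated breaches, and reset automatically once the metric recovers. The sketch below is an illustration of that behaviour under assumed names and levels, not any vendor's API:

```python
# An illustrative alert state machine: notify once per escalation level as
# consecutive breaches accumulate, and reset when the metric recovers, so
# the on-call DBA is not flooded with duplicate emails.
# The recipient names and level sizes are assumptions for the example.

class EscalatingAlert:
    LEVELS = ["dba-on-call", "dba-team", "manager"]

    def __init__(self, threshold, breaches_per_level=3):
        self.threshold = threshold
        self.breaches_per_level = breaches_per_level
        self.consecutive_breaches = 0

    def observe(self, value):
        """Feed one metric sample; return the recipient to notify, or None."""
        if value <= self.threshold:
            self.consecutive_breaches = 0  # recovered: reset automatically
            return None
        self.consecutive_breaches += 1
        level = min((self.consecutive_breaches - 1) // self.breaches_per_level,
                    len(self.LEVELS) - 1)
        # Only notify on the first breach at each escalation level
        if (self.consecutive_breaches - 1) % self.breaches_per_level == 0:
            return self.LEVELS[level]
        return None
```

The automatic reset is the piece that keeps the system trustworthy: once the condition clears, the next incident starts a fresh escalation rather than inheriting a stale alert state.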
Does it provide trending of monitored servers and reports displaying the trends?
Trending is critical for me, as a manager, although not so much for the on-call DBA who is “in the moment”. I need to be able to see, at a glance, trends in failures around certain times of the year/quarter/month, as this uncovers bottlenecks or shortcomings in our capacity planning.
It also gives me insight into resource management, of the human kind. How many DBAs does it take to manage 200 issues a month? Do I need to hire another DBA? Alternatively, can we find a way to reduce significantly the number of issues?
Is there a central monitoring database or repository and is it a web application or heavy client?
Put simply, we want a web application that anyone can get to from anywhere. Having a thick client application is always a burden because it requires, for example, an RDP session to the central server, or an install on the DBA’s laptop, and is generally slow to connect and pull down data, especially over a WAN/VPN connection. A phone app is also appealing, but not a requirement.
Does the solution pose a risk to performance of the servers and is it intrusive?
One question we always have to ask of any monitoring tool vendor is this: is an agent installed on the SQL instance? If so, why, and what is it doing?
Most likely, it is just communicating with the mother ship: the central storage server housing the monitoring database. However, we always need to know exactly what it is doing because, over the years, I’ve learned to expect the unexpected. We’ve seen examples in the past where the actions of a monitoring tool were actually affecting the performance of the normal database workload. In one case, a tool was changing a trace flag setting in order to capture verbose output during query analysis. Suffice to say, without going into specific details, it caused havoc.
Is it extensible with custom code or events?
We absolutely have to be able to extend a base set of alerts, unless the base set covers everything we think we might need now, or ever. Of course, that will never be the case because we are a very creative DBA team.
Taking the Plunge: Build versus Buy
“It does not matter too much to upper management whether the DBA designed their own monitoring solution or paid to get the solution. All that matters is that the solution works to address the problem of saving time and averting disaster. “
Ultimately, all of the dollars spent, and all of the monitoring tools acquired, combine to do two things:
- Save time for the DBAs and Users
- Allow us to respond to issues early
We save time by using the tools to discover poorly written code, or misconfigured servers, which are slowing down business processes and which we can fix by query tuning, or by adding horsepower (memory, CPU, and so on).
Alerted early, we can respond early, before an impending issue becomes a full-blown catastrophe, seriously affecting users; this is without a doubt the most important area for us in IT. If we were to wait hours to respond to a failed backup, the fix might bleed into business hours and affect performance or, very much worse, we might lose data in the interim. If we don’t respond to space alerts, backups or restores will fail and data is not available. If data is not available, someone will not be happy and time is wasted. If someone is not happy, then that someone tells my boss about his or her unhappiness.
Any decent boss will know that tools can be built or bought to avert many of these problems, so if you turn up in your manager’s office with the news that “server x requires 2 TB of extra disk space, immediately, due to unexpected growth”, expect his or her first question to be: “Why was it not expected?“
It is much better to plan ahead and get the tools in place so that, instead, you can show your boss a report that says server x will be out of space in 4 months, and that this needs to be accounted for in the budget for that project.
It does not matter too much to the upper management or executives whose reports are not on time whether the DBA designed their own monitoring solution, manually discovered the issue, or paid for a solution. All that matters is that the solution works to address the problem of saving time and averting disaster.
When a DBA approaches me with a proposal to spend x dollars to buy tool y, I would not really expect any sort of quantitative “ROI” assessment (unless the value of x actually makes my eyes water). However, I would expect their proposal to show a good understanding of why it’s required, the sort of benefits we can expect, and careful consideration of all the previously listed questions.
If the proposal were to build such a tool, I’d expect all of this plus a very careful assessment of the time required to build and maintain the tool internally, and the impacts. In my experience, DBAs do not, in the main, have time to develop the kinds of systems required to answer a lot of these questions and surely do not have time to manually check each aspect of a SQL Server infrastructure. They are often too busy fixing all of the alerts that they are getting to extend the alerting systems.
I am much more inclined now to look for a monitoring system, from a well-known and trusted source, which has the required features and where I know the code will be enhanced over time and fully supported. That is much more efficient than asking my already stretched DBA to add a new event to the C# code he worked on in his spare time last month.
Towards Monitoring Nirvana
The following summarizes some of what I’ve learned about achieving monitoring nirvana without losing too much of your mind:
- Get an affordable and extensible SQL Server monitoring tool, if you have no monitoring at all; it will help you automate the process of discovering and adding your SQL Servers, to get basic monitoring in place. A monitoring tool will provide 95% of the coverage you need, with the other 5% coming from other processes, such as backups and restores, SQL Agent jobs and reporting failures.
- Configure a base set of alerts, using the monitoring tool
- The core components (onion, carrot and celery) of the monitoring stew are CPU, Disk and Memory monitoring.
- Most good monitoring tools will bake in a whole range of wait stats and performance metrics, from Page Life Expectancy to Buffer Cache Hit Ratio to Recompiles and Log Flushes.
- As the needs arise, “season to taste” with your own custom alerts using T-SQL and Dynamic Management Views, or even your own stored procedures.
- Attaining 100% coverage will take time, as well as testing, tweaking and probably a few grey hairs; you will grow to hate your Blackberry, but remember:
- Ultimately, these alerts are helping you find and fix issues, some of which you never even knew you had
- The pain will abate, as you learn the patterns of your servers’ activity, refine your alerting, fix false alerts, and tweak escalation profiles
- Don’t be afraid to temporarily suspend an alert if it’s causing an unacceptable deluge of mail; check to see if your tool allows you to respond to an alert by email, so the system knows you got the alert, are working on the issue, and don’t need to receive 200 similar alerts in the meantime.
- Use the tool’s trend reporting – and if you’ve been diligent in fixing issues and are seeing improvements across the board, don’t be shy in sharing such reports when it comes time for your performance review, or when you need to ask for the money to buy more server licenses.
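To illustrate the "season to taste" step in the list above: a custom alert is often just a T-SQL query over a Dynamic Management View, wrapped in whatever scheduling and mail mechanism you already have. The sketch below builds a simplified check for long-running user requests via `sys.dm_exec_requests`; the threshold and the session-id cutoff are common heuristics, not production-hardened code:

```python
# A hypothetical helper that builds the T-SQL for a custom long-running-query
# alert. The 300-second default and the "session_id > 50" filter (a common
# heuristic for skipping system sessions) are illustrative assumptions.

def long_running_request_query(min_seconds=300):
    """Build a T-SQL check for user requests running longer than min_seconds."""
    return f"""
        SELECT r.session_id,
               r.start_time,
               r.status,
               r.wait_type,
               t.text AS query_text
        FROM sys.dm_exec_requests AS r
        CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
        WHERE r.session_id > 50  -- skip system sessions
          AND DATEDIFF(SECOND, r.start_time, GETDATE()) > {min_seconds};
    """
```

Run on a schedule, a query like this feeds the same email/ticketing pipeline as the tool's built-in alerts, which is what keeps a mixed built-and-bought monitoring stack manageable.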