Monitoring your servers and databases with SCOM and SQL Monitor
SCOM is good at monitoring the status of your servers. SQL Monitor give you a detailed view of your SQL Server instances, and databases, right across your network, however they are hosted. By using tools appropriately, for the tasks they do well, you benefit from a simpler and more comprehensive overall strategy.
Many sysadmin teams will use a tool like System Center Operations Manager (SCOM). It is a useful a system-wide monitor of the availability, performance and security of their server hardware, system services, operating systems, hypervisors and applications. Via its array of Management Packs, you can extend SCOM to monitor a range of different systems. If you need to start monitoring VMWare, you install and configure the relevant VMWare management packs. Likewise, if you need to start monitoring SQL Server, you can install and configure the relevant SQL Management Packs. SCOM will discover and start monitoring all your SQL Server instances and databases, and raising alerts on any errors, abnormal conditions, and so on.
Extending SCOM to monitor SQL Server is an obvious option for any IT Manager who has existing System Center licenses, and system administrators who already use it to monitor other parts of their infrastructure. However, it’s not as simple as it sounds. I have first-hand experience with SCOM, and I have spoken to many businesses who have gone down the route of extending SCOM to monitor their SQL Servers. Many of them encounter problems with configuring the SQL Server Monitoring, alert management, and with the incompleteness of the SQL monitoring data.
The obvious alternative to extending SCOM with the SQL Management Packs is to buy a dedicated SQL Server monitoring tool. As a pre-sales engineer at Redgate, I must confess up front to a natural bias towards this latter option. Nevertheless, in this article I’ll try to offer a straightforward explanation of why, in my experience, you’ll get a better return-on-investment by supplementing SCOM with a specialist database monitoring tool. You then use SCOM to do what it does best, which is monitor the availability and overall health of the servers and network, and at the same time benefit from a dedicated SQL monitoring tool to cover the specialized details of SQL Server monitoring.
How SCOM works
System Center Operations Manager (SCOM), as its name indicates, isn’t a dedicated SQL Server monitoring tool. Rather, it’s a suite of software for deploying, configuring, managing and monitoring all the servers, components and services of a Windows-based IT infrastructure.
Operations Manager is designed to monitor an entire infrastructure, so it’s often a complex beast with many moving parts, pre-requisites and dependencies, server roles, Agents and so on. Microsoft outlines them in its guide to the system requirements for SCOM. The architecture looks something like that shown in Figure 1 (but see the docs for further details).
Figure 1: SCOM Architecture
I won’t delve into details of how all this works, as it’s a big topic, and these details are well summarized in the documentation. However, in short, Operations Manager discovers servers to monitor, and installs an Agent on each one, which will “collect data, compare sampled data to predefined values, create alerts, and run responses“. A Management Server sends configuration details and monitoring logic to each agent, using management packs (more on this shortly) that define the monitoring logic for each component. Monitors collect state data on the ‘health’ of the monitored object, and Rules define what events and performance data to collect and what to do with it.
The default installation of SCOM is essentially just a bare-bones framework on which to hang management packs. Each pack defines both the structure of the application, or service, to be monitored, all its components and their relationships (the service model) and what data should be collected to assess the heath of that application (the health model).
An Agent auto-discovers all the predefined ‘objects’ within the service model for a management pack. For SQL Server, this will include databases, database files, jobs and so on. For each object (grouped logically into classes), it collects the monitoring data defined by the health model, and sends it back to the management server, which stores the monitoring and alerting data in an operational database, and in a data warehouse (for historical reporting).
Common SQL Server monitoring pain points with SCOM
SCOM is a good system-level monitoring tool, for reporting the general health and performance of a diverse server infrastructure, and the services that run on it. It collects data on a vast array of system metrics, services, process states and Windows performance counters, for every server. It will monitor the server event logs. It will report on any failed Server logins, and other Server errors and warnings. It will warn you of CPU, memory or I/O issues, or network errors. It will alert you when storage space is low on a physical disk, or logical volume.
Problems only really emerge if you attempt to extend SCOM to monitor complex applications or services, such as SQL Server. SCOM was never designed to delve into the ‘guts’ of a complex service, to help you work out exactly what is going on in there, when something goes awry. It was envisaged as a way of providing an overview.
Customizing SCOM’s rules, monitors and alerts for the SQL Management Packs is hard, exacerbated by relatively poor documentation. It requires specialist knowledge of both SCOM and SQL Server because it often involves writing custom rules and monitors, or in extreme cases, even writing your own management pack. Even once it’s all set up and working, many teams still find that the resulting monitoring data is hard to understand. Again, specialist expertise is required to interpret the data and act on it.
For a busy operational department, the time spent customizing is wasted time, and the specialized knowledge required translates into training costs. The difficulty for an organization is in understanding the line between what SCOM can and should monitor, and what can be done better, and in a more scalable way, by a tool designed specifically to monitor a specialized service, such as SQL Server.
Customizing SQL management packs can be a slippery slope. To illustrate why it can become a big distraction, I’ll need to describe some issues I’ve experienced in extending SCOM to monitor SQL Server. I won’t cover any of the broader issues such as reliability, relating mainly to the complexity of the SCOM architecture, or Agent management (SCOM also requires an Agent to be installed on each monitored SQL Servers.
Configuring and managing the SQL monitoring
Once we install and configure the required SQL Server management packs, SCOM will check the availability of SQL Server instances and database, the SQL Server configuration settings, the success and performance of Agent jobs, the free space in your database and log files, an array of performance data (via PerfMon counters), and more.
However, it is a large leap from “installing the SQL Server Management Packs“, to arriving at a SQL Server monitoring system that collects exactly the data you need to resolve performance and security issues on your databases, provides useful alerts on which you can take immediate action, and presents all the data to in an intuitive way.
The first task is deciding which management packs you need, and there are a lot of them; different packs for different SQL Server versions, and different packs for different versions of the various components (integrations and reporting services, and so on). Depending on the SQL Server components you have, you’ll need about 15 different packs for each version of SQL Server in your environment.
Figure 2: SQL Management Packs for SCOM
Each management pack comes with its own set of rules and monitors, which means that, for example, if you want to be alerted on a specific CPU utilization threshold, you will need to configure it for each version of SQL Server. It’s easy to make mistakes, and end up collecting duplicate data.
As you scale out the number of servers that you monitor, you can also find that the performance of the monitoring system degrades, as the management server struggles to keep up with the sheer volume of monitoring data generated. At the same time, the SCOM data warehouse can grow rapidly in size. It takes time, experience and very careful configuration of the management packs to avoid these issues.
Over-alerting
If my previous experiences with SCOM are common, then you should prepare to be overloaded with alerts, especially if your SCOM engineer doesn’t get the management pack configuration right. Alert configuration in SCOM often causes confusion. Unless your SCOM expert also has deep knowledge of SQL Server, or collaborates very closely with the DBAs, then they will also find it very hard to define accurate performance thresholds for each SQL Server alert.
As described vividly in this article, a typical result is that the SCOM team use mainly the ‘default’ alert configurations and then “set up a subscription to email the SQL team for all SQL related alerts“. The SQL team are quickly overwhelmed by a deluge of alerts, many of which they can’t act on, or just don’t care about. For them, SCOM behaves more like an ‘out-of-control ticket-raising system’ than a monitoring tool that they can use to fix and continuously improve the performance and security of their database systems.
Fine-tuning the alerting system requires a lot of man hours, coupled with deep, tool-specific knowledge, to minimize the high number of duplicate alerts and false-positives that are generated. Frustratingly, even then, the alerts often contain insufficient information to allow a quick, meaningful response.
Missing and incomplete data
As emphasized earlier, SCOM is a good system-level tool, but will be missing important, diagnostic and troubleshooting data that you need to investigate the lower-level details of the performance problem, such as the execution statistics, query plans, wait states, blocking chain, and so on.
If a built-in rule doesn’t provide what you need, you can “if you have an extra lobe to your brain and the tenacity of a limpet fish” (as a one-time SCOM user put it to me), add custom rules to run scripts to collect extra data.
Often, working with SCOM for SQL Server is a continuous process of discovering the data you really need to diagnose a problem and then adding it:
- A SQL alert is raised but doesn’t provide enough detailed data.
- You triage the problem, manually, and start to understand what data you really needed to have at hand, to help resolve such alerts
- You write custom scripts and implement new SCOM rules to collect it all.
- Now you have on your dashboard the data you need…for that one problem.
- Repeat
Of course, it takes time to write, configure and maintain these custom rules, across all the versions of SQL Server you monitor. In extreme cases, teams have resorted to writing their own custom MPs; in essence, writing their own monitoring system for a given service, and then plugging it into SCOM. It is a high price to pay for an enterprise-wide monitoring tool.
Non-intuitive, fragmented view of the monitoring data
A SCOM dashboard tries to summarize a broad range of metrics from lots of different groups of server objects. They are often visually confusing, and hard to digest.
Again, many of the difficulties are rooted in underlying problems with the SCOM configuration, such as having a high number of Management Packs, and inefficient grouping of the various types of monitored objects. This causes confusion for users, about where to go to find the information they require, and how to drill down into the information they need, to diagnose a problem. This is hard because we are offered a vast array of metrics that aren’t logically correlated, and so there is no easy way to relate activity on one set of monitored objects with the alerts raised for another, related set of objects. If Disk IO is high, for example, it was often hard to connect that directly to activity on a specific database or filegroup.
Figure 3: SCOM dashboard
I remember the claustrophobic feeling that the deeper I drilled in, the further I got from finding the nuggets of information I needed; it felt like playing darts while blindfolded, hoping to somehow hit the target. Sometimes you do, most times you don’t.
It’s up to you to customize the dashboard with the information you want to see. If you’d like to see a baseline for a metric, so you compare current behavior to ‘the norm’, you’ll need to build the baseline. As noted earlier, teams often build the dashboard they need, in a fragmented way, as they learn what data they really need to diagnose each type of problem.
A common result is that, while SCOM offers one focal point for all monitoring of all systems, it is impenetrable to those in the IT team who may be required to respond to alerts, but who lack detailed domain knowledge. In other word, the tool gets no traction among the various teams and so the organization derives little value from it.
Advantages of a dedicated SQL monitoring tool
In organizations that rely exclusively on SCOM for their monitoring, it’s quite common to find that the tool is the preserve of a SCOM expert (singular, often) and that few others want to take the time to learn it. This a far cry from what we’d inevitably refer to these days as a “DevOps” tool, meaning one that provides a common mechanism of collaboration for both those who configure and manage the servers and databases, and the teams who need to use the data it provides it to diagnose and correct problems.
For an organization with existing SCOM licenses, investment in a separate, dedicated monitoring tool for SQL Server, will always be a “hard sell”. After all, what’s on offer, on the face of it, is ‘splintering’ the system monitoring and alerting mechanisms across multiple tools in order to collect specialized data for SQL Server.
However, if your team are, in effect, “rolling their own SQL monitoring”, using bespoke MPs, in order to make it effective, then the costs involved in writing and maintaining all this will likely outweigh the licensing costs of a dedicated SQL monitoring tool.
A tool such as SQL Monitor is designed specifically to help pinpoint the stress points and bottlenecks in the often-complex performance patterns that databases exhibit when under load and give you a clear contextual understanding of what was happening within a database when a problem started.
Easily expand and adapt the monitoring strategy
Adapting, expanding and managing the SQL Server monitoring is relatively easy:
- SQL Monitor is agentless; it installs nothing at all on the SQL Server instances from which it collects data.
- Scaling out the number of servers to monitor is easy, since SQL Monitor supports multiple base monitors, and still collates all the collected data in single web console.
- Out-of-the-box, SQL Monitor is, effectively, “zero configuration”. Once you’ve added a SQL Server, it immediately starts collecting a carefully curated set of the most important SQL Server metrics, at the machine, SQL Server instance and database levels (including full support for Azure SQL databases and pools)
- Expanding and adapting the monitoring strategy for each server is easy, using custom metrics You can add any metric that you can define with a SQL query that returns an integer value.
- It has pre-configured, but easily adjustable, alerts for all the common causes of SQL Server-related problems. It automatically groups alerts, so that instead of receiving hundreds of individual alerts in the inbox, you see groups of alerts as they relate to of the same problem, such as the same long running query.
Detailed SQL Server diagnostic data
The data that SQL Monitor collects is exactly the data that a DBA needs to track use of critical resources, and to diagnose the cause of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers, instances and databases. It covers many common issues that, without additonal configuration, SCOM does not. For example:
- Identify and alert on Blocking – provides the full blocking chain, in easy graphical format, so we can diagnose and resolve the blocking quickly
- Identify and alert on Deadlocks – presents the Extended Events deadlock graph in the alert details, plus details of the sessions and queries that were running at the time it occurred
- Identify long-running-queries, and causes of performance issues, including:
- Drill down to query execution statistics, wait states and Query Plans
- Ability to visualize Query plan changes
- Database growth management – provides detailed database and disk growth tracking data and analyses trends in this data to predict accurately when either a disk volume will run out of free space, or a database file will need to grow
Easy visualization, trends and baselines
To understand any service that responds to demand, such as a database or web server, you need baselines. Demand is never consistent and in order to understand problems and help with future capacity and infrastructure planning, you need to compare current performance with past performance.
Drilling further in SQL Monitor, we can intuitively correlate data from multiple metrics, lay it against a baseline and to identify workload behaviors. We can quickly zoom into a specific point in time, and all diagnostic data will be aggregated, contextually, providing efficient root cause analysis capabilities.
Figure 4: SQL Monitor interactive graph, with time-slice resource use and query activity
Each alert can be viewed within the context of overall activity on the affected server and will provide the essential information required to describe the problem. This means that a “first responder “does not necessarily need specialized knowledge of the monitoring tool, or of SQL Server, in order to either respond, and close the alert, in simple cases, or to ensure that the right person or team is made aware of the issue, promptly.
With a large server estate, the reality is that for some or most of the time, the people doing the first line monitoring won’t have deep domain expertise. If a SQL monitoring tool makes its data intuitive and accessible, then it is also one that extends easily beyond production. Your developers can use the data too. Monitor your dev and test servers too, and suddenly you have a proper DevOps collaboration tool, and a route to the purest form of proactive monitoring; detecting and stopping problems before they even get to production.
Using SCOM and SQL Monitor together?
SQL Server is just a service on a Windows or Linux server. Tools like SCOM and Nagios work well to provide the status of the server. SQL Monitor can give you a view of your SQL Server instances right across your network, however they are hosted. By using tools appropriately for the things, they do well, you can benefit from a simpler and more comprehensive overall strategy, even though you need two different tools. The idea that, by standardizing on a single tool you somehow run more efficiently could be a false assumption. After all, using a hammer and a chisel is far better than carving with only a hammer, especially if the chisel is sharp.
Summary
Any operations team needs an overall view of the many components in a network, the services, devices and applications. It makes sense to standardize on tools for a particular purpose. However, the more that you expand and customize a tool because it doesn’t quite fit your requirements, the more time that is wasted, the more the team gets immersed in the task, and the more that your team fragments into specialized areas of expertise. Tools like SCOM and Nagios are fine for the use for which they were intended, but the demands of a specialized service such as a SQL Server instance go well beyond the capabilities of these products. On the other hand, a SQL Monitoring system would be ineffective for monitoring the servers hosting SQL Server, and the network itself. You need the right tool for the job.
Tools in this post
Redgate Monitor
Real-time SQL Server and PostgreSQL performance monitoring, with alerts and diagnostics