Any SQL Server monitoring tool must gather the metrics that will allow a DBA to diagnose CPU, memory or I/O issues on their SQL Servers. It should also provide a set of accurate, reliable, configurable alerts that will inform the DBA of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers.
This article provides an in-depth guide to the monitoring and alerting functionality available in one such tool, Redgate SQL Monitor. It focuses on the latest edition (5.0), which includes several key new features, such as performance diagnosis using wait statistics, the ability to compare to baselines, and more.
Any SQL Server monitoring tool must gather the metrics that will allow a DBA to diagnose CPU, memory or I/O issues on their SQL Servers. It should also provide a set of accurate, reliable, configurable alerts that will inform the DBA of any abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers.
This article provides an in-depth guide to the monitoring and alerting functionality available in one such tool, Redgate SQL Monitor. It explains how the tool works, reviews its features, its architecture, shows how to configure it, and offers some practical examples of how to use it to monitor your SQL Server database servers and instances. It covers the latest version of the tool (5.0), which supports versions of SQL Server from 2005 up to and including SQL Server 2016.
The tool provides core monitoring functionality, such as built-in and customizable alerting, performance diagnosis using wait statistics, and the ability to capture and compare performance baselines, as well as new features such as Availability Group monitoring, and integration with Active Directory.
When I first started out as a SQL Server DBA, I was in charge of about 20 instances. My morning ritual was to log remotely into each instance, one at a time, and work through a series of checklists, verifying each instance was running as expected, checking disk space, confirming that all overnight jobs ran successfully, examining error logs in case of issues, checking key performance counters, and much more.
Even with only 20 instances, this manual checking was very time consuming. In fact, on some very busy days, I didn't even have time to check every instance. Also, the process got boring and tedious very quickly.
If Brad's description of his early days as a DBA bears any similarity to a typical day for you, or if you spend a lot of time gathering performance and other metrics from your SQL Server instances and databases, manually, in response to various reported issues, then it's likely that a SQL Server-specific monitoring solution will save you a lot of time.
The monitoring solution will gather the metrics you need to diagnose any CPU, memory or I/O issues on your SQL Servers. It will provide a set of configurable alerts that will inform you when an error occurs, when jobs fail or fail to run, database or server properties deviate from the norm, there are sustained spikes in resource usage, or abnormal trends, and so on.
SQL Monitor is both a performance monitoring tool and an alerting tool. It continuously monitors a set of target SQL Servers, collecting a wide range of server-, instance-, and database-level diagnostic data that will help a DBA to evaluate current activity levels on the server, and to identify resource bottlenecks, query performance problems, and other common errors and issues on their database systems.
SQL Monitor raises an alert whenever a particular event occurs (such as a SQL Server error or a job failure) or whenever some pre-defined, but configurable, threshold value for some metric (such as disk space usage, or query duration) is exceeded. Via the SQL Monitor web interface, users can drill into the list of all the alerts raised on the monitored servers, along with various overviews and summaries of the collected diagnostic data.
Therefore, for every alert raised, we can see various snapshots, aggregations and summaries of resource usage and query activity on the server around the time of the alert, helping us to determine quickly what might have caused it. We can also compare resource metrics at the time a certain issue occurred to their baseline values at a previous time, such as the same period on a previous day, in order to see exactly "what changed".
Alongside the pre-set summaries of the data that SQL Monitor automatically displays in the web interface, we also create analysis graphs displaying the values of multiple performance counters over a desired time range. Outside the web interface, we can even run custom SSRS reports directly against the SQL Monitor database.
Users access the monitoring data and alerts via the SQL Monitor web interface, meaning the data is accessible to authorized users on any device that has a web browser.
SQL Monitor collects all its data using a lightweight, agentless, 3-tier architecture that minimizes the danger of the observer effect (where the act of collecting performance data from a target server affects the performance of that server), and stores all the historical data in a SQL Server SQL Monitor database
Perhaps the biggest strength of SQL Monitor is that it is both pre-configured and highly customizable. It comes with a pre-defined set of base metrics that will allow it to start collecting server-, instance- and database-level metrics as soon as we add the first SQL Server we wish to monitor. It also comes with a pre-configured set of alerts for common resource usage, query performance and common errors and issues.
In other words, out-of-the-box, with minimal set up and zero configuration, SQL Monitor satisfies the Pareto Principle (also known as the 80-20 rule); it will not identify and raise an alert for every possible SQL Server-related problem, but it will catch the most common ones.
From this pre-configured base, we can start monitoring and alerting immediately and, as we learn more about our servers, and the level of alerting, start manually adjusting alerting thresholds so that they are best suited for the level of activity on our servers. SQL Monitor tries to configure sensible threshold values for alerts, but there are very few threshold values that aren't fully dependent on the specific environment, and on the applications and the patterns of activity on your servers.
Alert tuning, and even disabling alerts, if necessary, is a vital part of maintaining a strong signal-to-noise ratio and therefore a healthy monitoring system that the team actually use rather than just attempt to ignore. Over time you need to tune and adjust your monitoring system so that all alerts are meaningful, and allow the team to take fast, corrective action.
Once comfortable with the level of alerting, and confident that the most important alerts don't get lost in a "sea of noise", we can start to build out our monitoring and alerting capabilities. SQL Monitor lets us write our own Transact-SQL code to create custom metrics that will capture the specific monitoring data we need from our servers, instances and databases, and create alerts on these custom metrics, as required. We'll cover the creation of custom metrics and alerts later in the article.
SQL Monitor's key mode of operation is to raise alerts based on abnormal or undesirable conditions and properties, as well as specific errors, on any of the monitored servers, instances and databases. For example, it might raise an alert if the monitoring data reveals:
NONE, or an index's fragmentation level exceeding a certain value
We can find a full list of the pre-defined alerts at monitor.red-gate.com/Configuration/Alerts. Out of the box, SQL Monitor will raise 22 alerts directly related to our SQL Server instances, ranging alerts on long running queries, to overdue backups, to blocking and deadlocks. It will also raise another 11 related to the host machine (disk space, memory and CPU usage, and so on). There are also a further 7 alerts specific to Availability Groups. Each alert is pre-configured with default thresholds, but it is very easy to adjust the threshold settings.
The monitor.red-gate.com website is a live copy of SQL Monitor that allows anyone on the Internet to view the alerts and performance of the two SQL Server instances that run the backend databases for both the SQLServerCentral.com and Simple-Talk.com websites.
Later in this article, we'll review some specific examples of these alerts, and the supporting diagnostic data that SQL Monitor presents for each one.
SQL Monitor comes with a pre-defined set of base metrics that means it can start collecting server-, instance- and database-level metrics as soon as we add a SQL Server. It also collects various server, instance and database properties, all of which are useful in identifying the root cause of any alerts raised on the monitored servers.
It interrogates the Windows server for metrics such as CPU and memory use, as well as disk space usage. At the SQL Server instance and database-levels, it uses various built-in functions, system tables, views and performance counters to collect a wide range of SQL Server and diagnostic data and properties that can expose resource usage issues (for example, high memory or CPU use), performance issues (such as long running queries, blocking), as well as specific error and problems (such as deadlocks, index fragmentation, SQL Server errors). It also collects other important information, such database configuration properties.
The following sections summarize, briefly, some of the key predefined metrics that SQL Monitor collects. As noted earlier, and as I'll demonstrate later, we can expand this list by defining our own custom metrics.
SQL Monitor tracks a wide variety of Performance Monitor counters to identify bottlenecks in CPU, memory, disk I/O, or network resources. To make this easier, Performance Monitor counters are divided into four categories:
For a full list of all the Performance Counters collected by SQL Monitor, visit monitor.red-gate.com/Analysis/Graphs
As well as viewing strategic snapshots of this data, at various places in the web interface, we can also perform custom analysis of these metrics, on the Analysis tab in the SQL Monitor web interface (see later).
Alongside PerfMon metrics, SQL Monitor tracks data held in various Dynamic Management Views (DMVs), offering deeper insight into the workload distribution across an instance, and into databases or specific 'hot' tables that are the source of long running queries and resource waits.
SQL Monitor sets alerts on current, long running queries, based on data from the
sys.dm_exec_ requests DMV and also queries information from
SQL Monitor uses this information to present a list of the most expensive queries, by instance and by database, plus the SQL text of the executing batch and associated query plan handle, at the current time or at the time of an alert. You can sort these by execution time, count and resources used as necessary.
SQL Monitor also captures which resource waits have affected that query, if any.
Every time a session has to wait for some reason before the requested work can continue, SQL Server records the length of time waited, and the resource on which the session is waiting. For currently-waiting tasks, it exposes the waits through the
sys.dm_os_waiting_tasks DMV. SQL Monitor mines this data to present a list of the top ten resources waits on the server over a given time (such as for the previous 15 minutes), so we can see if the most significant waits are, for example, related to log writes, or waits to acquire locks and latches, or waits on some other resource. Clicking on a significant wait type reveals any queries that were delayed by this wait, query plan details, target database and the duration of the wait.
For top-level, system-wide waits, SQL Monitor also mines data from the
sys.dm_os_wait_stats DMV, which exposes wait statistics aggregated across all session IDs, to provide a summary review of where the major waits are on a given instance.
SQL Monitor collects diagnostic data that allows it to report fragmented indexes in any of the monitored databases, and will raise an alert at a pre-defined fragmentation threshold.
Currently, for SQL Server 2000 compatibility, it collects this data by running periodically the deprecated
DBCC SHOWCONTIG command, although it will likely switch at some point soon to using the
By either route, collecting this data is potentially resource intensive, because SQL Server will need to scan every index page, and collect statistics, each time. For this reason, SQL Monitor only collects this data once a week, at 2AM on Sunday morning, and uses the FAST collection mode (roughly equivalent to using
LIMITED mode with the
sys.dm_db _index_physical_stats function).
While PerfMon and DMV data is very useful when troubleshooting problems, sometimes we need even more detailed information. SQL Monitor has the ability to collect selected Profiler data, as needed. Normally, this feature is left off because it can be resource intensive to collect lots of Profiler data, even though a server-side trace is used. Generally, it is only turned on to help provide more extensive data about specific alerts. For example, if tracing has been turned on, and a specific alert fires, then you will be able to see the Profiler events that occurred before and after the alert.
The Configuration | Trace page on SQL Monitor displays a warning icon to make it easier to see which instances have trace enabled.
As well as expensive queries running against SQL Server instances, it is often important to know what processes are running on your SQL Server box, and this feature keeps track of the most resource intensive processes running on your server, in terms of CPU time. This will help to determine if it is a SQL Server instance, or perhaps another application, that is using up valuable resources.
For each server, instance, and database, essential properties are collected, such as the OS version, SQL Server version, Service Pack levels, and much more. This information helps the DBA stay informed about the basic configuration of the environment.
SQL Monitor presents the data it collects in various ways across the Overviews, Alerts, and Analysis screens that comprise the three main sections of the SQL Monitor web interface.
The Overviews screen presents various predefined views into the monitoring data, offering summaries at the machine right down to database level of what's happening on the system right now, or at a specified time:
Server / operating system level, including:
SQL Server instance level, including:
Database level, including:
The Availability Group level, including:
On the Overviews screen, at any level, we can use the back in time feature to see a summary of the metrics at a previous point in time.
The Alerts screen displays details of all current, un-cleared alerts for all monitored SQL Servers. Accompanying the details of each alert, SQL Monitor provides a summary of diagnostic data collected at the time the alert was raised.
We'll look at alerts in more detail later in the article, but the supporting data includes summaries of:
On the Analysis screen, we can plot specific metrics, namely performance counter metrics plus custom metrics, over time.
We can also display and overlay multiple metrics for a given time period, and we can create metric baselines, and compare them to usage levels for the same metric, at the time of the alert. Again, we'll discuss how this works a little later.
Before we dive into the details of how to install, configure and use SQL Monitor, we need to take a high-level look at its design. SQL Monitor is divided into three key components:
SQL Monitor is an agentless architecture; it collects diagnostic data without the need to install an agent on each SQL Server instance. Instead, the base monitor connects remotely to each server, collecting the data using a combination of WMI, T-SQL, remote registry calls, remote file access, and ping.
While an agentless environment minimizes the monitoring overhead, it is still unwise to install the SQL Monitor components on the same machine as the SQL Server instances we wish to monitor. The overhead of running the base monitor and web server, as well as writing all the data to the repository, which can grow large on a busy server, could affect the performance of the monitored server (the Observer effect). Equally, it's of little help if the monitoring service, responsible for telling us that something's failed on the server, goes down at the same time as the server.
Therefore, at the very least, we need one additional machine to host the web server, base monitor and SQL Monitor database, although the recommended configuration, generally, is two separate servers, one to host the base monitor and web server and one to host the SQL Monitor database. Ultimately, if we can install each of the three components on a dedicated server, it will allow us to scale SQL Monitor to monitor the largest number of SQL Server instances.
The SQL Monitor 5 documentation provides an overview of the SQL Monitor components, reproduced in Figure 1.
Figure 1 – SQL Monitor architecture
The Supported platforms and hardware and performance guidelines section of the documentation provides full details of supported platforms for the three SQL Monitor components, as well as supported SQL Server versions for monitoring, and supported browser versions for the web clients. It also offers hardware guidelines for the machines hosting the base monitor and SQL Monitor database.
The SQL Monitor installation process is straightforward; simply run
SQLMonitor.exe on the server on which you wish to install the web server component (and possibly also monitoring service) and complete the installation steps in the wizard. The installation documentation covers these steps in detail, so here we'll just review briefly the main considerations.
You can either install the web server and monitoring service on the same machine, or on separate machines. If you choose the latter option, you'll install the web server first, and then a separate installer allows you to install only the monitoring service, on a separate server.
For the examples in this article, we used a 2-server architecture for SQL Monitor, and so installed the web server and base monitor on the same virtual machine.
The default option in the installation wizard is to install the dedicated SQL Monitor web server, which uses the .NET 4 runtime. If you chose this option, SQL Monitor will install and configure the SQL Monitor 5 Web Service. See the documentation for the required permissions for the Web Service account.
However, the installation process will also detect if the server hosts a supported version of IIS (v7 or later) and, if so, the alternative is to add SQL Monitor to the existing IIS server. If you choose to use IIS, the SQL Monitor Web Service is not installed.
If IIS isn't installed, it's worth installing the dedicated SM web server, and you may still opt to do so even if the machine already hosts IIS v7 or later. The dedicated web server is much easier to set up, as you don't have to juggle any IIS permissions, and is very lightweight.
The examples in this article use the dedicated SQL Monitor web server. By default, it uses TCP port number 8080 for incoming connections, and you can test the port assignment to ensure no other program is already using it. Here, for this reason, we assign the SQL Monitor Webserver to port 8081.
If you want to access the SQL Monitor web server outside your organization's firewall, you will need to configure the web server so that it can be accessed through the firewall.
In this example, we simply install the monitoring service ( a.k.a. base monitor) on the same machine as the web server. The Base Monitor service uses default TCP port 7399 to communicate with the Web Server. The base monitor will need an account that has access to the SQL Monitor database, as well as an account, which may be the same account or a different one, to access to each of the SQL Server instances and databases that we wish to monitor. The former we specify during installation and the latter post-installation, during initial configuration. The Requirements section of the documentation specifies the minimum permissions required for each of these accounts.
The base monitor can connect to the SQL Monitor database using Windows (the default, and used here) or SQL Server authentication.
If you run Services.msc on the machine where you installed the base monitor, the name of the base monitor service is SQL Monitor 5 Base Monitor. If you ever see the message "Unable to detect base monitor" when starting up SQL Monitor, check to see that the service is still running (it will stop working if, for example, we change the account password and forget to update the service to use the new one).
The final major step in the installation is to set up the SQL Monitor database. This will store all the data collected from each of the monitored SQL servers. Depending on the size of your SQL Server databases, and activity on your servers, the SQL Monitor database can grow large, quickly, and it's essential to pre-plan for this during set up.
The documentation estimates that monitoring one server, hosting a single SQL Server instance, will require between 150 and 450 MB of storage per day. Therefore, for 10 servers, we can expect the SQL Monitor database to grow by 10 to 30 GB per week.
Decide how many weeks' worth of monitoring data you wish to retain (covered in more detail in the Configuring SQL Monitor section, later) and then create a custom SQL Monitor database, with initial
filegrowth for the data and log files that will avoid excessive autogrowth events, which can cause file fragmentation.
If, instead, you decide to let SQL Monitor create the SQL Monitor database for you, it will create a repository database called
RedGateMonitor using the defaults supplied by the model database for that instance. It then alters the database, first to set the initial size, and growth increment, of the data file to 512MB, and then to set its recovery model to SIMPLE.
For the examples in this article, we created a new SQL Server database, sized appropriately, and supplied a Windows account for the base monitor to connect to the database, as discussed earlier.
With the installation process complete, all we need to do is launch SQL Monitor. The first time we do so, create the administrative role for the SQL Monitor web interface and then log in, and start adding the SQL Server machines and instances that we wish to monitor.
Administrator user has full access to all SQL Monitor features. Subsequently, the administrator can create less-privileged user roles (covered in the Configuration section, a little later in the article) for users who only need to view and analyze the monitoring data.
We can access SQL Monitor from any web client, simply by supplying the URL with the address and port number for the web server component:
On the local machine hosting the web server, we can also launch SQL Monitor from the Start menu.
In order to start using SQL Monitor, we just need to add some SQL Servers for it to monitor. Having done so, it starts monitoring immediately, based on a pre-defined set of base performance metrics and alerts.
We add new server and instances in the Configuration | Monitored Servers section of SQL Monitor (we can also reach this page from the left hand menu of the Global Overviews page)
To add a new server, simply click on Add SQL Server.
Figure 2 – Adding a SQL Server instance
We need to supply the following information:
By default, SQL Monitor will use the Base Monitor service account, which we set up during installation to allow it to connect to the SQL Monitor database, for both the host machine account and the SQL Server instance account.
However, it is a best practice to create separate, dedicated domain accounts for each credential. The host machine account will generally have local admin rights to the server where the SQL Server instance is located, and the SQL Server instance account will generally have SQL Server
sysadmin rights, although in neither case is the use of such privileged access mandatory (see Requirements).
As we add new servers over time, the Monitored servers page will provide a full list of all monitored servers and instances, reporting their current status and allowing us to perform actions such as edit credentials, suspend monitoring, retry a connection, and set a maintenance window.
New to SQL Monitor 5 is integration into Active Directory to help manage and control access to the monitored SQL Servers. If certain users/groups are not allowed to view a certain server then that server won't appear in SQL Monitor when viewed by those users.
On the Configuration page, click Authentication Settings and you'll see options to log in to SQL Monitor using either SQL Monitor credentials, or Active Directory credentials. If you choose the latter, you'll need to supply at minimum a domain name and credentials for a service account in Active Directory.
Having done so, tested the connection and saved the settings, you'll see the message in Figure 3, where we need to confirm the AD user or group that will have administrative access to SQL Monitor.
Figure 3 – Adding an AD user or group as an Administrator
We can then use the Configuration | Manage Users page to create lower-privileged AD users and groups that have access only to specific servers.
Figure 4 – Adding a standard user or group with access only to specific servers
If permissions conflict, the least restrictive is taken. For example, if a user is a member of an AD group that doesn't have access to a certain server but also a member of an AD group that does have access, then SQL Monitor will give that user access.
SQL Monitor allows us to track alerts and analyze associated monitoring data across three different screens, as described previously:
Combined, these tools offer a degree of proactive control over the performance of our database systems, as well as the ability to respond to emergencies as they arise.
Having installed SQL Monitor, and added some SQL Server instances, it automatically begins to collect data and raise alerts.
SQL Monitor raises an alert in the following situations:
We can configure each alert, setting the desired threshold levels appropriate for our environment, or even disabling it, if required. We'll cover alert configuration in detail later in the article.
When SQL Monitor raises an alert, it stays raised until someone with proper permission clears it. It is very important that the team are diligent in clearing alerts that they have investigated and resolved, or that simply don't require immediate attention. In this way, the list of alerts SQL Monitor displays will always provide an excellent place to begin troubleshooting.
The first screen that comes up upon starting SQL Monitor is the "Global Overview" screen, as shown in Figure 5.
Figure 5 – The "Global Overview" screen in SQL Monitor
The Monitored Servers explorer, on the left, lists host machines and SQL Server instances in a hierarchy. The "(2)" next to Global Overview indicates one host machine (called win-3xxx) and two instances (local and dojo), although of course a host machine might run two or more SQL Server instances.
The main portion of the screen summarizes all un-cleared and unread alerts across all monitored machines and instances. SQL Monitor raises specific alerts both for the host machine level, and at the SQL Server instance-level. The highest level of the hierarchy (win-3xxx, in this example), rolls up the alerts for all the lower levels. In Figure 5, we see some un-cleared alerts, two high severity (red) alerts, two medium severity (orange) alerts and 2 low severity (blue) alerts.
If we click down to the instance level, we see the alerts per database on the local instance.
Figure 6 – Alerts at the instance level
We can drill down into the details of the alerts either by clicking on the number of un-cleared alerts listed for a server or instance, or by switching to the Alerts screen. In this example, I clicked on the five un-cleared alerts listed for the local instance, as shown in Figure 7.
Figure 7 – The Alerts screen showing six alerts
The severity of the alert runs from Low to Medium to High, with the latter being alerts that should be of immediate concern, as they are the ones most likely to lead to outages or major performance problems. In the ~Actions column, we can, for example, mark an alert as read, or clear an alert, although we're more likely to do that on the screen for each individual alert, as we'll discuss shortly.
Simply by clicking on each alert, we can see the full details for each alert, when it occurred, and on which database or instance, along with snapshots and summaries of resource usage and activity levels on the host machine, and SQL Server instance, at the time of the alert, as well as details of SQL Server queries and processes, or other server-level processes, which were running concurrently.
To give you a feel for the troubleshooting process, let's drill deeper into one of the three alerts shown in Figure 7, namely the Deadlock alert.
Clicking on any alert takes us to the Details tab for the alert, as shown in Figure 8.
Figure 8 – The Details tab for a Deadlock alert
At the top, for any alert, we see the time the alert was raised (and ended, if relevant). Below that we can see the current status of the alert. In this case, someone has read the alert but it has not yet been cleared. We can simply click those boxes to changes its status. We can also add useful comments, perhaps regarding resolution of the problem.
In the body of the Details tab, we see the details of the alert. Obviously, the information presented here is specific to the alert. In this case, we see details of the sessions and objects involved in the deadlock.
We can see straight away that processes 57 and 59 were engaged in a deadlock and that SQL Server chose process 59, an
UPDATE on the
Person.AddressType table, as the deadlock victim and rolled it back.
The Lock details column of the output provides details of the database and specific objects involved in the deadlock. From this, we can see that Process 57 holds an Exclusive (X) lock on the Primary Key index of the
Person.Address table and Process 59 holds an Exclusive (X) lock on the Primary Key index of the
Person.AddressType table. Each process needs to obtain an X lock to update rows in the index on which the other process already holds a lock, leading to a deadlock.
Alongside the Details tab are five other tabs that provide further useful information about the alert. The Output tab is alert-specific, but the others are common to all alerts.
Scrolling down further, we reach the Performance Data section. On the Host machine tab, SQL Monitor provides a snapshot of the values for various server-level Performance Counters, captured around the time that the alert fired.
Figure 9 – Performance metrics capture at the time the alert fired
The gray bar in each graph indicates the time that SQL Monitor fired the alert. A green bar, if there is one, indicates when the alert ended. Note that these bars represent the time that the alert fired and ended, not the exact time that the underlying queries started and ended, although they are a close approximation. An alert won't fire until its threshold is reached, and this will be some time after the query began. Similarly, the SQL Server tab provides snapshots of performance counter metrics for the SQL Server instance.
For even more information, the System processes tab lists concurrent system processes, and SQL processes / Profiler trace identifies the processes running at the time of the alert, plus the trace stack, if you have trace enabled. In this case, this tab will identify who and what was running the session that caused the deadlock (OK, I confess, it was me, in SSMS, armed with a nasty, long-running
Finally, we can see the Top queries from around the time of the alert. By default, SQL Monitor lists them in order of duration, but by clicking on the desired column (such as execution count), we can reorder the results by that column.
Clicking on the long-running query will reveal the full text of the query, its plan handle, the target database, and the query duration. Clicking on the arrow, below the query text will show the query in the context of its containing batch. This is especially useful when the query is run as part of a stored procedure, as it will give you the name of that procedure
Figure 10 – Top queries running at the time an alert fired
Armed with knowledge of the queries involved in the deadlock, we can take steps to resolve the problem, to ensure it does not recur. This might involve query tuning, adding a new index, or modifying an existing one, or some other standard mitigation for deadlocks.
When we start to work with an alert, SQL Monitor marks is automatically as Read. This means that other DBAs or data professionals looking at the Alert won't also start to troubleshoot the issue. Having resolved the problem, it is very important to mark the alert as Cleared.
We haven't got room in this article to walk through all of the 33 possible alerts, but we'll take a brief look at the information available for the two other alerts raised in Figure 7, namely the Blocked Process alert and the Long Running Query alert.
The basic troubleshooting process is similar for every alert, in that SQL Monitor will reveal both the relevant alert details, and performance data captured around the time the alert was raised, which the DBA can use to diagnose the cause. For example, a Fragmented Indexes alert will identify the indexes, their size and degree of fragmentation, a Disk Space alert will identify the space-challenged disk and summarize the size of the databases that store data on that disk, and so on.
Figure 11 shows the top portion of the Details tab for the Blocked Process alert we saw in Figure 7.
Figure 11 – Alert details for a blocked process report
The Details tab displays the time and duration of the blocking event and the database in which it occurred. More useful though is the information presented in the Processes tab, which reveals information about both the blocked process and the blocking process.
Figure 12 – Details of the blocked and blocking processes that caused a blocking alert
Process ID 59, the
UPDATE transaction against the
Person.Address table is blocked. You can see the complete T-SQL statement for the blocking process, shown as ID 57. Armed with the process ID and the start time, we can use the Rewind Time feature (covered shortly) to examine additional queries that were active at this time.
The problem may be a poorly-designed query that needs tuning, or simply an open transaction that needs to be committed or rolled back. In this example, on examining the running queries, the blocking process was identified as an open transaction, affecting a single row, which was dealt with by killing the process and letting the transaction roll back.
Another alert we saw in Figure 7 was the Long Running Query alert. As revealed by the Details pane for this alert, SQL Monitor raised it due to the same blocked query that caused the Blocked Process alert.
Figure 13 – Details of a long-running query, caused by the blocked process
It is quite common to see these two events together. By default, SQL Monitor raises a High severity Blocked Process alert after one minute of blocking, and also a Low severity Long Running Query alert when a query runs longer than one minute. However, we might prefer the long-running query event to be separated in time from the blocked process event, and we can do this easily by modifying the settings for the long-running query alert to make the threshold value shorter or longer than the blocked process alert, using the Configuration page.
In this example, if we resolve the blocking issue then we'll also resolve the long running query problem. However, more generally, we would want to examine the query details, in the accompanying performance data, find out how often the query occurred in the workload, and so on. If this is the first occurrence of the query, it might be a one-off ad hoc query that may never happen again. If, instead, we see many occurrences of this same alert for the same query, then we will want to investigate what application or user is running it, detail we can get from the SQL Processes / Profiler trace tab, as described previously.
Once again, be sure to clear each alert after you address it.
Immediately after an instance is added to SQL Monitor, it begins to collect data and report any alerts, on the Alerts screen. For important alerts, it's advisable to set up email notifications so that the relevant DBA receives immediate notification. Each email contains much of the information on the alert details page. We can configure alert emails to be sent when alerts are first raised, when they are escalated (from low to medium or high severity, for example), and when they are closed.
First, on the Email Settings page of the Configuration section, enter the default email address to which to send notifications along with the mail server settings. Alternatively, we can disable all email notifications, if they are not required.
At Configuration | Alert Settings, we can override the default notification address for individual alerts, as required, or simply disable notifications for that alert by selectingSpecify an email recipient, and then unchecking the Send Email notification to: box.
SQL Monitor's alerting feature works with PagerDuty, which allows email messages to be sent via phone and SMS. This is a third-party cloud-based application subscription service.
On the Overviews page, SQL Monitor presents various aggregations and summaries of its collected diagnostic data, offering a snapshot of the system activity and resource usage at the current time, or at a specific point in time. SQL Monitor presents relevant data at each level, as we can drill down from the server level, to the SQL Server instance and database levels, as summarized earlier in the article.
At the SQL Server instance level, SQL Monitor displays in the Performance Diagnostics section various PerfMon counter metrics for the instance, and also snapshots of data mined from Dynamic Management Views, presented as Top queries that were executing at the specific time and the Top queries captured over the past, say, 15 minutes.
We're going to focus on the wait statistics, because they offer a great way to understand what is causing the system to slow down, right now.
A frequent occurrence for a DBA is to receive a phone call informing him or her that that the server is "running slow" now, or was running slow "just before I went to lunch". With no more specifically-defined issue on which to base their investigation, a good starting point is to review the wait statistics for instance, over the reported period of slowdown.
In order to do this, generally speaking, the DBA would need to be capturing and saving regular snapshots of the
sys.dm_os_wait_stats DMV, and then comparing them to find out 'what changed' over the slow period. A tool like SQL Monitor takes care of this "heavy lifting" for us, and presents a "top waits" view of the data, as shown in Figure 14. By default, SQL Monitor sorts the waits by duration ( i.e. wait time).
Figure 14 – Current Top waits
On my instance, the top four entries show a number of tasks waiting for a long cumulative period, and only using up a certain amount of CPU (as evidenced by the comparatively low Signal wait time). The first is a relatively obscure wait type associated primarily with extended events monitoring and we can be safely ignore it.
For all of the more common waits, SQL Monitor will display helpful descriptive text. In Figure 14, the more interesting are the waits after the first, and in particular the
PAGELATCH_SH wait type. The data shows that 450,675 tasks waited for this resource (i.e. waited to acquire a shared lock on a data page that is in the buffer pool), over the previous 15 minute period.
This suggests a possible issue, so let's drill deeper by clicking on this wait type. The first point to note is that on clicking on a specific wait type, below the list of waits appears detailed information about that wait type, what often causes it and possible diagnostic routes.
Figure 15 – SQL Monitor provides detailed descriptions and troubleshooting advice for each wait type
In the main part of the screen, for the selected wait type, SQL Monitor displays any queries that experienced this type of wait over that period.
Figure 16 – Queries affected by a wait type
In this case, all the waits are coming from a single query (of course, since this is just my test instance, resource usage levels are not terribly high).
We can click on the query to reveal more information about the source of the problem, the query involved, the plan handle to pull the plan from cache, and other waits associated with the query.
Figure 17 – Further details on the query associated with a particular wait type
In this example, we have a fairly complex, recursive CTE query (in the
uspGetEmployeeManagers stored procedure in
AdventureWorks) that uses up a lot of
tempdb space. This causes the latch contention in
tempdb, and the subsequent waits cause the query to be slow running. In short, it needs to be tuned.
Sometimes the phone call you get is about a problem that occurred yesterday. While SQL Monitor maintains and displays real-time data, it also stores past data in a database, which means that we can roll back to see what was happening at a particular point in time. This is great for diagnosing problems that occurred when you were not around. By default, data is stored for 1-2 months, but this is configurable (see later).
To rewind to a previous point in time, we use the aptly-named Rewind Time feature, available at the top of any Overviews page.
Figure 18 – The Rewind Time feature
Use of the control is largely self-explanatory. We can add or subtract minutes or hours by clicking on the appropriate buttons, or click the calendar to choose a specific date and time, or simply type in a date and time (using a 24 hour clock). The orange "skip forward" arrow will return us to the current time.
SQL Monitor presents the same views into the performance and diagnostic data, for the specified time rather than current time, but the information for the databases and alerts is blank.
Figure 19 – The Overviews screen, at the SQL Server instance level, for a previous point in time
Further down the screen, we'll also find the top waits, expensive queries, and so on, for the chosen time. We can go as far back in time as we have data stored in the SQL Monitor database. Having completed analyzing the previous point in time data, re-click the 'skip forward' button to return to the current time.
We use the Analysis screen to build and display graphs of various metrics over different time periods, and to perform baseline testing. This is the go-to location for long term analysis based on the performance counter metrics collected over time by SQL Monitor. We can use this information for trending, historical comparisons, and other types of troubleshooting and reporting.
By default, the Analysis screen will display a graph showing a single metric, Machine processor time (total CPU utilization) for the server first in the list of monitored servers, over the last hour.
Figure 20 – The default Analysis Graph
The x-axis displays the time range and the y-axis the metric value; of course, the y-axis will display a variable range based on the types of metrics selected. As you can see, my system is initially under a reasonably consistent load of about 30% processor time (I have a load generation PowerShell script running), followed by some spikes and rises in load as I ran additional functions.
We can adjust the time range using one of the pre-set options, such as "Today" or "Last 24h", or we can simply use the time range and calendar options. Using the forward and back buttons will display the metric values for the same internal on the previous or next day.
Hovering the mouse pointer over the graph will reveal the exact metric value for a specific point in time.
Figure 21 – Viewing the metric value for a specific point in time
Below the graph is the selection box to allow us to add further metrics to the graph, and so compare different types of activity over the same time period. For server-level metrics, such as processor time, we can choose the server for which we wish to display the metric. For SQL Server instance- and database-level metrics, we can display them for all databases on a specific instance, or for a specific database.
Let's see how that works in more detail.
Consider a situation where the wait statistics are showing high
WRITELOG waits, at certain times, for the
tempdb database, and we want to find out the root cause.
SQL Monitor collects several PerfMon counters related to log activity for a database, so we start by displaying the Log flush waits/sec metric just for the
tempdb database, over the last two hours.
Figure 22 – Viewing a SQL Server metric for a specific database
Next, click the button Add another metric and add the Log flushes/sec metric. Then double click (As Above) for the server selection, SQL Monitor will automatically display the new metric for the same server, instance and database as the previous Log flush waits/sec metric.
Figure 23 – Plotting two metrics over the same period
It's worth noting that, if we change the server, instance or database selection in the first metric (listed in the Show: line in Figure 23), then it will also change for all subsequent metrics for which we selected As Above. It makes it very easy to compare the same set of metrics for different databases.
On the current graph, we can see that the y-axis now display no scale, since the scale for each of the two metrics is different. SQL Monitor simply scales them proportionally within the time frame within the selected time range. Remember that we can still hover over each graph to see absolute values for each metric, at a specific point in time.
In any event, what we're interested in here is not so much absolute values as corresponding patterns in behavior between different metrics. A log flush occurs, writing log records from the log cache to the physical log file, every time a transaction commits and during
CHECKPOINT operations. By plotting the two metrics together, we should see if there is any correlation, i.e. whether 'spikes' in the number of log flushes per second correlates to increased numbers of log records waiting to be written to disk. In this case, we see no correlation between spikes in the number of log flushes and spikes in the log flush waits.
Let's add a third metric to the graph, Log space used.
Figure 24 – A three-metric graph for the
Now, we see a correlation between the behavior in the Log space used and Log flush waits/sec metrics. For the
tempdb database, periodic drops in the log space used correspond closely to spikes in log flush waits.
tempdb database, SQL Server runs a
CHECKPOINT operation when the
tempdb log file reaches 70% full. Since
tempdb always operates in
SIMPLE recovery model, SQL Server truncates the log once the
CHECKPOINT is complete, removing all log records that are no longer required. Clearly, we can see that the spikes in log flush waits correspond to times when SQL Server is clearing space in the
tempdb log file. In this case, there is quite a lot of activity, but that's in keeping with the current load on the system (my test script runs lots of inserts and deletes over and again, in a loop).
As you can see, combining metrics in this fashion allows us to spot correlations in metric patterns.
While comparing various metrics over a given time period is useful, often what a DBA really wants to know, in response to hearing that a server was "running slow" at some random time, is "what changed?" What was happening on the system at that time that was different from "normal"? That's where baselines come in.
To see baselines in action, I'm going to display some metrics for a different, busier SQL Server. Once again, we start with an analysis graph showing a single metric, Machine: processor time, for the selected server, this time for a 24 hour period.
Clicking the Compare baseline button automatically adds to the graph a second set of values for the same metric, over the previous 24 hour period, therefore providing a baseline, or comparison point. In Figure 25, the dark blue line represents the current time range of 24 hours of data and the lighter blue line represents the previous 24 hours' worth of data.
Figure 25 – Comparing today to yesterday
In this example, we do see day-to-day similarities in the data, with similar spikes in CPU usage at 3AM, for example. Sometimes, we will see tighter correlations by comparing to the same day the previous week and we can do this simply by modifying the Compare With value such that the graph compares the last 24 hours with the same time period seven days earlier.
Figure 26 – Comparing today to the same day last week
Now we can see some variations, as well as interesting correlations. The spike at 3AM also occurred seven days ago, but CPU usage sustained a higher level for a period after the spike, around 20% usage compared to about 15% in the most recent 24 hours.
In this manner, we can begin to understand if the current pattern of activity on the server is normal, as compared to previous behavior.
The Extend Baseline button allows us to extend the baseline comparison a set number of previous time ranges. So, for example, if we are initially comparing a 1 hour time range to the same period the previous day, then extending the baseline by 7 time ranges will add to the graph the same 1-hour time range for the previous 7 days.
The example in Figure 27 compares the current 24 hours with the preceding 24 hours, and extends the baseline by 7 time ranges, so what's on display is basically the last weeks' worth of data collections, with each line being fainter the further back in history.
Figure 27 – Extending the baseline to cover a whole week
We can change the time ranges, or we can modify the data selected, based on the information stored in the SQL Monitor database. Obviously, we can't request data from counters that weren't collected, or for time ranges that have been purged from the SQL Monitor database.
In Figure 28, I've opted to compare the current day to the same day the previous week, and then extended the baseline by seven time ranges, so the graph displays CPU usage for the current Tuesday and the previous seven Tuesdays.
Figure 28 – Comparing CPU usage for the same 24 hour period across 7 days
In this way, we begin to get a true idea of what "normal" looks like and can spot unusual spikes and see correlations in behavior.
However, for an even clearer picture of what "normal" looks like, we can display averaged regions, for the baselines, rather than separate lines. To do this, simply click the Regions button at the top of the graph.
Figure 29 – Comparing behavior to averaged baseline regions
SQL Monitor still displays the current data as a line, so we know what happened in the current 24 hour period. The rest of the data has been averaged out and smoothed down.
Essentially, the Regions view is based on the same data, but shows the 0th, 1st, 3rd and 4th quartiles, with quite strong Loess smoothing applied and the area between 1st and 3rd quartiles (the 'mid-range' values) shaded dark blue, and the area between 0th and 1st, and 3rd and 4th (the low- and high-range values) shaded a lighter blue.
With this knowledge, we can ascertain quickly that, in this case, the last 24 hours are not abnormal. In fact, they are exceedingly normal, falling almost exclusively within the medium blue, or mid-range average values throughout the 24 hour period.
Baselines provide an excellent mechanism for making comparisons over time so that we can understand how our system is currently behaving compared to how it behaved previously. The knowledge generated by the baselines allows us to spot trends and identify outliers in order to assist in both administration and troubleshooting of our servers.
It's possible, with a little patience, to write T-SQL or PowerShell scripts to collect some or all of this data "by hand", for example from the various system views, performance counters, Dynamic Management Views, and so on. We would need to collect baseline measurements for each key metric and then capture the data into a table at regular intervals. Erin Stellato in her Capturing Baselines series on SQLServerCentral.com explains how to start collecting this data manually.
The problem however is that writing and maintaining the scripts to capture the required data, the scheduled jobs to capture it, and the tables to store the data becomes, itself, a big task and one that can detract from the more important task of analysing the data to spot problems on our SQL Server instances. A tool like SQL Monitor will remove a lot of this heavy lifting.
Introduced with SQL Server 2012, Availability Groups (AG) are now a fundamental aspect of the High Availability and Disaster Recovery (HA/DR) functionality within SQL Server.
Availability Groups are maintained at the database level, not at the server level, so they bring with them different types of monitoring functionality. Further, the HA/DR nature of AG means that you have to monitor differently on the Primary node, the currently active instance maintaining your database, and any of the Secondary nodes, the available backup instances.
SQL Monitor provides all the additional functionality necessary to monitor your AG for up time, availability, current status and performance with appropriate monitoring, alerts and metrics, on both the Primary and Secondary nodes.
When you add in an Availability Group, the Global Overview screen lists the AG along with the servers and instances contained within that AG, as shown in Figure 30.
Figure 30 – Availability Groups on the Overview screen
If you click on any of the servers or instances, you'll see the standard overview screen for that server or instance. If you click on the AG itself, you'll see a new overview, specific to the AG, as shown in Figure 31.
Figure 31 – Overview of metrics and alerts for a specific Availability Group
There's quite a lot of information displayed on this screen, so we'll break it down in more detail. On the left, you have the basic status and health of the AG service itself, the instances and the databases. You can also see any AG-related alerts that have been fired, as well as the current status of synchronization.
Figure 32 – Availability Group status and health
You can also see the rate at which logs are being transferred to the secondary replicas.
Figure 33 – Log transfer rate to secondaries
You can see various statuses including the settings for Failover including Synchronized and Automatic, Synchronized with No Data Loss, and Asynchronous allowing Data Loss. The Asynchronous connection which allows data loss is what you can see in the final line of Figure 33.
We can drill down into each of the items on this screen. Clicking on the servers and databases will open the standard overview pages for those object types. Clicking on an individual alert will open the Alerts page. The spark lines for the Redo Queue, and others, will open the Analysis page.
For an AG service, you're going to want to ensure that you monitor both the primary and the secondary instances. With that you can then be sure to look at Analysis of the metrics that shows both the load on the primary, but also how well the primary is interacting with the secondary instances as it sends across the transactions and validates their safe arrival.
When you add a cluster that's part of an Availability Group, SQL Monitor automatically detects which instances are primary, and which are secondary. All of the instances are then added.
Figure 34 – Detailed analysis of metrics for primary and secondary replicas
All the information within the Analysis screen for the AG works exactly the same as for the Analysis screen for of any other set of counters. The Analysis screen is useful here because we can add counters from the Primary and from Secondary servers, as you can see in Figure 34. This allows you to see how the different servers are behaving.
SQL Monitor 5 offers a number of new Alerts for Availability Group management, as shown in Figure 35.
Figure 35 – Availability Group-specific alerts
All this functionality will, collectively, allow you to manage your AG instances much more effectively and proactively.
After some time working with a monitoring and alerting tool such as SQL Monitor, we begin to get a good feel for the level of activity, and subsequent alerting, on our monitored servers, as well as for how much monitoring data we're collecting, and the growth rate of the central repository for storing it.
We'll start to understand that certain alerts are either unnecessary in our environment, or require different threshold settings. At the same time, we may discover that our database suffer from particular problems for which there are no pre-defined alerts.
As discussed, SQL Monitor is highly configurable, and it is very easy to change alert settings, and to add new custom metrics and associated alerts.
To customize any alert, either to turn it off or change its threshold settings, navigate to the Configuration page in SQL Monitor and select Alert settings to reveal the full list of host machine and SQL Server alerts.
To change the settings for any alert, simply click on it. Figure 36 shows the configuration page for the Long-running query alert.
Figure 36 – Customizing alerts
We can turn an alert on or off, set the thresholds for when different types of alerts will fire and, for this particular alert, specify particular SQL process names or queries that we want to exclude from the alert. For example, let's say that a once a month, a report runs that typically takes 30 minutes. Since we know this is always the case, and if there is nothing we can do about it, we can exclude it so that we won't be alerted every time it runs.
At the bottom of the screen, not shown, we can specify if and where to send email notifications for this alert, as described previously.
Creating custom metrics is a powerful feature and it is very easy to add one to the default metrics already built into SQL Monitor.
Simply navigate to Configuration | Custom metrics, and use a three page wizard to:
To start the Custom metric wizard, click Create custom metric. Figure 37 shows the top half of the screen. In this example, we're creating a custom metric alert that fires when a database file increases in size. It's a good practice not to rely on auto-growth to grow data files. Instead, we should size the data files at time of database creation to cope with the initial data load plus predicted growth over a reasonable period, and then grow the files manually, as required. With this practice in place, it's wise to set an alert to notify us when a data file grows, because it indicates that the database has grown much faster than predicted, or that someone else has grown the file manually.
Figure 37 – Defining a custom metric to monitor data files growth events
We give the metric a name, a description, and then enter the Transact-SQL code that we want to run to collect the metric data. It is a good idea to create and test this Transact-SQL code in SSMS, first, to ensure that it not only works, but that it is as lightweight as possible. We want this code to run as fast as possible, using the least possible amount of server resources.
In the bottom half of the same screen, we specify the servers, instances, and databases for which the metric should collect data, and how often.
Figure 38 – Defining instances to which the alert applies, and data collection frequency
In this example, the alert applies to all user databases on selected SQL Server instances, and will collect data once every 30 minutes.
On this screen, we also specify whether SQL Monitor should display the collected value, or calculate a rate of change between the current and previous value. This option is often needed for counters that increment continuously whenever SQL Server is restarted.
Notice near the middle of Figure 38, a button called Test metric collection…, which we can use to test that the metric works as intended. I highly recommend you perform this test to ensure that the metric returns exactly the data you expect.
On the second page of the wizard, we can create an alert associated with our metric and set the threshold values on which it will fire, and the severity of alert to raise at each threshold.
Figure 39 – Defining an alert for a custom metric
Since the purpose of the alert here is simply to notify us of any growth event on any database, Figure 39 simply specifies a High severity alert if a data file grows by 1 MB/hr. Alternatively, if we were actually interested in the growth rate, we could set different severity alerts depending on the rate.
The final step is simply to review the custom metric before creating it. If everything looks OK, click on Create metric and alert, and they will act just like any of the built-in metrics and alerts, and will begin working immediately.
Redgate maintains the http://sqlmonitormetrics.red-gate.com/ website, where users can submit, comment on and help improve a community-maintained set of custom metrics, all of which are tested thoroughly by the development team at Redgate, before posting to the site.
On the site, you'll find ready-made and tested custom metrics in categories ranging from auditing to detecting I/O issues to replication and security. If you find a metric that would be useful in your environment, simply click on Install metric to install it automatically in SQL Monitor. Likewise, if you've developed a metric you think others would find useful, submit it for consideration for inclusion on the website.
Although we can start using SQL Monitor "out of the box", most DBAs will want to perform some minor additional configuration, including:
We do all of this work via the Configuration section SQL Monitor's web interface.
A very important consideration is how much historical data to store because, over time, SQL Monitor will store a high volume of monitoring data, both host machine data and SQL Server data, as well as basic alert data and SQL trace data, if enabled.
SQL Monitor offers granular control over the length of time to retain different categories of data, defaulting to between 1-2 months for various types of machine data, 1-2 months for various types of SQL Server data, 1 month for basic alert data and 3 days for SQL trace data.
In most cases, you'll want to keep a month's worth for most types of data, which will make it easier to view trends over the typical business cycle of a month.
To customize the default data purging settings, go to Configuration | Data purging.
We can group monitored SQL Server instances into logical groups, such as "Production" and "Development", or into whatever groups make sense in your environment.
Having done this, SQL Monitor will:
To create a group, navigate to Configuration | Groups, or Overviews | Manage groups.
We can assign users to different user roles depending on their required level of access, as follows:
We create the administrator role on initial startup, and if you're the only DBA who needs to access SQL Monitor then you won't need to create additional roles. To allow others access, but with restricted privileges, simply navigate to Configuration | Manage user roles, define passwords for the other roles, and supply them to the users who require access. If you opt to use Active Directory authentication, you'll probably only create the administrator role, which will be needed if you ever need to revert to SQL Monitor authentication. Other roles will be managed through Active Directory group permissions.
Most SQL Server instances will have maintenance windows where a lot of additional server activity will occur, such as backups, rebuilding indexes, and running
DBCC CHECKDB, all of which can place an extra burden on the server's resources, and could trigger some performance-related alerts that are essentially "false positives".
To avoid this situation, we can set a maintenance window for each monitored server, which tells SQL Monitor not to trigger any performance, but not operational, alerts during this time interval, although monitoring still continues. Unfortunately, it is not currently possible to disable only specific alerts, over specific periods, or set different thresholds for alerts during those periods (such as while database backups are running at 2AM).
To set a maintenance window for the host machine, navigate to Configuration | Monitored servers, and use the Actions dropdown.
In addition to analyzing metrics directly in the SQL Monitor web interface, we can also build SQL Server Reporting Services (SSRS) reports that access the SQL Monitor database directly. There is not room in this article to discuss the topic in detail, but Redgate provides a free SSRS Pack to help in getting started, which includes ready-made SSRS reports on CPU and memory usage, database growth trends, general server properties, and more.
The main point of any monitoring effort, whether homegrown or using a dedicated tool such as SQL Monitor, is to be able to spot problems as quickly as possible, and respond to them before they affect too many people, and to avoid certain problems by predicting when they are likely to occur and taking evasive action.
Our overall goal is to troubleshoot the issues we discover and so improve the performance of the servers and the applications that use them, and to start to see a downward trend in the number of alerts raised, and optimization of resource usage across the servers.