SQL Monitor is designed for the busy DBA who has many servers to look after. The idea is to provide simple perspectives, summaries and views of activity, from the broad scope of groups or clusters of servers, all the way down to the intricacies of wait statistics associated with queries on individual databases.
With SQL Monitor 10 we’re making it easier for you to control what you monitor, expand the ways that you configure your alerting strategy, and refine the ways that you can respond to the information gathered. We’ve introduced a range of new metrics and features that we hope will make it easier to understand the root cause of an acute stress condition or instability in a server.
- Automatically configuring and tuning your alerting strategy
- Cause-and-effect Performance Monitoring
- Monitoring Tailored to Your Systems
To try out any of the new features described in this article, you can download the latest version (as a free trial, if you’re a new user) of SQL Monitor from the Redgate website.
Automatically configuring and tuning your alerting strategy
It’s so important that the right people, or the right channels, get the right alerts at the right time. I am an advocate of the need to refine your alerting strategy continuously, much as you would for any other DevOps process. Over-alerting is perhaps the most common reason why monitoring systems “fail”.
A good alerting system shouldn’t, ideally, raise any alerts where no action is possible, or that everyone knows they can safely ignore, because they are irrelevant for a given database, or at a given time. Also, any important alerts that indicate the need for development time, or investigation by the system administrators, should automatically generate tickets to schedule this work, encouraging the process of continuous, stepwise improvement of the database systems.
SQL Monitor 10 helps with all this, allowing you to:
- Configure important alerts to automatically raise tickets in IT Service Management (ITSM) systems
- Automate alert configuration via the PowerShell API
- Create alert suppression windows, such as during maintenance windows (configuration is GUI-only, at time of writing)
- Add Alert summary reports to the SQL Monitor reports page (also GUI-only).
Alert integration with IT Service Management (ITSM) systems
SQL Monitor now supports webhook notifications on alerts, meaning that they can now automatically raise tickets in service management systems such as ServiceNow, Microsoft Teams, Jira or ZenDesk. This allows work to be scheduled without the DBA having to do it manually.
Users can choose which alerts trigger tickets, such as by type or severity, and integrate it with the new time filtering features so, for example, tickets don’t get created during maintenance windows.
Simply go to the Configurations tab and the Notification settings window, and supply us with the right URL:
Alert configuration via PowerShell API
SQL Monitor does the best job it can, out of the box, of setting up a useful core set of metrics and alerts, with sensible thresholds. However, the right alerts and the right thresholds are 100% dependent on your systems. A group or class of servers may all need the same alert types with the same thresholds, but these may well be different from those for other classes of server. Also, your group of VMWare-based servers, for example, may need different thresholds than your bare-metal servers for the same set of memory-related alerts.
Configuring all this in the GUI, server-by-server, can be time consuming and it’s easy to introduce discrepancies. This alert configuration task, just like any other SQL Server management or maintenance task should be automated.
With the PowerShell API, you now write PowerShell scripts to set up the alerts on a machine in a way that is exactly in accordance with your requirements. You then use that as a model to copy all the settings to other machines, or just groups of machines. You can define which types of alert should be enabled or disabled for, say, any server in a cluster, what the thresholds should be, and who should be notified of each type of alert. You can save settings to disk for a configuration management record or install a machine from that record. You can add, update or suspend alerts individually, or across collections of machines, as required. Using these scripted techniques, you can quickly replicate alert settings across the estate and apply the same alerting strategy each time you register a new server of a given class or type.
As a simple example, imagine that we have a very good, tried and tested backup process. We know exactly which databases are in it and which are not, and no longer need the Backup Overdue alert. Navigate to the Configuration tab in SQL Monitor, and click on View example PowerShell scripts under the API section and you’ll find a Disable Backup Overdue Alert script, among others, which will disable this alert for all databases on all SQL Server instances.
Note that if you wish to suspend monitoring for a server completely for a short period, such as during a deployment, you can also do this using the PowerShell API; see the suspend machine example script.
Alert suppression windows
SQL Monitor 10 now allows teams to suppress alerts, or just their notifications, over certain periods. You can apply the same alert suppression window to multiple servers, such as in a cluster, or to just to a single node within a cluster. These alert suppression windows are not yet scriptable by API (but will be soon) so, for now, to create one, navigate to Configuration > Alert suppression in the GUI.
There will be the periods when alerts are all noise, no signal. For example, when you’re performing some scheduled system maintenance, there’s a good chance the servers will be up and down multiple times over the maintenance window, and alerts sch as “server unavailable” are nothing but a nuisance.
Even during normal operation, there may be times when it is inevitable, such as when certain jobs are running on the server, that certain metrics will exceed their configured alert thresholds. Perhaps the security team perform monthly ‘penetration tests’ that cause a lot of failed SQL Server logins. You don’t want to turn off your “failed login” alert completely, but you do want to stop the notifications over that period. The following screen shows an example of an alert suppression window for weekly Sunday maintenance:
SQL Monitor 10 also make it easier to create and share reports of recent types and severities of alert. Navigate to the Reports section, add a new tile, and choose the Alert tile type. Creating these reports is UI-only but should become scriptable soon.
You can filter on alert state, such as all currently active alerts, or specific alert types, such as recent “failed job” alerts. The alert report tile can use the report time range (last 7 days), or the time range defined for that type or status of alert, in an alert filter.
Cause-and-effect Performance Monitoring
SQL Monitor is a “proactive” monitoring tool. By tracking metrics over time, and establishing baselines, we can detect quickly whether any metric or combination of metrics is behaving unusually, and we can predict future behavior based on the trend. This allows the team to see certain problems coming, such as when there are abnormal trends in resource use, or server stress conditions are likely to escalate into a performance problem that will affect the business. This give you the time to act, before they do.
However, sudden, acute performance issues in SQL Server can have many causes, not all of which are predictable, and it is often hard to pinpoint the cause of sudden poor performance, retrospectively, especially once the phones have started ringing. SQL Monitor 10 introduces some features that we hope will make it easier to link cause and effect.
Annotating the server activity graph with specific events
“In my former life as a DBA, our team were often burned by code or configuration changes that made it to production, either not fully tested, or outside of our change management processes, or often both.” – Rodney Landrum, Managing the SQL Server Estate
In SQL Monitor 10, the PowerShell API allows you to write scripts that will send ‘messages’ to SQL Monitor, notifying it of when a deployment occurred, or an important business or maintenance process ran, when a particular application error occurred, when the server was patched, when an ETL operation ran, and so on.
These notifications appear as annotations on the timeline graph for the server, so you can see exactly when they occurred, and the impact they had on resource use, if any. Each different type, or source, of annotation is denoted with its own icon. As you hover over each icon, you’ll see further details of exactly what action occurred (the labels and arrows in red are my own additions):
Having downloaded the PowerShell module and completed the initial setup, you can add one of these annotations by running a PowerShell script that uses the
Add-Annotation cmdlet, specifying as parameters the SQL Server instance where the event occurred, when it occurred, the source of the annotation, a description, a URL to the relevant resource, and others, like this:
Add-Annotation -Target $Instance -UtcDate $date.AddMinutes(-10) -Source "SQL Update" -SpecificInfo "2017 CU19" -URL "https://support.microsoft.com/en-us/help/4535007/cumulative-update-19-for-sql-server-2017"
This should make it much simpler, in many cases, to link a sudden change in SQL Server behavior or performance to some of the common causes. By annotating a change to a server configuration setting, or application of a SQL Update, you’ll see if it has any broad impact on the performance or stability of all server applications, as well as impacts on specific databases.
If you’re deploying new features, or bug fixes, you can gauge directly whether they had any adverse effect, or the intended positive impact. SQL Monitor has marked SQL toolbelt deployments on the timelines since v7, but now you can annotate deployments using any tool, such as Octopus Deploy.
If developers regularly run ad-hoc SQL batches, such as for reporting or to import data, you could supply them with a simple snippet to include in their SQL batches that issues a
RAISERROR with a designated error/notification number. You could then set an Agent alert that fires when this notification is seen and executes a job that uses the PowerShell module to connect to SQL Monitor and add the message as an annotation to the server’s overview graph. For a full demo of this, see: Tagging SQL Server Changes in SQL Monitor.
This way, you will develop a much better understanding of the effect on the database system of each of these potentially disruptive processes, and how to improve them, or better provision resources to cope with them more gracefully.
More effective monitoring of memory usage
If SQL Server is forced to operate with insufficient memory, it can have a severe impact on performance. It will store fewer data pages in the buffer pool, and will flush cached pages to disk more frequently, to make room for newer ones. All of this will cause higher physical I/O and higher CPU, as it writes data in and out of memory.
This complex ‘interplay’ between memory, IO and CPU usage patterns sometimes makes diagnosis difficult. Therefore, SQL Monitor 10 adds a range of new counters relating to the
Manager object in order to give better insight into memory consumption, and memory management issues:
- Plan cache memory
- Connection memory
- Lock memory
- Optimizer memory
- Buffer cache memory
- Granted workspace memory
- Log pool memory
For example, it’s often useful to track the amount of dynamic memory allocated to the buffer cache, and to the plan cache, and look out for frequent, sudden drops. Again, on SQL Monitor’s Analysis page you can find explanations of each metric, plot them on analysis graphs, compare current usage patterns to the baseline, and so on.
Used in conjunction with existing metrics such as Total Server Memory, Target Server Memory and Buffer Page Life Expectancy, it should be easier to diagnose when memory pressure is the root cause of a performance problem. As with all metrics, you can also include them in reports, when you add a new tile to a report, on the Report page.
Investigating the cause of resource waits using SQL Monitor
Fairly early in the process of troubleshooting poor SQL Server query performance, you’ll probably want to examine the wait statistics for the instance, to find out the primary reasons why requests are being forced to wait, and for which resources.
Knowing, for example, that
PAGEIOLATCH_* waits are prevalent over a period of slowdown, indicating that sessions were being forced to wait to obtain a latch on a page, to read it from disk into memory, is useful. However, it does not, on its own, tell you why, and neither can you be sure that a “slow disk subsystem” is the root cause of the IO bottleneck.
Wait statistics need to be correlated with the metrics that provide specific resource usage measurements (disk transfer rates, CPU consumption, memory use), which tell you specifically where queues are forming, and the details of the processes that were running around that time. This is the real power of a tool like SQL Monitor; it pulls together into one place a lot of otherwise-disparate information, allowing us to correlate waits with activity occurring on the server at that time. In SQL Monitor 10 this is even easier, with the introduction of a Waits History graph to the Overviews page for a server.
In the following example, spikes in the
PAGEIOLATCH waits (and some other IO-related waits), in the Waits History graph in the middle correlates with spikes in wait time and disk IO use, and some “long running query” alerts, in the timeline at the top (the arrow are my embellishments).
Given that that a primary cause of
PAGEIOLATCH waits is inefficient queries reading huge numbers of pages into memory in the buffer cache, a great place to start would be to drill into the queries associated with the alerts, and the those in the Top 10 Queries tab. We can examine their query execution statistics, execution plans and so on, and look for way to reduce the IO demand through query tuning and indexing.
Monitoring Tailored to Your Systems
Over the years, SQL Monitor has steadily added support for the various servers, components and SQL Server features that are typical of a hybrid, highly available, architecture. This already includes support for monitoring Azure SQL Databases and elastic pools, as well as Windows-based and Azure-based SQL Server Failover Cluster instances.
To take this a stage further, SQL Monitor can now monitor the performance of SQL Servers running on VMWare, as well as taking on such network tasks as monitoring the use of disk space in Cluster Shared Volumes.
Monitoring SQL Servers through WinRM
In order to take on the monitoring of a broader range of network devices, SQL Monitor 10 now allows users to connect to their servers through WinRM, which is Microsoft’s implementation of the WS-Management Protocol and is a public standard for exchanging machine data remotely.
Administrators can use WinRM to connect to, and collect data from, any computer on a network that supports the protocol, simply and securely. With WinRM, you only need to open two ports on your firewall: it listens on ports 5985 (default) and 5986 (HTTPS). Either NTLM or Kerberos are used for authentication, and you can also configure WinRM to use SSL.
It is simple to Configure SQL Monitor to connect to a server using WinRM. Having ensured that port 5986 is open on the remote server, you’ll need to set the credentials for connecting to the monitored server (Configuration > Monitored servers), changing the WMI connection method to WinRM, from DCOM (Distributed Component Object Model).
That’s it. The base monitor service will now connect over WinRM, and your servers and instances will be monitored the same as any other connections that you make through more traditional means.
Monitoring SQL Servers on VMWare
If you’re running SQL Server on a VMWare box, and its performance is suffering, the task of diagnosing the root cause of the problem is no different from any other SQL Server. However, when a SQL Server is running within a VM, the way that the host machine allocates resources to the individual VMs is an extra factor that you’ll probably need to consider. This requires a few additional, ‘specialist’ metrics, tailored specifically for SQL Servers running on VMWare, and which SQL Monitor 10 now provides.
Once you’ve added a VMWare host (vCenter or ESXi), simply click on a server and scroll down to see plots of the dedicated metrics that will allow you to diagnose CPU, memory, and IO uses, on your VMWare machines:
We need these additional metrics to help us understand resource allocation, use and contention, both on the VMWare server and on the individual Virtual Machines running the SQL Server instances. Generally, the host machine’s resources are over-committed. For example, the hypervisor can overcommit CPU by allocating two vCPUs to each of ten virtual machines, even if there are only twelve physical CPUs on the host machine. Likewise, the host’s memory is likely overcommitted too, since the hypervisor can share memory among machines to cope with demand and, if necessary, recover memory from VMs that aren’t using it and redistribute it to those that need it.
This is fine generally, but if the host is too over-committed, then during busy periods it can cause resource allocation contention within the VMWare system. These specialist VMWare metrics will help you spot signs of these problems. For example:
- Guest CPU ready time – consistently high values indicate that the vCPUs for a VM are often waiting in a queue to be allocated processing time (in this example, although there are 20 vCPUs allocated, only 12 physical CPUs can ever be running tasks for each the VMs, at any time).
- Effective VM speed MHz – if values are consistently a lot lower than the theoretical maximum (host processor speed times the number of vCPUs) you’ll need to investigate why a VM is not getting its allocated “CPU power” from the hypervisor.
- Guest Memory Ballooned – If the host is under memory pressure, it will use techniques like ballooning to reclaim memory from the guest VM
- Memory Swap Out Rate – more drastically the host may use memory swapping, forcing in-memory pages on the VMs to be written to disk.
Ballooning and memory swapping will cause SQL Server to take sometimes drastic measure to reduce memory usage and can cause severe performance problems, so you’ll want to detect these problems quickly and set alerts.
Over time, as SQL Monitor establishes baselines for these metrics, you can identify if a guest virtual machine, or host server, is being overutilized (or even under-utilized). This will also help you maintain a healthy balance of resource allocation, determine appropriate “reservations” for a VM, if necessary, and so on. If a guest VM’s CPU or memory or IO resources are under-allocated, then SQL Server performance will suffer. However, if a VM is over-allocated, then it can cause problems on other VMs.
The Analysis page in SQL Monitor provides detailed descriptions of all the VMWare metrics. You can plot them on an analysis graph, which allows you to correlate their behavior with other standard machine and SQL Server metrics. This will help you see if a SQL Server instance is memory-bound (see later) or CPU-bound. This will also help you determine, quickly, if the cause of a problem is SQL Server-related (inefficient queries, poor indexing, and so on), which SQL Monitor in turn will help you diagnose, or rather a problem with resource allocation in the VMware system.
Monitoring Cluster Shared Volumes
SQL Server 2014 and later supports the deployment of a Failover Cluster Instance (FCI) with Cluster Shared Volumes (CSV), which offers several advantages over deploying SQL Server on “traditional” cluster storage. For example, with Hyper-V CSVs, you can place multiple VHDs on a CSV disk, and associate this disk with multiple VMs, each running on different nodes in the cluster. This means that multiple SQL Server instances can share the same storage. Since each node in a cluster shares read-write access to the same storage, a SQL Server failover doesn’t require dismounting and remounting of disk volumes.
Clearly, you will need to monitor disk space use in each CSV carefully, and SQL Monitor 10 now adds support for this. The new cluster shared volumes space metric, and associated alert, will track how much space is used on each CSV, and warn you if it exceeds a defined threshold. This metric is reported in the Disk Usage section of the Overview Page for every SQL Server instance in the Windows Server Failover Cluster (WSFC).
The Disk Usage graph on the Estate page also now accounts for space used in cluster shared volumes.
Somewhat hidden, but this metric is also reported on the Cluster and Availability Group Overview screen, which you can get to by clicking on the ellipsis at the top right of a dashboard tile for a node in a WSFC (or AG) and navigating to the cluster machine.
It isn’t every development team that glows with pride when you comment that the tool looks much like it always did, but with SQL Monitor this is intentional. The SQL Monitor team must tread carefully when adding features, to maintain the design philosophy of keeping things as simple as possible for the user. Each feature must be there when you need it, and placed where intuition tells you it should be, but in the meantime, it mustn’t get in the way.
Without compromising SQL Monitor’s ease of use, they’ve been developing the connectivity to monitor the whole range of SQL Servers and cover even the darker corners of your SQL Server and network architecture. The aim is to ensure the right people are alerted to problems at the right time, and to help the team understand and diagnose the cause of sudden stress conditions and blockages, simply, wherever they occur in your SQL Server estate.