17 July 2018
17 July 2018

Scaling SQL Monitor to Large SQL Server Estates

Most organizations are finding that the size and number of databases that need to be monitored is increasing, unlike the number of people available to do the work. Now that it is possible for the organization to scale out to cloud-hosted databases, the constraints of the available infrastructure have ceased to govern the growth of the database estate.

SQL Monitor is designed to help you to cope with the demands of bigger databases and higher numbers of monitored SQL servers, because it supports multiple base monitors, and one base monitor will cope with up to  200 servers, with minimal overhead. Even for very large server estates, spread across different networks, the administration team still retain a simple picture of the overall health, performance characteristics, and security of the estate. This ensures they can can still identify bottlenecks, check stress-points before they become problems, and assign priorities to problems.

Database problems rarely appear ‘out of a blue sky’. They generally make themselves known as symptoms, sometimes so slight as to be almost unnoticeable. Any good monitoring system can, of course, diagnose problems and alert the Ops staff, but can also analyze trends to predict future problems. This allows the team to shift from ‘firefighting’ issues as they occur, to being able to fix issues before they become the sort of problems that cause downtime, and the resulting ‘unplanned maintenance work’.

Following along

If you’ve not yet installed SQL Monitor, you can take a tour of the SQL Monitor web interface for SQLServerCentral.com, at http://monitor.red-gate.com. This is live-monitoring the SQLServerCentral.com SQL Server estate, issues and all, and you’ll be able to see in action all the features I describe in this article.

Lightweight monitoring

While the size and number of databases has grown, the size of the operations team generally hasn’t; the same team is expected to somehow scale their existing monitoring solution to cope with the extra demand. This is unfortunate for teams who’ve been using a combination of manual monitoring and custom-built automation, because these tools are often inherently high-overhead. They usually install intrusive Agents on the server, run frequent, complex data collection queries, set trace flags to capture ‘verbose output’, request ‘specialist’ metrics that are complex to collect.

By contrast, SQL Monitor installs no agents and collects just a carefully curated set of diagnostics data for all the common causes of SQL Server-related problems. It resists the temptation to gather large volumes of data of only marginal value, that will simply overwhelm the team, at scale. Instead, it makes it very easy to add custom metrics to suite the requirements of each server and application.

This means that it a single monitoring service can collect and process monitoring data for big databases, and many servers, without adding appreciable overhead.

Distributed Monitoring

Since SQL Monitor data collection is designed to be low-impact, you can add to the list of monitored databases without fear of overloading the system, even when the servers are in different data centers, or on different networks with distinct security protocols.

SQL Monitor supports distributed monitoring via its multiple base monitor feature. Essentially, it means that it can support multiple base monitor stations to collect and process the data, one on each network domain. This will allow you to add servers to your estate, at each location, without needing to open ports to allows access to a remote monitoring service. This allows the team to scale out without introducing bandwidth, security and connection-related issues.

The monitoring data for each location can be sent to a single, web-based monitoring interface, so that the team retains a single view into what is happening across the entire estate. Each base monitor, in each of the three ‘domains’ shown in Figure 1, will typically scale to serve up to 200 monitored servers.

Distributed Monitoring with SQL Monitor

Figure 1: Distributed Monitoring with SQL Monitor

Monitor Cloud-based instances and VMWare

SQL Monitor makes it easy to add and monitor databases or servers running in the cloud, or on VMWare. It is as easy as adding any on-premise database server. It can also monitor Azure SQL Database, Azure SQL Managed Instances and Amazon EC2.

As an illustration of monitoring a cloud-based instance, Figure 2 shows the /Overviews screen for an Azure SQL Database, showing a recurring Deadlock event on the activity graph, and below it all the usual resource metrics, top queries and so on.

Monitoring Azure SQL Database

Figure 2: Monitoring Azure SQL Database

If you have servers running on VMWare, then simply enter details of the vCenter, or the ESXi host, if you’re running a standalone host, and SQL Monitor will automatically detect any SQL Server instances controlled by it.

Adding and Grouping Servers

SQL Monitor makes it very easy to import new SQL Servers into its care, regardless of platform, network or location. From the /Configuration page of SQL Monitor web interface, you can either add SQL Server instances individually, using +Add SQL Server, or you can bulk import a set of SQL Servers, on a given network, from a tab-separated text file.

Importing SQL Servers

Figure 3: Importing SQL Servers

SQL Monitor also makes it easy to categorize your servers into Groups that share a common set of properties or requirements for that group of SQL Server instances. For example, you might have common requirements for production servers, staging or test or development servers.

This makes management much easier, especially as the estate grows, since you’ll be able to filter the various views and summaries to show just one group, and you can also customize the alerts and settings, as required for that group.

Navigate to SQL Monitor’s Configuration page (http://monitor.red-gate.com/configuration, or simply <Your_SQM_URL>/configuration, if you’re following along) to create your groups. Figure 4 shows the group used on sqlservercentral.com.

Creating Server Groups in SQL Monitor

Figure 4: Creating Server Groups in SQL Monitor

You can drag and drop existing servers into this new group, or you can add servers to the group manually from /Configuration/Monitored-Servers.

You might choose to group your servers differently. You might decide to group according to the tier-level of support required by the business. Or, you might want to use the Grouping mechanism so that your server monitoring strategy reflects your organization’s data classification and protection policies. Having reviewed and classified the data you collect and store, you’ll know which servers and databases, across your estate, store commercially-sensitive data (CSD) or personally-identifiable information (PII), and so on, and which extra protection and monitoring mechanisms are required in each case.

A global dashboard for the server estate

If you’re a small team who must monitor a large estate, you need a monitoring dashboard that provides an easy-to-digest picture of activity and any issues, across all your monitored servers. SQL Monitor’s Overview page, or Global Dashboard is the first screen you’ll see on starting up the tool, and it’s designed to provide this clear, bird’s-eye view, regardless of the size of your server estate.

It uses a tile-based display system in which each tile shows a monitored SQL Server instance or monitored Azure SQL Database. It displays servers by Group (displayed on the left) and a traffic-light color coding system helps you spot issues, easily, as they happen in real time. If an issue arises on one of your instances that might compromise data integrity (such as a corruption issue), availability (e.g. an Availability Group isn’t sync’ing), or recoverability (e.g. failed backup job), you’ll know about it instantly because SQL Monitor will raise a high-level alert and that instance’s tile will turn red. If important but non-critical warnings arise, the instance will be colored amber. Completely healthy servers will be green.

The global dashboard for your server estate

Figure 5: The global dashboard for your server estate

You can filter this view, for example viewing only instances on which critical alerts have been raised, or just the instances in a group, and you can order the instances by Severity status, or just by name. Figure 6 filters to show only the servers in the “PII” group. Medium level alerts, such as the Page verification or Job Failing alerts will require immediate attention, on a server containing sensitive or personal data.

Alerts raised on servers containing PII

Figure 6: Alerts raised on servers containing PII

The button at the top-right of Figure 6 activates the “small tile” view, which offers a useful “bird’s eye” view of the large estates.

The bird's eye view of the SQL Server estate

Figure 7: The bird’s eye view of the estate

Estate-wide views and projection graphs

To move away from ‘firefighting’ and towards proactive optimization, your monitoring tool must allow you to predict the future value of a metric, based on the trend. If you are, for example, measuring changes on disk space usage over the entire estate, this will allow you to predict how much space you are likely to need in the months ahead.

These estate-wide views and projections are a new feature of SQL Monitor, introduced in v8, and you can expect more to be added over time.

Disk capacity projection

Figure 8 shows the data growth projection for the sqlservercentral.com SQL Server estate (/Estate/Disks), for disk capacity planning. By identifying trends in disk space usage, you can predict the need for additional resources before they ever become a problem.

Estate-wide disk capacity projection

Figure 8: Estate-wide disk capacity projection

The top chart shows the total data growth projection of the estate, over the coming year, and the table below lists data for individual disks, starting with the one predicated to fill quickest.

The accuracy of these projections depends on the period for which you store historical monitoring data.

Versions estate view

Any DBA needs to apply all necessary OS and SQL Server patches to ensure that their servers, and the data they store, are continuously protected. The Versions estate wide view (/Estate/versions) summarizes the versions and editions of SQL Server and Windows that are deployed on each of your instances, along with service pack information. When you’re monitoring a sizable number of servers, this will help you identify quickly any machines that aren’t up-to-date.

SQL Server versions, editions and patching levels

Figure 9: SQL Server versions, editions and patching levels

In Figure 9, you can see, for example, a SQL Server 2014 instance which has not been updated with service packs since the RTM release. The Service Pack 2 is available, but not yet applied, and it contains several very important security patches, including one that fixes an information disclosure vulnerability.

Diagnostic data in digestible form

DBAs love deep levels of detail, and with SQL Monitor you can drill into the detail as far as you want, right down, for example, to the execution plans, and any associated wait events, for individual queries. However, this must never be at the cost keeping a weather-eye on the whole system. Whatever the level of detail, SQL Monitor continues to collect and present the diagnostic data, so the overall monitoring strategy scales, without overwhelming the team with detail, or noise.

SQL Monitor works hard to make the volume of diagnostic data ‘digestible’ even for very large numbers of servers and databases, with various visualizations, roll ups, groupings, and summaries of the data, to help your top spot issues quickly, and focus your tuning efforts profitably.

Let’s say a server’s CPU is “pegged”. SQL Monitor’s /Overviews page will immediately reveal the top 10 server processes and user processes, by CPU, over that period. Which processes hogged most CPU in that time slice. Do you recognize them? What databases were they accessing? In Figure 10, we can see several user-processes, what they were doing, and what database they were accessing. We know immediately where to drill in to find out what’s going on.

Top 10 server and user processes

Figure 10: Top 10 server and user processes

Similarly, you’ll see quick views of the expensive queries running over that period (up to the top 50), top waits types experienced on the server and details of any blocking processes.

As you drill down, you’ll be able to view, for example, the Query History, i.e. how an expensive query has been performing over time, with any plan changes that could have affected query performance highlighted. You can also view execution plans and detailed performance metrics for individual queries. If blocking is occurring, you’ll see clear visualizations of the blocking chain.

From these high-level representations and summaries, you can drill into the details quickly and intuitively. This will give the team a fast response time to the issues that are raised, even at scale.

What changed? Analysis and baseline comparison

As the database estate grows, it is more and more important to quickly see if anything changed at around the time an issue was reported. I’ve often spoken to long-time DBAs who recall fondly the days when they both built and then maintained a small handful of SQL Servers. They would know each of these machines well, and spot in the blink of an eye, any resource usage pattern that deviated from ‘normal’.

This is impossible for a large and diverse server estate, so you need your monitoring tool to allow you to bridge the familiarity gap. If a user reports that the performance on their application was “slower than normal” that morning, your monitoring tool must allow you to answer several key questions very quickly. Was it slow? And if so, what changed? Were there any unusual patterns in CPU, IO or memory use or that period? What caused it? Was the number of user connections also abnormally high over the same period? Was there a deployment, planned or otherwise? Did someone in operations change a server setting?

These are hard questions to answer, unless you know what “normal” looks like; in other words, you can compare the resource usage patterns over the period the issue occurred, with baselines for the same period yesterday, for example, or for previous 7 equivalent periods.

With SQL Monitor, you can use the /Analysis/Graphs screen to build and display graphs of various metrics over different time periods, and then the Compare baseline and Extend Baseline features to build up a picture of whether the server was experiencing unusual activity levels, or whether the resource spike is part of worrying trend, or whether it’s just “normal” behavior for that set of metrics during that period.

sql monitor baselines

Figure 11: Using baselines

These features make it much easier to understand the impact of a deployment, either of a security patch, or of a set of application or database changes. On the server overview graphs, SQL Monitor will also mark the occurrence of database deployments, although currently only those made by other tools in the SQL Toolbelt.

Scalable alerting

A monitoring system can easily break down when so many alerts are sent that the recipients are overwhelmed by them. As the estate grows, and alerts flood in, the team instinctively ‘protect’ themselves, by arranging that these alerts are diverted to a separate folder rather than going to their inbox. Of course, once that happens, the alerts tend to be quietly ignored.

A monitoring system can suffer one of two faults with alerts: on one hand it can be over-configured, raising too many alerts, and overwhelming the users with information. On the other hand, it can be under-implemented, missing important alerts, or raising alerts that don’t provide enough useful information to pinpoint the problem quickly and act.

With SQL Monitor you can start small, covering just the essential alerts for your SQL Server instances, along with built-in alert thresholds that are easily configured. By using granular tuning, you can tweak the alerting strategy for different groups, servers, instances, databases, or even down to an individual job level. Each alert provides all the diagnostic data you need in order to act.

By grouping alerts, you can make the management of alerts much easier, when expanding the system. Fig 12 shows the Alert Inbox for a SQL Server instance, where rather than needing to review 438 individual alerts in the inbox, SQL Monitor has grouped them, so we can see recurrences of the same problem, such as the same long running query, or the same blocking chain.

Grouped alerts

Figure 12: Grouped alerts

SQL Monitor aims to make “just enough noise” that the right person or team is made aware of the issue, promptly, and can act on the information provided with the alert. Integration via SNMP into tools such as Slack, SCOM or other ticketing systems makes this easier, as well as offering visibility into certain issues to the wider business or IT teams.

Easily adapt the monitoring strategy

Teams that use SQL Monitor soon get to understand what metrics expose most effectively the issues that commonly affect their systems, and what constitutes sensible threshold levels for any alerts. SQL Monitor then allows them to adapt their monitoring and alerting strategy to cater for specific large estate requirements regarding security, performance, availability and so on. With SQL Monitor, you can create a custom metric using any T-SQL command. To get you started, the sqlmonitormetrics.red-gate.com website provides some custom metrics contributed by users, as well as by the SQL Monitor development teams.

For example, if you need to implement different security monitoring for certain servers, according to the data they store, you might choose to use one of the custom metrics in the GDPR, or Security or Auditing categories. On these servers, you should know exactly the number of Principals with sysadmin login, and set an alert if the count goes above this number. For servers that require strict auditing of activity, you can use a custom metric that will query the resulting audit files and return a count of the number of queries against the classified table, and you’ll be able to set an alert if there are sudden spikes in this number. You can create metrics that mine information from the default trace to monitor for events such as a login being added to a server role, or a database user being added, or from the SQL Server error log to track the number of Failed SQL Server logins to a server.

Summary

The relative ease with which new cloud-based, containerized or virtual machine-based SQL Servers can be provisioned means that many estates are growing quickly. DBAs are increasingly finding that they can only manage their workload with the assistance of automated monitoring of the whole range of instances and databases in their care. One database-monitoring system must fit all and be expandable to monitor a burgeoning estate of databases.

SQL Monitor has evolved to take on the drudgery of collecting the essential metrics, scanning logs, establishing baselines and detecting trends, leaving the DBA to do just the skilled work. It is lightweight in the burden it puts on what is monitored, and it can support multiple base stations to collect and process the data, so it can meet increasing or fluctuating demands for operational work to support a large SQL Server estate.

Share this post.

Share on FacebookShare on Google+Share on LinkedInTweet about this on Twitter

You may also like

  • Event

    Microsoft Future Decoded

    Microsoft Future Decoded is your essential guide to disruptive technology and how it will impact people and businesses, today and tomorrow. Back for a fourth year, Microsoft Future Decoded will focus on current and emerging trends at the cross over of business and technology. With cloud, big data and artificial intelligence already rewriting the rules

  • Article

    Unearthing Bad Deployments with SQL Monitor and Redgate's Database DevOps Tools

    Sudden performance issues in SQL Server can have many causes, ranging all the way from malfunctioning hardware, through to simple misconfiguration, or perhaps just end users doing things they shouldn’t. But one particularly common culprit is when deployments go wrong: I don’t know a single DBA who hasn’t been burned by a bad release. To

  • Webinar

    What’s the true cost of manual monitoring?

    While some companies adopt a manual approach to monitoring, high performing organizations are using tools to proactively manage their estate, saving time and reducing errors. This webinar discusses the 5 drawbacks of manual monitoring and the steps you can take to overcome them.

  • Webinar

    How to monitor hybrid and cloud-based SQL Server estates

    As organizations move to hybrid or cloud estates, the perception is that monitoring those servers gets more complicated, particularly if you have different servers on different platforms. In this webinar we show you that actually it is quite easy to keep track of your databases and servers, regardless of where they are hosted, thanks to SQL Monitor.

  • University

    Take the SQL Monitor course

    This course takes you from installation and configuration, all the way up to getting the most out of the advanced features in SQL Monitor to help you proactively monitor your SQL Server estate. Learn how to explore in-depth issues in your environment, run an analysis, manage alerts, create custom metrics, and more.

  • Forums

    SQL Monitor Forum

    Real-time SQL Server performance monitoring, with alerts and diagnostics