Back in 2012, I was transitioning into the role of ‘Director of Data Services’, at the Fortune 500 insurance organization where I worked. It wasn’t that easy. Previously, as a DBA, I’d spent time developing my own database monitoring and documentation solutions. Now, I needed to alter my perspective to take in the broad landscape of all the database systems we maintained. I needed to seek out the ‘best of breed’ SQL Server monitoring tool; the one that could help me ensure the uptime and performance of the whole range of data services we supported, across the enterprise.
Many companies like mine were getting their first foothold into “the Cloud”, even if it was to test out a development environment. Virtualization for SQL Server was becoming commonplace. Even so, the database monitoring and documentation solution that my team and I had crafted focused mainly on monitoring the infrastructure that I was used to, which was on-premises and was managed locally by a team of skilled and dependable DBAs. The broad view across the organization wasn’t like that and required a different solution.
Six years later, I’m now a database consultant. Data is still data, and I am still fortunate enough to be working with it daily, but a lot else has changed. I work with many different customers whose SQL estates are now spread across a range of hosting environments, including on-premises data centers, hosted facilities, Microsoft Azure and Amazon AWS. The logistics of data connectivity have changed and hybridized. Data services have multiplied; if you are a DBA in 2018 you will more than likely be familiar with terms like Data Lake, NoSQL, Data Factory, Containers, Machine Learning and Power BI. Many of these systems have their own monitoring as part of the service but, as a whole, it is still a challenge to see the big picture for all services combined.
So, what features of a monitoring solution make it appropriate for covering a polyglot estate of data services? Every feature I identified back in 2012, when I wrote The Strategic Value of Monitoring SQL Servers, still holds true. More than ever, I need the tool to be ‘low footprint’, in both its impact and its demand on services, and non-intrusive. I still need precise, configurable alerting, and it’s become even more important that I can adapt the metrics, and associated alerts, as required for specific business processes that are supported, or the data that is stored. However, I’d also add some new features to the list, essential to managing the uptime, performance and security of a growing number of database servers, spread across an ever-diversifying range of data platforms.
Support for Categories and Groups of Servers
Most of the organizations that do internal database development have created processes and environments that support their development and testing activities. With several teams working on different applications, there could be specific development, QA, staging and production servers for each application.
Production is always “king”. The other environments must be monitored but have different requirements. One would not, for example, expect to be woken one night from a deep dream of peace to deal with a runaway query on a Development server.
There is no one-size-fits-all monitoring and alerting strategy for a modern SQL Server estate. Of course, the tool should collect a sensible, base set of metrics and alerts for all servers. Beyond that, however, I need the ability to group servers by application, and then further by their specific usage, so that a single set of rules and alerts can be applied to each group. This makes it easy to manage the monitoring strategy across the estate, and to adapt it according to, for example, the tier-level of support required for a group of servers.
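As a rough sketch of how this layered strategy might work, the following resolves a server's effective alert rules by applying group-specific overrides on top of a base rule set. All the names here (groups, metrics, thresholds) are illustrative, not any particular tool's API:

```python
# Base rules apply to every monitored server in the estate.
BASE_RULES = {"cpu_pct": 90, "disk_free_gb": 10, "blocked_sessions": 5}

# Per-group overrides, keyed by (application, environment). A production
# server gets tighter thresholds and on-call paging; a development server
# should never wake anyone from a deep dream of peace.
GROUP_RULES = {
    ("sales-app", "production"): {"cpu_pct": 80, "page_on_call": True},
    ("sales-app", "development"): {"page_on_call": False},
}

def effective_rules(application, environment):
    """Start from the base rules; the group's overrides win."""
    rules = dict(BASE_RULES)
    rules.update(GROUP_RULES.get((application, environment), {}))
    return rules

prod_rules = effective_rules("sales-app", "production")
dev_rules = effective_rules("sales-app", "development")
```

The point of the layering is that adding a server to a group is all the configuration it needs; tier-level support differences live in one place.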
Cost-based Computing: Counting the Cost
When companies make the switch to a service-based Cloud offering, they are going to be billed monthly for the services they use, beyond the basic database usage tier. They suddenly need to be aware of fluctuations in cost as service demands change. They will often have applications that they support on premises, as well as those that are hosted in a Cloud service such as Azure. Ideally, they would want a single tool that monitors both the usage patterns and the cost of the resources.
Fortunately, both AWS and Azure have billing and cost APIs that can be used to pull down and store this information for analysis. It’s possible to do this as a separate, manual process, but ideally it would be done automatically by the monitoring tool. Having daily aggregations of usage and cost for the entire estate, and to drill down into the detail to find the high-cost services, will help DBAs and managers predict future costs and make appropriate adjustments.
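To sketch the aggregation step, here is a minimal example that takes billing records of the shape a cloud cost API might return (the record fields are my own invention for illustration) and produces the daily totals and high-cost services described above:

```python
from collections import defaultdict

# Illustrative records, roughly the shape a billing/cost API returns
# after a daily pull; field names here are hypothetical.
billing_records = [
    {"date": "2018-06-01", "service": "SQL Database", "cost": 41.20},
    {"date": "2018-06-01", "service": "Blob Storage", "cost": 3.75},
    {"date": "2018-06-02", "service": "SQL Database", "cost": 44.10},
]

def daily_cost_by_service(records):
    """Aggregate raw billing records into (date, service) -> total cost."""
    totals = defaultdict(float)
    for r in records:
        totals[(r["date"], r["service"])] += r["cost"]
    return dict(totals)

def top_services(records, n=3):
    """Rank services by total spend, so the high-cost ones stand out."""
    by_service = defaultdict(float)
    for r in records:
        by_service[r["service"]] += r["cost"]
    return sorted(by_service, key=by_service.get, reverse=True)[:n]

daily = daily_cost_by_service(billing_records)
ranked = top_services(billing_records)
```

In a real monitoring tool, the records would arrive automatically from the billing API on a schedule, and the aggregates would feed the drill-down views and cost forecasts.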
Synthetic Workloads and Query Baselines
With a growing server estate to manage, the administration team are required to be “proactive” in their approach to the twin objectives of ensuring uptime and providing database systems that perform well. Their oft-quoted frame of reference is that they need to know that there is a problem before the user does. In some respects, a standard monitoring solution makes this proactive approach easier to achieve. It is easy, for example, to monitor TempDB space usage and predict when the server will run out of space. If I’m alerted to runaway growth, I can identify the offending query and, with business sign-off, kill it.
In reality, of course, it’s impossible to be aware of every problem before the user does. There are just too many variables that can cause applications to slow to a crawl or even bring the SQL Server down altogether. It’s hard to predict, for example, that a stored procedure will suddenly start running 10 to 20 times slower than normal, at 2.35PM tomorrow.
However, there are several ways a modern SQL monitoring tool can help the team manage a larger number of servers, while still minimizing nasty surprises. First, it needs to establish baselines that define the normal patterns of behavior for each metric, and to alert when significant deviations are detected. When this happens, it ought to lead the user towards a possible cause, such as by tapping into the Query Store to find out if it was caused by a change in execution plan.
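The baseline idea can be reduced to a simple statistical check: treat the recent history of a metric as "normal", and alert when a new reading lands far outside it. This is only a sketch of the principle; a real tool would account for time-of-day and day-of-week patterns too:

```python
import statistics

def deviates(history, current, k=3.0):
    """True if current is more than k standard deviations from the baseline
    established by the recent history of this metric."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > k * stdev

# A stored procedure that normally runs in about 200 ms suddenly
# takes 2 seconds -- a clear deviation worth an alert:
history_ms = [195, 210, 205, 190, 200, 208, 199, 203]
runaway = deviates(history_ms, 2000)
normal = deviates(history_ms, 205)
```

When such a deviation fires, the tool would then point the user at likely causes, for example by querying the Query Store for a recent execution plan change for that procedure.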
As the server estate grows, so more of its administration will be done by staff who are skilled but not database specialists. Therefore, it’s important that this data is presented in a simple, graphical format, so that any user can understand what has changed, and the possible causes.
As well as giving warning of specific problems, the monitoring tool ought to make it easy to get a measure of how well the overall system is performing, via support for synthetic workloads that we can use to verify query performance. This could work almost like an application “speedo”. For example, I’d like to be able to create an application-based performance metric, such as to measure the time taken from order submission to order confirmation. One or more queries could be developed to run at a set interval, and if a duration threshold is met, the monitoring solution would trigger an alert. The alert details should include the current state of resource usage, any blocking and session counts along with the duration of the query in question.
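The "speedo" check above might look something like the following sketch: run the synthetic query on a schedule, time it, and raise an alert carrying the surrounding context if it breaches the threshold. The `run_query` and `gather_context` callables stand in for real monitoring-tool hooks and are hypothetical:

```python
import time

def check_synthetic_workload(run_query, threshold_s, gather_context):
    """Time one run of the synthetic workload; if it exceeds the threshold,
    return an alert enriched with the current server context."""
    start = time.perf_counter()
    run_query()
    duration = time.perf_counter() - start
    if duration > threshold_s:
        return {"alert": True, "duration_s": duration, **gather_context()}
    return {"alert": False, "duration_s": duration}

# Simulate the order-submission query taking ~50 ms against a 10 ms threshold:
result = check_synthetic_workload(
    run_query=lambda: time.sleep(0.05),
    threshold_s=0.01,
    gather_context=lambda: {"blocked_sessions": 0, "active_sessions": 12},
)
```

Because the alert carries resource usage, blocking and session counts alongside the duration, the first responder starts with a snapshot of the server at the moment the workload slowed down.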
Automated Performance Tuning
Monitoring systems aren’t really supposed to resolve problems, just report them. The idea of moving beyond diagnosis to cure is one that alarms most DBAs. However, as SQL Server estates grow, so does the desire for some of the very basic tuning to be automated by the monitoring tool.
For example, I see no harm in a monitoring solution that can identify poorly performing queries and add a missing index to resolve the problem. After all, this service already exists, at least for Azure SQL Databases. It will identify, and create, indexes that it detects will improve query performance. Having used this service myself in the past, I am comfortable with letting a tool perform this work for me, with the obvious stipulation that it will remove the index if it proves to have no beneficial impact.
A SQL Server monitoring tool ought to make a similar service available to all databases. The feature would be configurable to only report on the missing indexes initially and, if approved, create the index at an appropriate off-peak time, which it will have also identified from its knowledge of resource usage patterns throughout the day. Of course, the DBA will still maintain full control over the systems and can override and roll back any decision the service makes.
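The report-first workflow described above can be sketched as a small state machine, in which nothing is created without sign-off and anything that is created can be rolled back. The class and method names are purely illustrative:

```python
class IndexRecommendation:
    """A missing-index suggestion that moves through a DBA-controlled
    lifecycle: reported -> approved -> created (-> rolled_back)."""

    def __init__(self, table, columns):
        self.table, self.columns = table, columns
        self.status = "reported"          # the tool only reports at first

    def approve(self):
        self.status = "approved"          # explicit DBA sign-off

    def create_at_offpeak(self):
        if self.status == "approved":     # never act without approval
            self.status = "created"

    def rollback(self):
        if self.status == "created":      # undo if it proved unhelpful
            self.status = "rolled_back"

rec = IndexRecommendation("dbo.Orders", ["CustomerID", "OrderDate"])
rec.create_at_offpeak()   # ignored: not yet approved
status_before_approval = rec.status
rec.approve()
rec.create_at_offpeak()   # now permitted, scheduled off-peak
```

The key design point is that the automation never jumps the approval gate, which is what keeps the DBA in full control.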
Alerting on Code or Configuration Changes
In my former life as a DBA, our team were often burned by code or configuration changes that made it to production, either not fully tested, or outside of our change management processes, or often both. This should not be allowed to happen, but I have seen it often enough to know that it is something I would like monitored in real time.
I really want to know when changes happen, what changed and who did it. I should be able to add to the tool’s configuration settings, both the times of the scheduled deployment windows, and the users that have the authority to make changes. That way, the next time an unauthorized user does an ad-hoc alteration of a stored procedure, an alert will fire, and that employee will be fired!
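The authorization check itself is simple; something like the following sketch, where a change is flagged when it falls outside a scheduled deployment window or was made by a user who is not on the approved list. The user names and window hours are invented for illustration:

```python
from datetime import datetime

AUTHORIZED_USERS = {"deploy_svc", "dba_team"}
DEPLOYMENT_WINDOWS = [(22, 23)]  # (start_hour, end_hour), local time

def is_unauthorized(user, changed_at):
    """Flag a change made by an unapproved user, or outside every
    scheduled deployment window."""
    in_window = any(start <= changed_at.hour < end
                    for start, end in DEPLOYMENT_WINDOWS)
    return user not in AUTHORIZED_USERS or not in_window

# An ad-hoc ALTER PROCEDURE at 2:35 PM by a developer fires the alert:
rogue_change = is_unauthorized("dev_jane", datetime(2018, 6, 1, 14, 35))
# The deployment account, inside the 10 PM window, does not:
sanctioned_change = is_unauthorized("deploy_svc", datetime(2018, 6, 1, 22, 10))
```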
Of course, I am being too harsh. However, I do want a monitoring tool that makes it easy to connect events such as authorized or unauthorized deployments directly to their consequences for the performance and stability of the server. Or, to put it more bluntly, it would be cool to be able to point the finger and have the monitoring system back me up.
The Best Feature Ever: Monitoring Migratory Data
As DBAs we know that data migrates around the enterprise, like herds of wildebeests across the savannas. It moves about from one system to another, being shaped and re-shaped, aggregated, cleansed, de-duped and otherwise exhaustively transformed. Sometimes it undergoes this amazing change every night, or even every minute. It may be an Excel file one minute, and a Power BI graph the next.
If at any time during this astonishing feat, a transformation process breaks, then other processes downstream are going to be very upset that they did not get their turn to play with the data and they will complain, in turn, by shutting down, throwing error tantrums and breaking other downstream processes that want the data.
End-to-end, this would make a very complex dependency diagram, and that is what I want the monitoring tool to provide, in visual form. I would like an integrated, interactive and configurable dependency map of all my data flows, from one system to another, and to be able to see when any one breaks, or, better still, when it is about to break.
Furthermore, I would like to be able to predict the impact to processes later in the workflow. For example, what if a file copy process that moves 60 Gigabytes of Excel files fails because the drives ran out of space? How much time does it take to recoup that space and restart that process? How long can the business expect to wait for those reports that were dependent on that data?
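Underneath the interactive map, this is a reachability question on a dependency graph: when one process breaks, everything downstream of it is at risk. A minimal sketch, with a made-up set of data flows, might look like this:

```python
from collections import deque

# Hypothetical data-flow graph: each process maps to the processes
# that consume its output.
FLOWS = {
    "file_copy": ["etl_load"],
    "etl_load": ["aggregate", "dedupe"],
    "aggregate": ["powerbi_report"],
    "dedupe": [],
    "powerbi_report": [],
}

def downstream_impact(failed, flows):
    """Breadth-first walk of everything reachable from the failed process,
    i.e. every downstream consumer that will miss its data."""
    impacted, queue = set(), deque(flows.get(failed, []))
    while queue:
        node = queue.popleft()
        if node not in impacted:
            impacted.add(node)
            queue.extend(flows.get(node, []))
    return impacted

# If the 60 GB file copy fails, which reports will be late?
impacted = downstream_impact("file_copy", FLOWS)
```

Attach estimated run times to each node and you can also answer the business question: how long until those dependent reports arrive once the copy is restarted.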
I know that everyone has their own ideas for what makes for an ideal monitoring solution for their SQL Server estates. I’ve outlined some of my ideas for how a monitoring tool can help us monitor, analyze and predict resource usage, including costs, across a growing and diversifying estate, and to help the organization to connect server resource usage and error conditions directly to their impact on business processes.
However, this list only scratches the surface of what is possible with an enterprise database monitoring solution. Six to ten years from now, I predict there will not even be a need to monitor data estates, or whatever they will be called by then, because they will all be self-healing and living in a server-less, containerized world. I suspect that the good old days will be recalled with nostalgic sighs by old DBAs.
I really hope, though, that there is still a role for the expert, because I plan on working with, and monitoring, SQL Server until I, in turn, need to be monitored for slow or failing internal processes.