Production database systems generally go wrong at inconvenient times. If you are a member of a team that has agreed to keep an eye on such a system, according to the terms set out in a stringent service-level agreement, then you are part of the response, 24 hours a day. You will need to fix the problem within the agreed timeframe, regardless of whether things went wrong at 2 AM, or while you’re on the beach with your family.
This means that you cannot rely only on exploratory database monitoring for a production system. Instead, you need an alerting mechanism that feeds into a ‘tiered response’ workflow. It will start with an immediate response by the on-duty Ops staff, in the hope of an immediate fix, such as replacing a broken router. If this is not possible, the workflow escalates to an on-call team who will establish the nature of the problem and likely cause. If it’s not immediately fixable, it is then escalated again to the appropriate expert, such as an Ops DBA.
Knowing the level of service your database system must achieve
Each organization will try to ensure that it can maintain whatever service level is required for each service, but the way it does so varies greatly, and for good reasons. The service level requirements for a database used by a retail bank will be quite different from those needed by a craft shop making teddy bears.
There are wide differences of scale, differences in the degree to which the functioning of the business is dependent on the database system, and also in the resources that can be thrown at a problem. Organizations will usually want to define up-front exactly what they want from a service, and what level of resources they can provide to achieve it.
A service failure will present risks. For how long can the organization tolerate the database being offline in the event of a crash, or to perform maintenance or deployments? How much data is it acceptable to lose in the event of system failure? All this, and much more, will be defined in a service-level agreement between the service provider and the organization. If this is a “24×7” system with little-to-no tolerance of data loss, then the IT department will need to create a workflow that can ensure an immediate response to any incident that threatens the stability or availability of the system.
Achieving a rapid and appropriate response
For our hypothetical example, we adopt the three-tier approach. This has three stages of response and, based on my own experience, is typical of many businesses. It is based on the idea that, no matter the scale or timing of the problem, some issues can be fixed rapidly without special technical skills, some can be dealt with as a routine by any experienced ops staff, and only then, when it is outside the ‘routine’, do you get the expert out of bed.
With database systems, every deployment ought to have accompanying training materials, scripts and documentation, provided by the development team. They should cover every conceivable event that could cause the service to fail and provide details of any possible remedies and procedures that might, at least, allow the operational staff to keep the service running until an expert can perform a proper fix.
Getting information about the database service
Sometimes a service will just disappear without any warning or alerts. It has happened to me in the past. To find out immediately, before the sound of distant screaming can be heard, you need an automated system, hosted separately, that regularly checks the quality of the service. It will check for the obvious issues such as connectivity (can the service be reached?), response time (has the service slowed significantly?) and severe warnings logged by the system. It is also a good idea to check using any criterion that has been agreed with the business. One of my bosses suggested that the monitoring system should regularly run (and subsequently remove) a ‘ghost trade’. If the company could still trade, there was no emergency; if it couldn’t, then we needed to respond.
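A minimal sketch of such an out-of-band check, in Python, assuming only a hypothetical host, port, and an agreed response-time limit from the SLA (a real probe would also exercise the application itself, for example with the ‘ghost trade’ described above):

```python
import socket
import time

# Assumed endpoint and SLA threshold -- substitute your own values.
DB_HOST, DB_PORT = "db.example.internal", 1433
MAX_RESPONSE_SECS = 2.0

def check_service(host: str, port: int, timeout: float = 5.0) -> dict:
    """Probe the database endpoint and time the TCP handshake.

    Returns a small status record the alerting system can act on:
    unreachable, reachable-but-slow, or healthy.
    """
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            elapsed = time.monotonic() - start
    except OSError as err:
        # Connection refused, timed out, DNS failure, etc.
        return {"reachable": False, "error": str(err)}
    return {
        "reachable": True,
        "response_secs": elapsed,
        "slow": elapsed > MAX_RESPONSE_SECS,
    }
```

A scheduler on the separate host would run this every minute or so and raise an alert whenever `reachable` is false or `slow` is true.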
For normal maintenance, you don’t need a completely separate system to monitor a database system, but for the task of checking for service interruption or intrusion, nothing else will do. Most of the time, however, you can just use alerts that emanate from the database application. These alerts are at the core of any database management for a production database. Normally, they will describe the nature of the event and the suggested severity. They are likely to include:
- Performance Alerts – high CPU usage, memory pressure etc.
- Availability Alerts – database downtime and connection failures
- Storage Alerts – low disk space, disk I/O bottlenecks
- Replication Alerts – such as replication failure or lag
- Security Alerts – unauthorized access attempts, suspicious activity, data access anomalies
- Backup and Recovery Alerts – backup failures, test restore failures
- Threshold-based Alerts – high number of open connections or concurrent transactions
- Error Alerts – critical errors, exceptions, database crashes
- Job Failure Alerts – failed or long running automated jobs such as ETL processes
- Resource Contention alerts – such as lock contention and extensive blocking
- Query Performance Alerts – persistently slow-running queries
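To make such a list actionable, each category needs a default severity and an initial routing decision. The sketch below is a hypothetical catalogue in Python; the category keys, severities, and tier assignments are illustrative, not taken from any particular monitoring product.

```python
# A hypothetical alert catalogue: each category from the list above gets a
# default severity and an initial routing hint. All values are illustrative.
ALERT_CATALOGUE = {
    "performance":         {"severity": "warning",  "route_to": "tier1"},
    "availability":        {"severity": "critical", "route_to": "tier2"},
    "storage":             {"severity": "warning",  "route_to": "tier1"},
    "replication":         {"severity": "critical", "route_to": "tier2"},
    "security":            {"severity": "critical", "route_to": "tier3"},
    "backup_recovery":     {"severity": "critical", "route_to": "tier2"},
    "threshold":           {"severity": "warning",  "route_to": "tier1"},
    "error":               {"severity": "critical", "route_to": "tier2"},
    "job_failure":         {"severity": "warning",  "route_to": "tier1"},
    "resource_contention": {"severity": "warning",  "route_to": "tier1"},
    "query_performance":   {"severity": "warning",  "route_to": "tier1"},
}

def classify(category: str) -> dict:
    """Look up severity and routing for an alert category; unknown
    categories default to a Tier 1 warning rather than being dropped."""
    return ALERT_CATALOGUE.get(category, {"severity": "warning", "route_to": "tier1"})
```

The defaulting behaviour matters: an alert the catalogue has never seen should still reach someone, not vanish.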
Designing the alerting system
As I’ve discussed previously in Database Monitoring for Developers, the means of providing meaningful alerts ought to be developed alongside the database. Developers should rehearse possible causes of failure and understand what sort of monitoring data and alerting mechanisms need to be in place in order to “raise a red flag”. When an alert is received during development or testing, the developers then work out the best way to investigate and resolve the problem and document it.
By following this approach, many of the systems that are needed for detecting problems and raising the appropriate alerts will already be in place within the database system, at least as SQL queries or scripts, before the first release of the database application.
Many alerts are threshold-based, meaning that they represent a significant deviation from what is considered an acceptable or “normal” range. This suggests that the maintenance of baseline information, which tells you what normal looks like for a set of metrics, is important for defining the nature of the alert. In development, these baselines will be speculative until performance testing is done, and more certain after the first release.
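As a minimal sketch of this idea, assuming historical samples for a metric are available, a baseline can be summarized as a mean and standard deviation, and a threshold-based alert fires when a new reading deviates too far from it:

```python
import statistics

def build_baseline(samples: list[float]) -> tuple[float, float]:
    """Summarize 'normal' for a metric as the mean and standard
    deviation of past samples (e.g. hourly CPU readings)."""
    return statistics.mean(samples), statistics.stdev(samples)

def is_anomalous(value: float, mean: float, stdev: float,
                 n_sigma: float = 3.0) -> bool:
    """Threshold-based check: flag readings more than n_sigma standard
    deviations from the baseline. The 3-sigma default is a common
    starting point, to be tuned once real performance data exists."""
    return abs(value - mean) > n_sigma * stdev
```

In development these baselines are speculative, as noted above; after release, the sample window would be refreshed regularly so that “normal” tracks the live workload.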
Although database development alerts can be adapted for use as part of the “tier 1 alerting” system in production, it will require some effort. The most obvious complication is the need for the database monitoring system to integrate with any alerting/pager system and dashboard that is already in use within the production setting. A third-party database monitoring tool used for alerting during development work cannot be used as part of a broader production alerting mechanism unless it has a way to list an event in the alert log and then feed the alerts into the existing Tier 1 system. For example, SQL Monitor supports this using webhooks that can deliver database alerts to any notification system. Otherwise, a task scheduler such as SQL Server’s SQL Agent, or the operating system’s own task scheduler, will need to perform the regular monitoring or event handling; in development, this will be valuable for its ability to provide alerts during a test run.
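The integration step can be as simple as an HTTP POST. The sketch below forwards an alert payload to a generic webhook endpoint using only the Python standard library; the URL and payload shape are assumptions to be matched to whatever your pager or dashboard tool actually expects, and this is not any particular product’s API.

```python
import json
import urllib.request

def post_alert_webhook(url: str, alert: dict) -> int:
    """Forward an alert record to a notification system via a generic
    webhook. Returns the HTTP status code so the caller can detect a
    failed delivery (which is itself worth alerting on)."""
    body = json.dumps(alert).encode("utf-8")
    req = urllib.request.Request(
        url,
        data=body,  # presence of data makes this a POST
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return resp.status
```

A production version would add retries and authentication headers, both of which depend on the receiving system.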
A second complication is that, while the database development team will be happy with complex or lengthy messages, even raw error messages with trace dumps, production alerts need to be much more formalized, for use by Operations staff. Production alerts should be restricted to a code, a description of the event and the probable severity level, and each alert will then need to be matched to the resources that the recipient needs to diagnose the problem. These resources will most likely include a short description of what needs to be done and, where possible, the scripts to do it.
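A formalized alert of this kind can be captured as a simple record. The sketch below is one possible shape in Python; the field names, the code format, and the severity vocabulary are all illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class ProductionAlert:
    """A formalized production alert: a code, a short description, a
    severity, plus the resources the recipient needs to act on it.
    Field names and formats are illustrative."""
    code: str          # e.g. "DSK-001" -- hypothetical code scheme
    description: str   # short, Ops-readable event description
    severity: str      # e.g. "info", "warning", "critical"
    runbook: str = ""  # the 'what to do' note for Tier 1
    fix_script: str = ""  # path to an agreed remediation script, if any

def format_for_pager(alert: ProductionAlert) -> str:
    """Render the terse one-line form a pager or SMS channel needs."""
    return f"[{alert.severity.upper()}] {alert.code}: {alert.description}"
```

The verbose development-time detail (stack traces, raw errors) would be logged against the alert code rather than sent to the pager.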
The most important decision, when designing any tiered response system, will be the escalation path. Teams will often want alerts to be sent both to a dashboard and to a pager system. Alerts are generally delivered by pager systems, email, SMS, or push notifications, while SNMPv3 or webhooks are used for communicating with a central management system. An alerting system can start simply, but at some point it needs to ensure that on-call personnel are alerted promptly by whatever medium suits them best. In designing or acquiring an alerting/pager system, you’ll probably need to provide a number of different ways of communicating with Ops staff, such as determining who is ‘on call’ and their preferred communication method.
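The “who is on call, and how do they want to be contacted” lookup can be sketched as below. The rota, names, and channels are hypothetical; a real system would read the rota from a scheduling service rather than a hard-coded table.

```python
# Hypothetical on-call rota: each entry covers a range of hours of the
# day and records that person's preferred notification channel.
ON_CALL_ROTA = [
    {"name": "asha", "hours": range(0, 12),  "channel": "sms"},
    {"name": "ben",  "hours": range(12, 24), "channel": "push"},
]

def route_alert(hour: int) -> dict:
    """Pick whoever is on call for the given hour (0-23) along with
    their preferred channel. A gap in the rota is itself a fault worth
    surfacing loudly, hence the exception."""
    for person in ON_CALL_ROTA:
        if hour in person["hours"]:
            return person
    raise LookupError(f"no one on call at hour {hour}")
```

The dispatch step would then hand the formalized alert to the SMS, push, or webhook sender that matches `channel`.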
Planning for an effective response to database alerts
At the heart of most operational services is structured escalation. An escalation procedure is a structured plan that outlines the steps to take when an issue or incident arises in a database service that requires attention beyond the initial level of support. The escalation procedure should be regularly reviewed and updated to incorporate lessons learned from previous incidents, so as to ensure that the process remains effective and efficient.
Tier 1 response to an alert
Many faults can be fixed by Tier 1 of support before any users become aware of them. The trick is to be alerted to any problems and have to hand the means to perform obvious fixes. This relies on production staff on-site who can check for any obvious fault and follow a prescribed process to see if it can be easily remedied. If the developers of a database have already predicted a particular problem, such as running low on disk space, they can provide instructions, or a batch file, that can be used by the first line of support, often a helpdesk or a monitoring system, to allow them to solve the problem or at least allow the service to resume. The sort of fault that can be fixed at this level is one where there is a known symptom and remedy, and the remedy has been agreed with the developers as appropriate. In cases where the issue is straightforward and can be resolved, the Tier 1 team can take the appropriate actions and document the resolution. If not, they would escalate to a colleague with sufficient experience and training to work out which expert should be summoned.
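As a concrete example of a scripted Tier 1 check of this sort, the sketch below tests free space on a data volume against an agreed threshold. The path and threshold are placeholders for whatever the development team actually documented for the database in question.

```python
import shutil

def disk_space_alert(path: str = "/", min_free_gb: float = 10.0) -> dict:
    """Tier 1 check: does free space on the volume holding the database
    files fall below the agreed threshold? Both arguments are
    placeholders to be taken from the deployment documentation."""
    usage = shutil.disk_usage(path)
    free_gb = usage.free / 1024**3
    return {"free_gb": round(free_gb, 1), "alert": free_gb < min_free_gb}
```

Paired with an agreed cleanup script, this is exactly the symptom-plus-remedy package that lets a helpdesk resolve the incident without escalation.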
For each release, there should be a meeting with the Ops people to provide a library of all the possible Tier 1 alerts and solutions, and to establish where an alert requires immediate escalation (pushing the “Big Red Button” – see later). It is worth building up this library as part of the post-mortem after every glitch. The problem checklist that should provide the list of Tier 1 responses will include Hardware Failure, Network Issues, Software Bugs or Errors, Data Corruption, Insufficient Resources, Security Breaches, Power Outages, Natural Disasters, Configuration Errors, Software Upgrade Failure, Long Running Transactions, Resource Contention, Backup/Restore Failures and Disk Space Exhaustion.
When Tier 1 staff are armed with basic instructions for dealing with each of these categories of event, they can fix or troubleshoot many problems promptly and effectively; even where this isn’t possible, they will have enough understanding of the problem to escalate it to the next level of response (Tier 2).
Problem escalation to Tier 2
If Tier 1 support cannot provide an immediate solution, the problem must be escalated to the Tier 2 on-call team, together with all relevant information about the issue, such as its nature, impact, and any initial steps taken.
At Tier 2, the first objective is to be sufficiently clear about the nature of the underlying problem to know what type of specialized expertise is required. The team must establish what components of the service are broken and what needs in-depth investigation.
Having done this, it is worthwhile investigating the issue to see if an immediate fix is possible. If so, the support team can take appropriate actions and document the resolution so that it becomes a ‘known problem’ that can henceforth be managed by Tier 1. If the issue cannot be resolved at this level, or if it’s identified as a critical problem, it can be escalated straight to Tier 3.
Escalation to an expert (Tier 3)
If the Tier 2 team determines that the issue requires specialized knowledge or resources beyond their scope, they escalate the issue to Tier 3. This level of support is likely to be provided by database administrators, security experts, network specialists, developers, or system architects. It may also need the involvement of management in cases where the incident affects the users of the service, or if extra resources outside the team are required.
The experts will review the issue and perform whatever advanced troubleshooting is required, often involving in-depth analysis and investigation. They work to resolve the issue and document their findings and actions. If there is a significant interruption of the service, management are likely to need to provide status updates to stakeholders, ensuring that they’re informed about the progress being made to address the issue, and the likely waiting time before the service resumes.
The big red button
If a critical or urgent issue will result in service interruption, could result in a data breach, or poses significant risks to data integrity, the database service or the organization, there must be a pre-determined emergency escalation path that allows the rapid involvement of Tier 3 experts as well as senior management or executives.
The responsiveness to database issues must be subject to continuous improvement. This suggests that lessons must be learned from the way that each incident was handled. A way of doing this is to meet after the incident is resolved and record the root cause of the issue and how it was resolved, including the steps taken and any further preventive measures that were then implemented. It is a useful time to identify any obvious areas for improvement, and any processes that need to be updated.
A relational database is designed from the ground-up for resilience in the face of the inevitable issues that crop up. To achieve this, it needs to provide as many clues as possible to the nature of any problems. Not only that, but it should be easy to measure the service it provides to make sure it is adequate. Armed with this information, it is possible to confirm what is wrong.
Although any database has intrinsic methods, such as logs, to provide clues as to the nature of an issue, a production database will need a monitoring system that, in addition to providing the common metrics that can warn of impending problems, will also provide an application-wide view of the service, its baselines and thresholds.
It is important that the monitoring system provides the best information to help support staff, and that includes ways to fix routine problems quickly. A structured response service will minimize downtime, improve incident response times, and contribute to maintaining a reliable and stable database service, especially if it is able to learn from experience how to deal with common issues.