Sometimes query performance problems sneak into SQL Server after a software release that involves database schema or code changes, and you must hunt down the offending query and tune it. Other times, however, the performance changes for less immediately explicable reasons. I’m sure we’ve all experienced this situation: your system is humming along, and you’ve tuned queries, updated statistics, added indexes, and done all the good things to ensure that it continues to hum along. Suddenly, however, you get a phone call or instant message or text or an alert, possibly all four at once, telling you that an application process is “running slow”, and what the heck is going on?
You need to fix things quickly, but you don’t even know what happened. How do you deal with this sudden change? SQL Monitor can help diagnose the cause of sudden changes in query performance, by presenting you with the resource metrics, waits, query execution statistics, execution plans and query history, the latter of which highlights changes in behavior as they relate to changes in plan.
Investigating a sudden change in query behavior
It seems that application is running abnormally slowly. So, you open SQL Monitor, navigate to the affected instance ad open the Overview screen and examine the interactive activity graph for the period in question. You can see the CPU, Disk IO and waits have all spiked over this period.
You scroll further down the Overview screen and scan the list of most expensive queries that have run over the time period selected in the focus window of the interactive graph. By default, for the recent 24-hour period, this is the last 2 hours, but is adjusted using the grab bars.
You can see that one query is called far more frequently than any others. You are not surprised to see it there because you know this query well; it’s the one that gets called all the time, repeatedly, throughout the day.
SQL Monitor can show us a lot of details about the general runtime characteristics of the query. What does surprise you is to see that the number of logical reads, per execution, is 93. That seems high to you, since you know that this query only returns a few rows each time. You click on the query to see further details, including the execution plan, any significant wait types associated with the query, and the query execution history.
At the top of Figure 2, SQL Monitor provides a T-SQL statement to allow you to pull the plan from cache, or the option of simply using View query plan, to see it within SQL Monitor. Directly below we see the associated T-SQL Statement. If the query is part of a larger batch, or a statement within a stored procedure, the Show full query button will reveal the full batch or procedure definition, which is incredibly handy when you’re trying to identify where a query originated.
Next, we show the top wait types associated with the query. This only wait type we can see is one associated with on the CPU. This doesn’t indicate a problem, necessarily, but it does mean that the query needs CPU resources and is often waiting to get them.
Finally, we have a history of the performance and behavior of the query over a given time frame, and this section is very interesting. You can see that the number of executions per minute has plummeted over the defined period. By contrast, the duration spiked from “zero” (or very low) to around 300ms, and then back down to a higher, by steady execution time of 30ms. There have also been changes in execution plan over the same time.
These metrics are captured from the Dynamic Management Views (DMVs) such as
sys.dm_exec_query_stats. SQL Monitor maintains the data over time, and you can control the behavior using the drop downs.
Execution plans and query history
At around 10 AM, the number of executions of this query, per minute, has fallen from somewhere in the range of 100,000 every minute to something much less. We can get the exact details by hovering the mouse over the dark line representing executions per minute, to get a pop-up detailing the information captured:
So, instead of 100,000 executions a minute, we’re only getting about 2,000. While the cause might be completely benign, such as an application simply stopped calling the procedure, it might alternatively be the problem that explaina why your boss and the CTO just showed up in your cube. Let’s gather some more metrics so we can give them the story.
We can hover over the Duration line, at a time shortly before and shortly after the change in behavior. Before, we see an average duration of 0.05ms, and once the massive spike up to 298 ms has subsided, the query settled out at a consistently longer duration of about 30 ms:
This large bump up in average duration seems to be a very likely cause of the number of completed executions per minute plummeting. So, let’s examine the execution plan changes that occurred around that time, indicted by the orange Plan changes circles in the graph.
We can click right on each of the plan changes to see what plan was implemented at each point. If we choose the second of the two, we’ll see the plan currently in use (which you could confirm by selecting View Query Plan in Figure 3).
Here the optimizer has chosen to scan the clustered index on the
MovieStageLocation table, reading all 10K rows, looking for those with the requested
LocationName. For each row it finds, it performs a seek of the clustered index on the
Location table for matching rows. All the other tables are accessed with index seeks, and you’re suspicious of the scan on
MovieStageLocation, since you know it only returns a small number of rows.
At this point, we want to confirm that the behavior we’re seeing in the immediate run up to the plan change is, indeed, representative of the “normal” behavior on our system. To do this we’ll need to compare the current behavior to a couple of previous periods over which the query was running normally.
A bit of time travel: what does normal look like?
Generally, if someone had alerts you performance is slower than normal, you’d first want to confirm that this was true, by comparing the server activity over the period of slow performance to the “normal” behavior you expect to see. One way to do this is to use SQL Monitor’s baselines, which allows us to compare resource use, activity and behavior for the current period with, for example, the same period yesterday, or the same period for each of the previous 7 days. This is especially useful if you’ve identified that the problem occurs during the scheduled execution of a job, or report. However, baselines only cover database and server metrics, not query metrics.
So, instead, we’re going to use SQL Monitor’s “back in time” feature to compare query activity over the period of slow performance to a similar period, where the query is being called as usual, but performance is “normal”.
Since the data collected by SQL Monitor is displayed through a web page, we can easily open more than one page. We already have a page open displaying activity over the slow period, so we simply open another page with a connection to the same SQL Server instance, go to the same Overview page, and at the top right of th activity graph, we find the rewind time feature, which allows us to display activity for previous periods of time.
Clicking the date and time, I can select a point in time in the past when I know things were working correctly. Now, when I went home last night, everything was working fine. So, I’m going to select 17:00 on the preceding day:
You can also display the period of data in which you’re in. I’m currently drilled down to show only a one-hour period, but you can easily broaden this out to see a les granular view of activity:
Those are just the defaults. You can also adjust the range manually.
When you drill down on the queries, you can see that our currently-misbehaving query is also top of the Top 10 queries list, and in fact is called even more frequently over this period than the “slow period” shown in Figure 1. However, the number of logical reads per execution is only 7, compared to the 93 we saw before.
Clicking on the query, we see the same query details screen that we saw earlier, in Figure 3.
Clicking on View Query plan, we’d see the same “good” plan that we saw in Figure 8, before the sudden change, which has a non-clustered index seek on the
Location table. The only wait type displayed is, again, the
SOS_SCHEDULER_YIELD wait, associated with yielding of the processor by a process, then waits to get back on it, and is no cause of concern in this case.
In the query history section, we can see that the query performance and behavior has been consistent over this period, with the query generally running in just a little over 0.05ms, which tallies with what we saw earlier, in the period immediately preceding the plan change.
You click on View Query Plan, and you see a similar plan except that the Clustered Index Scan, which I highlighted in Figure 7, is gone. In this plan the optimizer is still performing a seek operation, to retrieve rows from the
Location table, but is now is using a non-clustered index. For each row it finds, it performs a seek of the
MovieStageLocation clustered index for matching rows.
We are starting to piece together a full picture of what seems to have happened. At this time, and in the period leading up to 10AM the next day, we had an efficient plan. The query returns only a small number of rows from each of the tables and is using the available indexes to perform seek operations. The optimizer combines the various inputs using Nested Loops joins, which are the most efficient join types when combining two small data sets.
Suddenly, the plan changes, and the optimizer is scanning all the rows in the
MovieStageLocation table every time the query is called, and it is called very frequently, as we’ve seen. So, this would seem to explain why, after the plan change, the SQL Server is performing far more logical reads to return the required data, the query is running much slower, and the number of executions per minute plummets.
With the information in hand, we can now confidently state that “normally” we see around 100,000 executions of this query per minute, and it runs in about 0.05ms. The reason we’re now getting 2000 executions per minute instead is because, in the blink of an eye, the query started running in about to 30ms. The cause is a sudden change in execution plan, meaning that a 10K-row table is scanned on every execution.
The question is, why has the optimizer suddenly stopped using the non-clustered index? To answer this question, we would need to step outside of SQL Monitor. It’s possible that statistics or data has changed, meaning that the optimizer no longer views it as the most efficient way to access the data. In fact, when I went to check on the statistics, our most likely culprit, I found that the index itself is missing entirely. Further investigation revealed that a junior DBA dropped the index by accident. Recovering the index from source control and putting through an emergency release, I was able to get the index back in place and performance resumed back on normal.
The root causes of changes to performance can vary wildly. It’s not always going to be as simple, or as catastrophic, as a dropped index. However, the diagnostic process, when SQL Server query performance suddenly degrades for no apparent reason, is always going to be roughly the same:
- Examine activity of the period of slow performance
- Use SQL Monitor to compare this to “normal” behavior in the past
- Identify the differences and changes and investigate their possible causes
This simple process should allow you to diagnose a range of different causes, from server settings being changed leading to differences in behavior, to changes in data distribution leading to differences in behavior.
Regardless of the root of the problem, SQL Monitor will provide you with enough data to enable you to understand what the problem is, where it came from, when it happened, and what you need to do in order to fix it.