Top Six SQL Monitor Metrics for Analysis

Many SQL Server metrics tell their story best when displayed together on the same graph, so you can see how the things you're measuring vary against each other, over time, in order to gain a summary of the stresses on the server. So what six metrics should you choose? Grant comes up with his six favourite diagnostic metrics.

In the new version of SQL Monitor, you can overlay the plots of different metrics on a graph in the Analysis tab. This allows you to get a clearer idea of what is occurring within your system by relating one metric to another, directly. There are so many metrics that you can select in SQL Monitor that the choices can seem bewildering. If I was faced with a poorly-performing system and I had Red Gate SQL Monitor deployed without custom metrics, there are six metrics in particular that I would want take a look at in order to begin to get an understanding of how my SQL Server system is behaving.

1779-80e37b59-c918-451d-836f-e482381e487

Avg. CPU Queue Length

This is a measure of the number of threads, on average, in the processor queue. It acts as an indicator of load of the processor. Yes, you can look at % Processor Time or other counters, but they’re just going to indicate a busy processor, which may not be a problem at all. If threads are waiting on access to a processor, then that is a different matter; that is an indication that you may be experiencing a slow-down in and around the CPU.

Avg. Disk Queue Length

The length of the queue on a disk indicates the extent of the waits that are occurring within the disk subsystem. As with the processor queue, this indicates that your system is waiting on the disks to complete operations:

Always remember that this is not a hard number. It really depends on the number of disk-controllers and disks involved. Just because you see a logical disk as a single entity, you shouldn’t assume that it’s a single disk. With RAID and SAN and other technologies, what looks like a single disk, isn’t. So you need to understand your hardware configuration before this number has real meaning.

Memory Pages/Sec

This counter is simply a measure of the pages being read from, or written to, memory on your system. So this is primarily an indication of load, but specific to the memory of your system. Any value here at a single point in time is only useful when compared to “normal” values from your system over time and under varying loads. Without the ability to compare this value to another, you can’t be sure whether you have an actual issue with memory.

Latch Wait Time

This is a measure that is internal to SQL Server, not a part of the OS like the other measures. A latch is an internal lock that holds on to resources. On its own, this measure of average wait times is merely an indication of load. You’ll need to compare this value to previous values to understand if you have a significant trend that might suggest a bottleneck caused by your disk drives not keeping up with the I/O demands of the server.

Buffer Page Life Expectancy

While this measure is somewhat over emphasized in some quarters and it can be misleading, it is a good general measure of what is occurring within the memory of SQL Server. The measure itself simply says how long pages are staying in memory within SQL Server. The absolute numbers are not meaningful. Rather you need to look at the value over time and compare it to baselines. Over time this value frequently looks like a ever increasing line with occasional drops. The drops indicate some operation which flushes the memory such as nightly data loads in an OLTP system.

Average Lock Wait Time

As with all the other wait measures, the average lock wait time indicates that processes are waiting on resources. In this case, the request for a lock is waiting for other locks to clear. This is another great indication of the amount of load occurring within your system. Again, as with so many other measures, you’ll want to compare this value to previous measures on the same system. You’ll also likely have to plan to drill down to identify exactly what locks are waiting using sys.dm_os_wait_stats.

In Action

The following graphic is a screen capture from a production system showing all six of these metrics laid out together. You can actually see how the activities within each of the metrics relates to the other metrics and shows how performance is going on this system.

1779-1-fb3301b0-95ac-4c9b-98ec-32fa2b2ae

You can see a correlation between lock wait times and disk queue in the middle of the recorded period, possibly indicating a performance bottleneck, but it cleared very quickly. You can see how the CPU queue and the disk queue are frequently unrelated as different types of actions on the system lead to queuing on different resources. Not everything is going to go up and down together. You an also see a classic pattern with the page life expectancy as it constantly grows for a period and then falls off as a data load is processed. In short, this is a good picture into the behavior of this production system that can help us figure out if we’re having performance problems, when they are occurring, and what resources are most affected.

Conclusion

These are not meant to be the ultimate measures of performance of a system. Instead, these metrics provide ways to quickly understand the general performance characteristics of your system. These measures give you a focus to begin your investigation into where and why you’re experiencing performance issues. You’ll still have to do the additional work of drilling down to identify exactly where the performance problems lie, but with these measures on display, you’ll know quickly if you need to start that drill down process.