
Introduction to Analyzing Waits using SQL Monitor

Phil Factor sets out with the modest aim of giving you enough of an introduction to waits to better understand the wait information you get from a SQL Server monitoring tool like SQL Monitor, and the rather overwhelming amount of information available in the underlying DMVs and Extended Events.

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

You are investigating the cause of a performance slowdown on one of your SQL Server instances, and you see this on its Activity Graph, in SQL Monitor:

The timeline (x-axis) shows when SQL Monitor raised a couple of long-running query alerts (the blue circles), plus some disk average write time warnings, in orange. There are also significant spikes in wait times and disk I/O usage. Things aren’t going well. What, you ask yourself, is causing that leap in waits? You look at the wait history graph:

Hmm. Zoom in a bit…

So, the spike in waits is due mainly to a spike in wait times for the PAGEIOLATCH_SH wait type, which in short means that user requests are being forced to wait to obtain a latch on a page in the buffer pool, in order to read a page that is not currently in memory.

As I am the man behind the curtain, I can admit at this point that I’m running several requests that are trawling through a cursor-driven process that reads a four-million row temporary table in one of the databases.

However, without that ‘inside’ knowledge, does this wait information point to the cause of stress? I’d say yes, if you can correlate it with query details and performance metrics and have enough of a basic grounding on the conventions of wait stats. Otherwise, you are at the mercy of what one well-known blogger referred to as ‘Wait Astrology’.

Investigating performance problems using waits

It would be nice to think that all SQL batches use code that makes as few demands on resources as possible, uses the appropriate isolation level, limits transactions to just the essential, uses indexes appropriately, and locks resources for the bare minimum of time.  In the real world, however, there will be queries that slip through, although it is possible to catch many of these critters with good monitoring in development and testing. Occasionally these clashes only happen under production workloads, however much we try to reproduce them in performance testing.

Monitoring Waits provides you with a different perspective on performance problems in SQL Server, seeing the progress of interacting processes in terms of the time they must spend waiting for the resources they need. In its DMVs, SQL Server exposes wait statistics at the server-level, either aggregated across all sessions, in sys.dm_os_wait_stats, or for currently-waiting (i.e. blocked) sessions, in sys.dm_os_waiting_tasks.

The nice thing about analyzing these waits through SQL Monitor is that it records all this information over time, and you can see it directly within the context of server activity around that time. This means that you can find out a lot more about what happened by ‘putting the clock back’. By this I mean that we can check for waits and for blocking processes by setting a sliding window (the focus window on the Activity graph, plus the ‘rewind time’ feature) to delimit the time of the problems, and then check on the blocking processes in that time interval. By selecting the perpetrators in this blocking list, you can check the chain of victims, and see the SQL being executed. You will also know the session, and the application and login for that session. This gives you sufficient information to make an educated guess as to the cause.

The basics of waits

In any multitasking system, threads that undertake a process will spend their time either running (CPU service time), waiting to get on the CPU (signal waits), or ‘blocked’, waiting for a resource to become available (resource waits).  The total response time for any request is the sum of the service time and the total wait time.

Any thread that is not waiting on a resource, but can’t get on the CPU, waits in a queue of runnable workers until it gets a 4 ms quantum of active life on the scheduler. From the perspective of the CPU, each core slices its time around all the existing threads, giving up to 4 ms to each runnable thread. A thread is suspended whenever it encounters a resource wait, so if an executing thread suddenly hits a resource wait, it’s immediately suspended and joins the ‘waiter list’. Alternatively, if a thread reaches the end of its quantum without completing its task, it yields the CPU and joins the back of the runnable queue. A waiting thread is therefore either runnable or suspended.

Imagine that you are multi-tasking in the kitchen. You have two distinct tasks (or batches in SQL Server terms) to execute, one of which is to cook pasta and the other is to prepare pesto sauce.

You need boiling water to cook pasta, so you switch on the kettle. The ‘pasta’ task is now blocked, or suspended, but the ‘pesto sauce’ task can proceed so you start chopping the basil and toasting the pine kernels. Now you need to blend these with a bit of olive oil and garlic, but the blender is at the back of the cupboard. In the meantime, the kettle has boiled, so the ‘pasta’ task resumes. From the perspective of the individual threads of work, you experienced both the ‘kettle’ and ‘blender’ wait type, but then who cares?

If, of course, the kettle still hadn’t boiled by the time the pesto sauce was ready, then you’d experience significant ‘kettle’-related waits and your supper would be delayed. However, what is the root cause of the problem? If your analysis of why supper was late was based purely on studying the waits, you could easily blame a faulty kettle, and miss the fact that your son had unplugged it. Similarly, with waits: high wait times for certain wait types can indicate bottlenecks or hotspots in query interactions within the SQL Server instance, but they don’t always lead you to identifying the cause or offering a solution.

Whereas it is easy to demonstrate the waits generated by a single application and its SQL statements on an isolated server, it’s often hard or impossible to isolate and replicate production problems that way. The trick is to identify the processes, or the combination of processes, that are causing unusually high waits. What you really need to achieve this is a “per session” or even “per statement” resource (wait and CPU usage) breakdown.

Viewing waits using SQL Monitor

SQL Monitor’s activity graph will show the general trend in overall wait times (ms/s). You can also generate a waits ‘baseline’, which shows the average wait times for the previous 7 periods (so if the period is the “Last 24 hours”, the baseline will represent average wait times over the past week).

This will tell you, broadly, whether requests are waiting longer than average over the period, so you can correlate it directly with any changes in resource use (CPU, Disk IO, and Memory) and with the occurrence of any alerts.

The Top x Waits section has the wait history graph (shown earlier), plotting either the frequency or duration (ms) of waits (/s) for each wait type detected over the period, and the top waits table, showing any significant wait types, how long SQL Server requests spent waiting for each wait type, and which queries were affected.

Waits on their own only tell you the sort of problems that are delaying queries, not the underlying cause of the “bottleneck”. The good thing about using a tool like SQL Monitor is that you get a single, coordinated perspective of the waits, the query execution statistics and plans, and the performance metrics (counters) that indicate resource contention. In other words, we can corroborate what we find from analyzing wait statistics with specific resource usage measurements (disk transfer rates, amount of CPU time consumed, and so on), in order to locate the most contested resources on your system, and so uncover bottlenecks.

However, since you cannot currently get baselines for any individual wait type (wait times or count of waiting tasks), for a more historical perspective on the causes of delays you’ll want to investigate in more detail using the SQL Server tools.

Investigating waits in SQL Server

In SQL Server, we can measure the waits associated with a query or any other process. In fact, SQL Server measures these waits obsessively and creates wait statistics for them, regardless of whether they are important to the database person. They’re all needed by someone, but maybe not you. The statistics are collected by each SQLOS Scheduler and aggregated (i.e. they are ‘server-wide’).

There are several ways of viewing wait statistics in SQL Server.

DMVs

Several DMVs are provided to make it easier to get aggregate wait information. Just a couple of examples:

sys.dm_os_wait_stats will tell you aggregate information about each wait type (running totals, accumulated across all threads and sessions), from when the server was last reset or the wait statistics were manually cleared (using DBCC SQLPERF).

The wait_time_ms column is the total wait time (ms) for all tasks for this type of wait, max_wait_time_ms is the maximum wait time (ms) for any of the tasks encountering this wait, and signal_wait_time_ms is the amount of time (ms) that requests waited after being signaled that the resource was freed, i.e. it’s the total, cumulative time that threads spent in the runnable queue.

For troubleshooting production servers, you’d want to extend this sort of query so that it excluded a list of ‘less meaningful’ wait types (see for example SQL Server Performance Tuning Using Wait Statistics: A Beginner’s Guide).
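As a sketch of what such a query might look like, the following lists the top wait types by cumulative wait time, with resource and signal components separated out. The exclusion list here is a small, illustrative sample of ‘benign’ wait types, not the full list you’d use in production:

```sql
-- Top wait types by cumulative wait time since the statistics were last cleared.
-- The excluded wait types below are illustrative only, not an exhaustive list.
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms / 1000.0 AS total_wait_secs,
       (wait_time_ms - signal_wait_time_ms) / 1000.0 AS resource_wait_secs,
       signal_wait_time_ms / 1000.0 AS signal_wait_secs,
       max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN
      (N'SLEEP_TASK', N'LAZYWRITER_SLEEP', N'SQLTRACE_BUFFER_FLUSH',
       N'XE_TIMER_EVENT', N'REQUEST_FOR_DEADLOCK_SEARCH',
       N'CHECKPOINT_QUEUE', N'BROKER_TO_FLUSH', N'WAITFOR')
ORDER BY wait_time_ms DESC;
```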

sys.dm_os_waiting_tasks – this is for currently executing requests where the underlying threads are suspended, waiting for a resource to become available. In other words, it provides the current wait profile of a SQL Server instance, revealing the wait types that are currently causing ‘blocking’, and so forcing threads to spend time on the waiter list. It doesn’t show any signal waits, or any waits related to CPU scheduling or yielding (so you won’t see the SOS_SCHEDULER_YIELD wait type).

It’s for times when you suspect significant blocking is occurring right now. By joining the results from this DMV to those from sys.dm_exec_requests and sys.dm_exec_sql_text, you can reveal the identity of the blocked and blocking sessions and the associated SQL statements.
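A minimal sketch of such a join might look like the following (session IDs up to 50 are filtered out as a rough way of ignoring most system sessions):

```sql
-- Currently waiting tasks, with the blocking session and the SQL being run.
SELECT wt.session_id,
       wt.wait_type,
       wt.wait_duration_ms,
       wt.blocking_session_id,
       st.text AS sql_text
FROM sys.dm_os_waiting_tasks AS wt
JOIN sys.dm_exec_requests AS er
  ON er.session_id = wt.session_id
CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
WHERE wt.session_id > 50   -- rough filter for user sessions
ORDER BY wt.wait_duration_ms DESC;
```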

If SQL Monitor detects significant blocking, it raises the blocking process alert, and captures the blocking chain, in graphical format, from the blocked process report, so you already have all the information required to diagnose current blocking.

sys.dm_exec_session_wait_stats returns the same information as sys.dm_os_wait_stats but with the session_id. It can provide aggregate information about sessions, and about specific queries and batches for currently active sessions (the information resets for a session when the connection is opened or reset). It’s useful mainly if you know the session_id, so you can snapshot the statistics when the query starts, and when it completes.

Here, we are using it to tell us what is happening with each session, and what the last batch was:
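The original listing isn’t reproduced here, but a query along these lines would do the job, using the most_recent_sql_handle column of sys.dm_exec_connections to retrieve the last batch for each session:

```sql
-- Wait statistics per user session, with the text of the most recent batch.
-- A sketch: column choices here are illustrative.
SELECT ws.session_id,
       ws.wait_type,
       ws.waiting_tasks_count,
       ws.wait_time_ms,
       st.text AS last_batch
FROM sys.dm_exec_session_wait_stats AS ws
JOIN sys.dm_exec_connections AS c
  ON c.session_id = ws.session_id
CROSS APPLY sys.dm_exec_sql_text(c.most_recent_sql_handle) AS st
ORDER BY ws.session_id, ws.wait_time_ms DESC;
```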

Actual execution plans

Since SQL Server 2016, the XML showplan (actual execution plan) has provided wait information about an individual query. At first glance, you’d think it would be useful. After all, SQL queries will often have to spend some time waiting rather than executing, either waiting for resources or for the CPU. A query will be a chain of brief activity punctuated by waits for essential preconditions, such as for a page from the buffer pool to become ready, or to obtain a lock to read or change data.

It is, however, difficult to square what you get with what happened, and certain types of wait are left out ‘by design’.

Extended events

Extended events provide a couple of events to track the wait information for currently executing tasks (wait_info), or for completed tasks (wait_completed).

You can see the waits that are associated with the execution of a query simply by filtering on the text of the query in the definition of the event. However, a better way to find out about the waits associated with a specific query may be to use Query Store.

Even then, most waits that occur while that query is running aren’t attributable to a particular query, and the query you are filtering on could be the victim rather than the perpetrator. However, if you can relate waits to queries, it is possible to detect signs of bottlenecks or stall points within that query. For example, lock waits indicate data contention by queries; page IO latch waits can indicate slow IO response times; page latch update waits will indicate incorrect file layout. The problem is that, when there are many running queries that are sharing some data pages, you can conclude very little about the occurrence of, say, a PAGEIOLATCH_SH wait.

Where a wait is directly associated with a query, so much of what you see depends on what else is happening on the server and the circumstances under which the query was running. Could waits ascribed to one query really be attributed to another? If you run the same query several times, the wait information you get will vary. The behavior of the caching of data pages isn’t directly correlated with an individual query, after all. It isn’t always useful to try to tease out significant wait information about individual queries on a busy server.

With extended events, it is also possible to get statistics about waits for individual databases just by adding the database name as an action. Here is an example that allows you to look at tempdb and gives you the text of the batch.
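The original listing isn’t reproduced here, but an event session along these lines would capture completed waits in tempdb together with the batch text (the session name, target, and duration filter are all illustrative choices):

```sql
-- A sketch of an Extended Events session capturing completed waits in tempdb,
-- with the text of the batch. Names and filter values are illustrative.
CREATE EVENT SESSION TempdbWaits ON SERVER
ADD EVENT sqlos.wait_completed
(
    ACTION (sqlserver.database_name, sqlserver.session_id, sqlserver.sql_text)
    WHERE (sqlserver.database_name = N'tempdb' AND duration > 0)
)
ADD TARGET package0.ring_buffer;

-- Start the session when you are ready to collect:
-- ALTER EVENT SESSION TempdbWaits ON SERVER STATE = START;
```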

The types of wait

There are around a thousand different types of wait in SQL Server. They are essential for the Microsoft engineer who needs to understand the dynamics of an extraordinarily complex multitasking system that has its own scheduler, but they aren’t all primarily intended as a diagnostic tool for the end-user.

There are three basic types of wait: Resource, Queue, and External. You will get resource waits, such as locks, latches, network, and disk I/O waits, when a worker thread waits for a resource that isn’t available or is being used by some other worker. Queue waits occur when a worker is idle, waiting for task packets for background jobs, such as the deadlock monitor or deleted record cleanup, to be placed into a work queue. External waits are delays in the completion of a task by an external process, such as a remote SQL Server. They happen, for example, when an extended stored procedure call or a linked server query is waiting on the other party to finish.

If you’re using SQL Monitor, you’ll find explanations of all the interesting wait types, in the Top Waits listing, as shown earlier. The following are some of the most common types of resource waits.

ASYNC_NETWORK_IO

This happens when the thread is waiting for a remote task to finish before it can proceed. You’ll expect these occasionally, but a sudden increase could signal network-related issues. However, it is most likely to be the result of a client application demanding too many rows in the result and then not being able to consume the results from the SQL Server quickly enough. What you’re seeing is that there are long waits between sending the data and receiving acknowledgements back to the server. One often sees this when application programmers attempt to do filtering on the client-side, rather than allowing it to be done within SQL Server. Check also for unnecessary transactions or transactions held open for too long by the application.

CXPACKET

This wait type always accompanies any parallelism in queries, and it happens when a thread in a parallel process has to wait in the exchange iterator for another worker thread in the same process to complete. Ideally, all threads will complete together, but if they are unbalanced, you’ll see more CXPACKET waits as the quick threads wait for the slower ones to catch up.

The key point is that there will almost always be another underlying wait type that is causing the CXPACKET waits, and the trick is in discovering the underlying cause. For example, if the optimizer parallelizes a query, and you’ve allowed it to use all 8 cores, but your disk system can start to struggle when handling IO requirements for more than 4 cores processing at once, then you will get CXPACKET waits, but the underlying cause is an IO bottleneck.

The most common cause is parallel execution overwhelming some other resource, and the most common reason is inefficient queries and poor indexes, causing costly operations like large index scans, which SQL Server parallelizes, but which could have been avoided. It is good to discover the queries that are causing a jump in CXPACKET waits, because it is often a symptom of something awry at query-level. Have you something nasty such as a function in the WHERE clause? Is the query properly covered by an index, to minimize reads? Are there unnecessary sorts going on?

If PAGEIOLATCH_* waits are also high, it is worth checking the query plan for large parallel table scans. Otherwise, if you suspect unwanted parallelism with a skewed distribution of work, it might be worth checking the setting for the Maximum Degree of Parallelism (MAXDOP).
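You can check the current instance-wide parallelism settings from sys.configurations; the cost threshold for parallelism is worth reviewing at the same time:

```sql
-- Current parallelism-related configuration (value_in_use is the live setting;
-- a MAXDOP of 0 means 'use all available cores').
SELECT name, value_in_use
FROM sys.configurations
WHERE name IN (N'max degree of parallelism',
               N'cost threshold for parallelism');
```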

DTC

This wait type only happens if the server is using Microsoft Distributed Transaction Coordinator (MS-DTC), and it happens when the distributed transaction is waiting on an event before it can change state. A single transaction is parallelized on several systems at the same time, and the transaction cannot be concluded until it’s been completed on all the systems.

LCK_M_*

This wait type happens when a session is obliged to wait to acquire a lock on an object because it is already locked in an incompatible mode by another session. The DMV sys.dm_tran_locks will display the active requests to the lock manager for locks that have been granted or are waiting to be granted.

The various types of lock waits (Bulk update, intent shared, intent update, intent exclusive and so on) occur in different contexts. For example, you’ll get an intent exclusive lock wait (LCK_M_IX) when a query is trying to modify a row but has to wait because another query is reading the same row. You’ll see a large increase in these LCK_M waits if an overambitious data-modification query is causing lock escalation to a table lock, because it is slow, or it is modifying a large table. This query is thereby causing all the other threads that are, at the same time, trying to read the table to have to wait.

An abundance of exclusive lock waits (LCK_M_X) indicates that lock escalation is happening, and if these become excessive it is worth checking whether your isolation level is too severe in its restrictions. To be fair, there are many factors that can cause locks to be held longer than usual, such as I/O bottlenecks, unnecessary transactions, over-long transactions, unnecessary lock escalations and slow-running queries escalating locks. Perhaps the best way of avoiding excessive locking because of a long-running data-modification statement is to stop trying to do it in ‘one fell swoop’, and instead break the data-modification process up into smaller chunks, involving a more modest number of rows.
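A minimal sketch of the chunked approach follows; the table name, filter, and batch size are all hypothetical. Keeping each batch well below the lock-escalation threshold (around 5,000 locks) keeps transactions short and avoids escalation to a table lock:

```sql
-- Delete in small batches rather than in 'one fell swoop'.
-- Table, predicate and batch size are illustrative only.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    DELETE TOP (4000)
    FROM dbo.StagingRows
    WHERE ProcessedDate < DATEADD(MONTH, -6, GETDATE());

    SET @rows = @@ROWCOUNT;   -- loop ends when nothing is left to delete
END;
```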

LATCH_*

Latches are general-purpose, lightweight ways of ensuring data consistency in SQL Server. A LATCH_* wait occurs when an operation is waiting for access to a SQL Server structure in memory other than a buffer page or a transaction mark: structures such as indexes, data pages and internal structures like the non-leaf pages of a B-tree. Latches protect access to these internal memory structures, preventing several threads from modifying the same structure at the same time. You can get a breakdown of latch waits from sys.dm_os_latch_stats, which categorizes the latch classes that the LATCH_DT, LATCH_EX, LATCH_NL, LATCH_SH and LATCH_UP wait types aggregate together. It is worth inspecting cases of latch contention, as it can slow throughput; see Diagnosing and Resolving Latch Contention on SQL Server.

OLEDB

When a SPID makes a function call via an OLEDB provider to a remote client, it must wait for the function to return the required data. OLEDB waits happen when the OLEDB client cannot send data as fast as the Server can receive it because either the network or the client is slow. You’ll generally see it when the SPID is waiting for remote procedure calls, linked server queries, BULK INSERT commands, or full-search queries. It can also happen when uploading or inserting many records in a single batch. This wait type is not used for synchronization, but just indicates the duration of calls to the OLEDB provider.

PAGEIOLATCH_*

The wait is generated when threads are forced to wait for shared read access to a page while it is being loaded from disk. Buffer latches, including the PAGEIOLATCH_EX (exclusive) wait type, are used to synchronize access to buffer structures and associated pages in the SQL Server database. When SQL Server must read pages from the disk into the memory buffer, or from the buffer out to disk, it must place latches on the buffer pages while those processes take place. A large total of PAGEIOLATCH waits can indicate that the disk is being hammered: or, in other words, that the I/O subsystem is under pressure. You’ll also see it if the server is having to deal with queries that, for the want of a suitable index, must use table scans instead of index seeks.

A large number of such waits could, of course, indicate problems with the disk subsystem, but it could be that the available memory isn’t sufficient for the workload, and so the buffer pool has to keep swapping pages in and out between memory and disk. This can also happen when the buffer pool is under memory pressure from another consumer of memory, reducing the amount of space available for data.

PAGELATCH_*

This wait happens when a task is waiting on a latch for a page that is already in the buffer pool, rather than one involved in an I/O request. It is a general latch that doesn’t cause many problems for throughput, and it comes in several categories (destroy, keep, exclusive, shared or update). When it is frequent, it generally suggests hotspots, typically in the tempdb allocation bitmaps; this contention is caused by the rapid creation and dropping of temp tables. It generally happened when there weren’t enough tempdb data files, but is less common nowadays. You can also see an increase in this type of wait when you have an insert hotspot. This happens when several threads try to insert rows into a table that has a clustered index on an identity value or date, forcing all inserts to the end of the table.

SOS_SCHEDULER_YIELD

This wait type happens when threads exhaust their ration of 4 ms for their scheduling quantum without having to suspend for a resource wait. You often get a surge in this type of wait if the workload fits easily into buffer memory, or if a thread has a long task, such as a lengthy index scan, that involves few resource waits. However, it can also mean that a query involves a ‘parallelism inhibitor’, such as a scalar function or table variable, and so is forced to use a query plan that can only use a single CPU core. Alternatively, you might not have enough CPU power available to serve the workload; the longer the runnable queue, the more you’ll see this wait type.

WRITELOG

When a SQL Server session produces a WRITELOG wait, it’s waiting for a log flush to finish. This will happen during checkpoints and transaction commits and involves writing the contents of the log cache (user delete/update/inserts operations) to the transaction log on disk before telling the end user that the transactions have been committed. When you see a lot of these, it can signify poor disk performance on the transaction log, or contention arising from having the transaction logs for multiple databases on the same physical disk.

Dramatic changes in WRITELOG waits would certainly require investigation, especially if they coincide with software releases, signaling that something has been introduced that has changed the rate of transaction commits. The most common causes are frequent transaction commits, when you have many tiny transactions forcing frequent flushes of minimal-sized log blocks; disk subsystem performance issues; or index design and maintenance problems, since index rebuilds cause a lot of log traffic, as can page splits.

If you also see a lot of ASYNC_IO_COMPLETION, IO_COMPLETION, LOGMGR or PAGEIOLATCH_* waits, then it could be an I/O problem. If the disk is the bottleneck here, the transaction log should be moved to more appropriate storage.
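To check whether the log drive really is the problem, you can look at the average write latency per log file from sys.dm_io_virtual_file_stats. This is a rough indicator only, since the figures are cumulative from when each database came online:

```sql
-- Average write latency for each database's transaction log file(s).
-- Figures are cumulative since the database came online, so treat as a rough guide.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_writes,
       CASE WHEN vfs.num_of_writes = 0 THEN 0
            ELSE vfs.io_stall_write_ms / vfs.num_of_writes
       END AS avg_write_latency_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id = vfs.file_id
WHERE mf.type_desc = N'LOG'
ORDER BY avg_write_latency_ms DESC;
```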

Conclusion

Whereas a detailed look at waits is essential for the Microsoft engineer who is fine-tuning the SQL Server system, it is not an all-purpose diagnostic performance tool for the average SQL Server developer. Even DBAs will often struggle to reconcile data about waits with the other evidence. However, a study of waits and their causes is important when you are profiling a SQL Server that is running diverse tasks under load. To get the clearest information about what is going on you will need to compare waits over time against a baseline, analyzed together with the counters to tell you what is going on within a highly complex dynamic system.

