23 August 2019

Guest post

This is a guest post from Phil Factor. Phil Factor (real name withheld to protect the guilty), aka Database Mole, has 30 years of experience with database-intensive applications.

Despite having once been shouted at by a furious Bill Gates at an exhibition in the early 1980s, he has remained resolutely anonymous throughout his career.

He is a regular contributor to Simple Talk and SQLServerCentral.

Phil Factor

23 August 2019

Monitoring TempDB Contention using Extended Events and Redgate Monitor

When the tempdb database is heavily used, processes in any database on the instance will be forced to wait, due to contention as the SQL Server engine tries to manage allocation pages in tempdb. Phil Factor shows how to monitor for signs of trouble.

Guest post

Sometimes, a database will slow down exponentially when it is working hard, with many concurrent users. Well, to be more accurate, all the databases on the instance will slow down.

SQL Server is an inherently parallel and multi-user, and designed to slow down ‘gracefully’, meaning linearly, as the number of concurrent users increases. The more recent the version of SQL Server, the better it conforms to this ideal. Pre-SQL Server 2016, when the tempdb database is heavily used, processes in any user database on the instance are often forced to wait, due to contention when the SQL Server engine tries to allocate pages in tempdb. With persistent high demand and increasing contention, any queries and requests that involve tempdb can become unresponsive. In SQL Server 2016 and later this specific form of allocation page contention is less of a problem because tempdbis better configured during installation and page allocation algorithms have been greatly improved (See Is Tempdb Causing a Bottleneck? for more on this).

In this article, I’ll show you how to capture latch waits, the most telling wait type for detecting stress in tempdb, using an Extended Events session, and then track it over time, using a Redgate Monitor custom metric.

What are the signs of tempdb contention?

There are a few tell-tale signs that tempdb contention is the cause of SQL Server performance and throughput problems. You might notice that, for several databases, response times are longer and disproportionate to the load. You are likely to see a precipitous decline in throughput (measured by such metrics as transactions/sec) across many databases.

When using performance diagnostics, you will see PAGELATCH_UP and PAGELATCH_EX wait types emanating from tempdb. When you recognize these symptoms, it likely that the contention is caused by too many processes fighting to access data in tempdb.

The PAGELATCH_XX waits that are often symptomatic of tempdb contention are non-IO waits to attain latches on data pages that are in-memory (i.e. in the buffer pool). The contention ‘hotspots’, in this case, are the allocation map pages that SQL Server uses to track use of the data pages in tempdb and which pages are allocated to which objects and which databases. Every time a tempdb data page is modified, SQL Server must also modify the relevant allocation pages, each time requiring an exclusive latch, which means this is a serial operation. When lots of small objects are being created and destroyed, then lots of threads need simultaneous access to the first few allocation pages in the tempdb data file, and queues form. Latch contention can happen for several reasons in any hard-working database, such as due to an “insert hotspot” on a clustered index with an IDENTITY key. However, it can sometimes be a particular problem in tempdb, where you can have lots of concurrent sessions, in various databases, all attempting to create and destroy objects in tempdb, and so all attempting to modify various of its in-memory data pages.

You can confirm that requests for tempdb data, or other tempdb tasks, are waiting via SSMS, and you can use the two Dynamic Management Views (DMVs) sys.dm_exec_requests or sys.dm_os_waiting_tasks to give you the “real time” wait profile of a SQL Server instance (the active requests and sessions associated with currently-suspended threads). You can also use sys.dm_os_wait_stats to get the historical perspective on your common causes of waits.

Easing the tempdb contention

Recent versions of SQL Server suffer far less from this problem. In addition to the round-robin allocation across all tempdb data files that was introduced in SQL Server 2016, PFS page allocation is also less liable to contention now that round-robin allocations are performed across several PFS pages in the same data file.

It remains a good policy to increase the number of data files in tempdb to maximize disk bandwidth and reduce contention in allocation structures. Generally, if you have eight or less logical processors then use the same number of tempdb data files as logical processors; Otherwise use eight data files. If you continue to see contention, then increase the number of data files by multiples of four up to the number of logical processors.

If you are using an old version of SQL Server, then Enable trace flag 1118 as a start-up trace flag.

However, when employing the above strategies, don’t forget that if the problem is caused by large and frequent queries that employ bad coding practices, then even the most skilled configuration work can only paper over the cracks. You should also examine the routines and batches that are associated with the problem and look for, and re-engineer, any code smells that are likely to be exacerbating the problem.

Monitoring SQL Servers for TempDB problems

If you query the sys.dm_os_waiting_tasks DMV with a filter like this…

WHERE wait_type LIKE 'PAGE%LATCH_%' AND resource_description LIKE '2:%'

…you’ll see a snapshot of currently active requests and sessions associated with suspended threads that are waiting to obtain a latch on tempdb resources.

However, whatever other evidence you have, the most important information to have is a good baseline for tempdb activity, which captures over time the typical profile for tempdb latching for your workload.

I find that the most telling metric for picking up stress in tempdb is the total duration of latch waits over a time period (I use one-minute time intervals). I’ll show you how to capture this data using an Extended Events session and then track it over time, using a Redgate Monitor custom metric. This will give us our baseline.

Figure 1: Tracking latch wait times in Redgate Monitor using a custom metric

If your baseline for this metrics varies around a particular figure then you’ll quickly notice anything unusual, because if the number of latches increases, or the duration of latch waits increases, or both, it will affect the total duration over the interval.

At this stage, you don’t want the detail, you just need to know that something unusual is happening. The easiest problem to solve is an intermittent one, where there are rogue processes that swamp tempdb. It is quite hard to increase the value a great deal, if you have a lot of well-written routines running concurrently, but a rogue query can cause a very clear spike.

It is always valuable in performance-testing to give the database a realistic workload that includes the routines that are suspect and look for a spike in the duration of latches. When I’m doing development work, I find Redgate Monitor a very handy tool for performance testing because it can allow you to spot problems, and then you can home in on the detail by querying the underlying event data directly. Figure 2 shows the results of using the query I provide later (in Listing 3).

Figure 2: tempdb latching profile over the recent period

Monitoring latch waits using Extended Events

First, we need to set up an Extended Events session to capture the data we need. The sqlserver.latch_suspend_begin event fires when a thread is suspended while waiting for a latch, and sqlserver.latch_suspend_end fires when it resumes.

In fact, we only need to collect the latch_suspend_end event because its duration event field also tells us how long the requesting thread had to wait, in microseconds, for the latch. We can filter to track only allocation contention inside tempdb.

CREATE EVENT SESSION [MonitorTempDB Contention] ON SERVER

ADD EVENT sqlserver.latch_suspend_end(

WHERE ([sqlserver].[database_id]=(2)))

ADD TARGET package0.ring_buffer

WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,

MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,

MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)

Listing 1: Create the MonitorTempDBContention event session

My goal here is to get the total duration of latch waits, in the last minute, so we need to collect all the event data, without any filter except for the database.

It might be useful to collect additional event data such as the session_id and sql_text associated with the session that was blocked, waiting for a latch on tempdb. If you wish, you can add these two global fields, as actions, to the above event session definition. You’ll need to test it out, gauge how much data it collects over a period, and how much overhead it will cause to have it running continuously.

Here, I’ve kept the event session minimalist, collecting only one event, and no additional actions. Later in the article, I’ll show another event session that only collects those events where the duration is significant and collects the additional event data, for investigating the blocked sessions.

The monitoring metric: total duration of latch waits in a minute

With this event session running, we can devise a query, using the collected event data, to find out the total latch wait duration over the previous minute (actually, I use the minute before the previous minute, just to be certain).

DECLARE @Target_Data XML =

(

SELECT TOP 1 Cast(xet.target_data AS XML) AS targetdata

FROM sys.dm_xe_session_targets AS xet

INNER JOIN sys.dm_xe_sessions AS xes

ON xes.address = xet.event_session_address

WHERE xes.name = 'MonitorTempDB Contention'

AND xet.target_name = 'ring_buffer'

);

SELECT Sum(Duration)

FROM

(

SELECT

Convert(datetime2,

SwitchOffset(CONVERT(datetimeoffset,xed.event_data.value('(@timestamp)[1]', 'datetime2')),

DateName(TzOffset, SYSDATETIMEOFFSET()))) AS Thedatetime,

xed.event_data.value('(data[@name="duration"]/value)[1]', 'int') AS [Duration]

FROM @Target_Data.nodes('//RingBufferTarget/event') AS xed (event_data)

)f(TheDateTime,Duration)

WHERE Convert(CHAR(17),TheDateTime,113)

= Convert(CHAR(17),DateAdd(MINUTE,-2,GetDate()),113)

Listing 2: Total latch wait time over the previous minute

If you want to test this out on your local instance, you’ll need to generate a bit of tempdb action. Listing 7 in Jonhathan Kehayias’ article, Optimizing tempdb configuration with SQL Server 2012 Extended Events will do the trick. Alternatively, you can use Barry Young’s or Jeff Moden’s TempDB Stress Tester from here, or Adam Machanic’s SQLQueryStress.

Installing the custom metric in Redgate Monitor

Having created and started the event session, we can install the query in Listing 2 as a custom metric, called MonitorTempDBContention, in Redgate Monitor. If you click on the previous link, you can install the metric directly from the https://sqlmonitormetrics.red-gate.com/ website.

You need to collect the data from tempdb only, every 1 minute. The appropriate alert thresholds are entirely dependent on your system, and you should only set them after establishing a baseline for your working instance. In my case, after some testing, I set it to raise a low alert with a threshold of 1500000, but this varies so much with the system that I’m not going to suggest that it is a good value to use.

In the Analysis section of Redgate Monitor, you can plot the metric over time, and you’ll see a graph like the one in Figure 1. If you spot odd spikes in latch wait duration you might as a first step use similar code to Listing 2, but this time getting a bit more detail.

DECLARE @Target_Data XML =

(

SELECT TOP 1 Cast(xet.target_data AS XML) AS targetdata

FROM sys.dm_xe_session_targets AS xet

INNER JOIN sys.dm_xe_sessions AS xes

ON xes.address = xet.event_session_address

WHERE xes.name = 'MonitorTempDB Contention'

AND xet.target_name = 'ring_buffer'

);

SELECT Convert(CHAR(16),TheDateTime) AS TheTime,

Sum(Duration) AS TotalDuration,

Count(*) AS No_Latch_Suspends,

Avg(Duration) AS Average_Duration

FROM

(

SELECT

Convert(datetime2,

SwitchOffset(CONVERT(datetimeoffset,xed.event_data.value('(@timestamp)[1]', 'datetime2')),

DateName(TzOffset, SYSDATETIMEOFFSET()))) AS Thedatetime,

xed.event_data.value('(data[@name="duration"]/value)[1]', 'int') AS [Duration]

FROM @Target_Data.nodes('//RingBufferTarget/event') AS xed (event_data)

)f(TheDateTime,Duration)

GROUP BY Convert(CHAR(16),TheDateTime)

Listing 3: Total per-minute latch times, and other metrics, over the recent period

Which will give results that will look somewhat like this

Figure 3: tempdb latching over the recent period

The event has more fields, however, and some of them are useful but at this stage we’re just interested in the broad landscape.

Figure 4: Event fields for the sqlserver.latch_suspend_end event

Investigating the causes of Latch contention

When you receive alerts, or spot abnormal patterns of behavior, you can start to drill into the details to discover the cause of the tempdb contention. We’ll want to drill into the detail to see what tasks are getting caught as a ‘contention victim’. When I was testing this, I found to my embarrassment that one of my scheduled tasks was causing contention in tempdb due to an injudicious query that cause the SQL Server optimizer to use a HASH MATCH join.

In Listing 4, I create a slightly modified version of our original event session that gives us more detail about a more select group of contention victims. We will just look at the queries that get held up for more 20 microseconds, but it could be a lot higher in an active system. We only want to see the out-and-out rogues. I’ve also collected some additional event data to capture the details we need about the blocked sessions, and what SQL they were executing.

CREATE EVENT SESSION TasksDelayedByTempDB

ON SERVER

ADD EVENT sqlserver.latch_suspend_end

(ACTION (sqlserver.session_id, sqlserver.sql_text)

WHERE (package0.greater_than_uint64(duration, (20)) AND database_id = (2))

)

ADD TARGET package0.ring_buffer

WITH

(

MAX_MEMORY = 4096KB, EVENT_RETENTION_MODE = ALLOW_SINGLE_EVENT_LOSS,

MAX_DISPATCH_LATENCY = 30 SECONDS, MAX_EVENT_SIZE = 0KB,

MEMORY_PARTITION_MODE = NONE, TRACK_CAUSALITY = OFF, STARTUP_STATE = ON

);

Listing 4: The TasksDelayedByTempDB event session

It is useful to have the session_id, because it is then possible to look in a lot more detail but for my purposes, it is good enough to eyeball the query itself rather than, say, looking at the query plans. Listing 5 shows a query that will tell you about the sessions being caught up in tempdb contention. I’ve given you more columns than you’ll need but you can just comment out what you don’t need.

SELECT

CONVERT(datetime2,

SwitchOffset(CONVERT(datetimeoffset,xed.event_data.value('(@timestamp)[1]', 'datetime2')),

DateName(TzOffset, SYSDATETIMEOFFSET()))) AS datetime_local,

xed.event_data.value('(data[@name="address"]/value)[1]', 'nvarchar(10)') AS [Address],

xed.event_data.value('(data[@name="class"]/value)[1]', 'char(3)') AS [class],

xed.event_data.value('(data[@name="database_id"]/value)[1]', 'int') AS [database_id],

xed.event_data.value('(data[@name="destroy_count"]/value)[1]', 'int') AS [destroy_count],

xed.event_data.value('(data[@name="duration"]/value)[1]', 'int') AS [duration],

xed.event_data.value('(data[@name="exclusive_count"]/value)[1]', 'int') AS [exclusive_count],

xed.event_data.value('(data[@name="file_id"]/value)[1]', 'int') AS [file_id],

xed.event_data.value('(data[@name="has_waiters"]/value)[1]', 'nvarchar(5)') AS [has_waiters],

xed.event_data.value('(data[@name="is_poisoned"]/value)[1]', 'nvarchar(5)') AS [is_poisoned],

xed.event_data.value('(data[@name="is_superlatch"]/value)[1]', 'nvarchar(5)') AS [is_superlatch],

xed.event_data.value('(data[@name="keep_count"]/value)[1]', 'int') AS [keep_count],

xed.event_data.value('(data[@name="mode"]/value)[1]', 'int') AS [mode],

xed.event_data.value('(data[@name="page_id"]/value)[1]', 'int') AS [page_id],

xed.event_data.value('(data[@name="shared_count"]/value)[1]', 'int') AS [shared_count],

xed.event_data.value('(data[@name="success"]/value)[1]', 'nvarchar(max)') AS [success],

xed.event_data.value('(data[@name="update_count"]/value)[1]', 'int') AS [update_count],

xed.event_data.value('(action[@name="session_id"]/value)[1]', 'int') AS Session_id,

xed.event_data.value('(action[@name="sql_text"]/value)[1]', 'nvarchar(max)') AS SQL_Text

FROM @Target_Data.nodes('//RingBufferTarget/event') AS xed (event_data)

Listing 5: Investigating tempdb contention

Conclusion

As database systems take on increasingly heavier loads, there will inevitably be places where data structures must be accessed in a serial fashion within the database engine. This can introduce contention points. This is especially so in tempdb where a great deal of concurrent operations take place. This need not need to be a special problem if the SQL queries that use tempdb are reasonably well designed and operate with a reasonable query plan.

Redgate Monitor helps by establishing a baseline and allowing us to spot those times where high contention happens. We can then use extended events to drill into the detail, find the culprits and deals with the problems. There are several tools, techniques and ways to approach this exercise as well as practices that can be followed in re-engineering applications to avoid them altogether.

References

Tools in this post

Redgate Monitor

Real-time multi-platform performance monitoring, with alerts and diagnostics

Find out more

Redgate Monitor

Real-time multi-platform performance monitoring, with alerts and diagnostics

Find out more

Product articles Redgate Monitor Performance monitoring and optimization
Monitoring TempDB Contention using…

Guest post

Monitoring TempDB Contention using Extended Events and Redgate Monitor

Guest post

What are the signs of tempdb contention?

Easing the tempdb contention

Monitoring SQL Servers for TempDB problems

Monitoring latch waits using Extended Events

The monitoring metric: total duration of latch waits in a minute

Installing the custom metric in Redgate Monitor

Investigating the causes of Latch contention

Conclusion

References

Tools in this post

Redgate Monitor

Redgate Monitor

Product Articles

University

Events

Forums

Community

Simple Talk

Product articles Redgate Monitor Performance monitoring and optimization Monitoring TempDB Contention using…

Guest post

Monitoring TempDB Contention using Extended Events and Redgate Monitor

Guest post

What are the signs of tempdb contention?

Easing the tempdb contention

Monitoring SQL Servers for TempDB problems

Monitoring latch waits using Extended Events

The monitoring metric: total duration of latch waits in a minute

Installing the custom metric in Redgate Monitor

Investigating the causes of Latch contention

Conclusion

References

Tools in this post

Redgate Monitor

Redgate Monitor

You may also like

How to monitor the impact of patching on SQL Server performance

What's new in SQL Monitor 11

Wait Class Monitoring for Oracle in Redgate Monitor

Redgate Monitor Support for Azure PostgreSQL Flexible Server

Investigating problems with ad-hoc queries using Redgate Monitor

Redgate Hub

Product Articles

University

Events

Forums

Community

Simple Talk

Product articles Redgate Monitor Performance monitoring and optimization
Monitoring TempDB Contention using…