Why it makes sense to monitor SQL Server deadlocks in their own Extended Events trace

We recently had customer ask why SQL Monitor creates an Extended Events session to capture deadlock graphs, when SQL Server has a built-in system_health Extended Events trace which also captures deadlock information?

There are a couple of reasons why a dedicated trace is desirable for capturing deadlock graphs, whether you are rolling your own monitoring scripts or building a monitoring application. I like this question a lot because I feel it gets at an interesting tension/balance at the heart of monitoring itself.

Segmentation is helpful for users

One reason that SQL Monitor uses a separate Extended Events trace is to segment off what is used by SQL Monitor. This helps administrators understand what is impacted if they stop or modify an Extended Events session. 

While Microsoft recommends that administrators don’t stop, alter, or delete the system_health session, in practice it’s quite easy to do any of these things. An administrator might assume if they have installed other monitoring software that they could stop, delete, or modify the definition of system_health without impacting the alternate monitoring.

Extended Events sessions have retention policies

The current implementation of the system_health session is that it writes data both to an asynchronous ring buffer target and to an event_file. The event files currently have a maximum file size of 5MB and can roll over to 4 files.

While this is a sensible configuration for most systems, the system_health session collects multiple events. In some scenarios, it’s possible for these events to generate a significant amount of data quickly, which could plausibly use up a lot of the event file space and roll off other events. 

It’s also quite possible for Microsoft to add additional events or change the amount of data retained for these logs at any time.

These factors make it desirable to use a separate trace for distinct events which you care about for monitoring purposes. This way you control your own retention policies and can isolate events to their own traces as needed. 

Deadlock reports are lightweight to collect in Extended Events

When creating any Extended Events trace against a production instance, it’s important to evaluate the performance impact of the events you’re collecting. 

The xml_deadlock_report event, for example, is a lightweight event to collect. 

Other events have a greater impact on the instance. The most famous example of this: starting an Extended Events trace that collects ‘actual’ execution plans using the query_post_execution_showplan event can very quickly slow down a SQL Server instance. This is true even if you have applied a filter to only collect plans for a very specific query! This event unfortunately has a very high overhead which filtering does not reduce. (There are some alternatives, but it gets complex pretty fast.)

Monitoring is an art of balancing between observation and impact

I like this example because it gets at a core challenge of monitoring: we always need to balance the impact of observation with the benefits of the data we collect. This is always a tough problem as you build monitoring software, as monitoring queries are also subject to variances in query optimization and performance in a database, just like any other activities.

In the case of deadlock graphs, the impact of collecting these in a dedicated Extended Events session is low enough that the benefits of segmenting this out are persuasive, in my view.

Want to learn more about deadlocks?