How to optimize planned availability group failovers in SQL Server


We often perform planned availability group failovers in SQL Server for maintenance, patching, upgrades, and even hardware rotation. Typically, our failovers are fast, but sometimes they take longer – and it’s not always intuitive why, as there are no obvious ties to time of day, database size, or transaction volume.

Shaving even a handful of seconds from the process can improve the application and end user experience; it can also drastically reduce alert noise or, at least, how long alerts have to stay muted. There’s a lot of material out there about performing AG failovers correctly (no data loss), but far less that focuses on shortening the disruption window. The difference is usually some combination of redo volume, checkpoint behavior, open transactions, and secondary readiness.

I wanted to share some techniques I use to make planned failovers faster and more predictable. Some of these techniques are well documented, while others come from real-world patterns I’ve observed across many SQL Server environments. I’ll talk about what I do before, during, and after the failover to minimize disruption and increase the chance that end users are oblivious that anything happened.

How to prepare for a SQL Server failover: a checklist

There are several items on my pre-failover checklist that can help with operational times. Some are set-once configuration options, and some are run-once tasks:

Break apart your availability groups

Back when we had problems with failovers, one of the contributing factors was too many databases in an AG (nearly 400). When you fail over 400 databases at the same time, you’re putting a lot of stress on the soon-to-be-primary, and it can’t process all of them at once. Databases waiting to transition are stuck in limbo, which only increases the disruption window.

The bottleneck is the set of parallel recovery threads allocated per database; more databases mean more concurrent recovery workers competing for CPU and I/O. You may see worker thread errors while attempting to bring many databases online.
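A representative example is the classic thread pool exhaustion message, error 17884 (quoted roughly here; your exact text may differ):

  New queries assigned to process on Node 0 have not been picked up by a worker
  thread in the last 60 seconds. Blocking or long-running queries can contribute
  to this condition, and may degrade client response time. Use the "max worker
  threads" configuration option to increase number of allowable threads, or
  optimize current running queries.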

That doesn’t mean you should just increase max worker threads; that may be counter-productive unless you also increase the number of CPUs available to handle the work and somehow address bursts of concurrency.

Note that, unless you move AGs to other replicas, breaking your AGs into smaller sets won’t reduce the overall process time, since you just have to fail over more AGs. But it sure will make each AG finish its own failover quicker.

Enable indirect checkpoints

Any CHECKPOINT on the primary can reduce redo time, because the secondary will have fewer dirty pages to recover. This does not mean I can eliminate redo altogether, since it depends on the redo backlog and how much log is being generated during the failover window.

Depending on the version of SQL Server, and where the databases were originally created, they may still be set to use old-style checkpoints. The newer approach of using indirect checkpoints can have many performance benefits, most notably smoother I/O patterns. I talk about some of the benefits in Why Enable SQL Server Indirect Checkpoints and “0 to 60” : Switching to indirect checkpoints, and you can see Microsoft’s explanation of the changes in Changes in SQL Server 2016 Checkpoint Behavior.

To find any databases using legacy checkpoints:
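  -- a minimal sketch: a target recovery time of 0 means legacy automatic checkpoints
  SELECT name, target_recovery_time_in_seconds
    FROM sys.databases
   WHERE target_recovery_time_in_seconds = 0;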

To fix:
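  -- a sketch, assuming a hypothetical database name; 60 seconds is the
  -- default new databases get on SQL Server 2016 and up
  ALTER DATABASE [YourDatabase] SET TARGET_RECOVERY_TIME = 60 SECONDS;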

On modern versions of SQL Server, I haven’t found any compelling reason to keep the legacy checkpoint behavior.

Run manual checkpoints

Automatic checkpointing often lags in environments like ours, with more databases than CPU cores, because the engine cycles through databases cooperatively. Indirect checkpoints can help, but may struggle to keep up under extreme write pressure or when a lot of databases are involved.

Before failovers, I’ve been manually running checkpoints and, while this is anecdotal, have observed substantially reduced failover times as a result. Admittedly, I had to spend more time up front waiting for those checkpoints to run, but extra time before the failover is better than spending more time on redo during the failover.

I can generate the commands to run checkpoints in each database for a given AG, with an overly cautious filter to triple-check that I’m on primary:
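  -- a sketch, assuming a hypothetical AG named YourAG; run the generated
  -- commands manually after reviewing them
  SELECT N'USE ' + QUOTENAME(d.name) + N'; CHECKPOINT;'
    FROM sys.databases AS d
   INNER JOIN sys.dm_hadr_database_replica_states AS drs
      ON d.database_id = drs.database_id
   INNER JOIN sys.availability_groups AS ag
      ON drs.group_id = ag.group_id
   INNER JOIN sys.dm_hadr_availability_replica_states AS ars
      ON drs.group_id = ars.group_id
     AND drs.replica_id = ars.replica_id
   WHERE ag.name = N'YourAG'
     AND ars.is_local = 1               -- this instance...
     AND ars.role_desc = N'PRIMARY'     -- ...currently holds the primary role
     AND drs.is_primary_replica = 1;    -- ...for these specific databases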

If this takes a while, I run it again to capture any new bursts of write activity that occurred after the first pass. And I fail over that AG before I start checkpoints on the next AG.

Something I can add to the query to minimize any gaps is an ORDER BY, sorting the databases so that the busiest ones run last (at least when I know, or take the time to determine, which databases are busiest or have the highest page churn).

On our primary public system, there should be no surprise which database is our busiest; I run CHECKPOINT in a loop against only that database, every dozen seconds or so, right up to the point I initiate the failover. But this is in a very specific, very busy database; once dirty page volume is already low, repeated checkpoints offer diminishing returns and are likely to just move work around rather than reduce it.
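A minimal sketch of that loop, assuming a hypothetical database named BusyDb and the 12-second cadence (kill the session when you’re ready to initiate the failover):

  USE BusyDb;   -- hypothetical name for the busy database
  WHILE 1 = 1
  BEGIN
      CHECKPOINT;
      WAITFOR DELAY '00:00:12';   -- every dozen seconds or so
  END;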


Disable long-running processes

Failover cannot complete until the secondary has applied all log records sent from the primary via the send queue. The send queue size directly influences failover time, as Microsoft documents in Monitor AG performance – along with a detailed list of operations in the data synchronization process. Even a modest backlog slows recovery; quieting heavy workloads before failover allows the queues to clear.

We disable our log backup jobs shortly before the failover to eliminate additional log generation and movement. We avoid performing failovers anywhere near full or differential backups (if these conflict, we adjust the start date of those jobs to well after the expected end of the maintenance window). We also disable any jobs known to perform heavy writes or long-running transactions – think ETL-style jobs and index maintenance – which can extend undo and redo time.
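As a sketch, disabling a job ahead of the window looks like this (assuming a hypothetical job named 'Log Backups'; flip @enabled back to 1 after the failover):

  EXEC msdb.dbo.sp_update_job
      @job_name = N'Log Backups',   -- hypothetical job name
      @enabled  = 0;                -- set back to 1 afterward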

Consider enabling Accelerated Database Recovery (ADR)

If you’re reading this today and your failover is planned for tonight, it’s almost certainly too late for this, but Accelerated Database Recovery (ADR) is a setting we have enabled everywhere and has helped in many aspects – including detaching recovery time from transaction length (I talk about how this helps in Accelerated Database Recovery in SQL Server 2019).
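Enabling it is a one-line change per database, run on the primary (a sketch, using a hypothetical database name; the setting persists through failovers):

  -- flipping this setting takes an exclusive database lock, so pick a quiet moment
  ALTER DATABASE [YourDatabase] SET ACCELERATED_DATABASE_RECOVERY = ON;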

For failovers specifically, this makes undo asynchronous, though it can’t change redo… so even with ADR, disabling long-running processes and anything else you can do to minimize redo time will still be a win.

Disable read-only routing

If the primary can handle it, I disable read-only routing several hours before the failover, or before whatever action is leading up to the failover (like patching, upgrading, or rebooting the readable secondary first). This doesn’t make the failover faster, but it makes the pre-failover and secondary maintenance phases safer.
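A sketch of how that might look, assuming a hypothetical AG named YourAG whose current primary is YourPrimaryReplica (note the existing routing list first, via sys.availability_read_only_routing_lists, so you can restore it afterward):

  -- run on the primary; read-intent connections will no longer be routed away
  ALTER AVAILABILITY GROUP [YourAG]
      MODIFY REPLICA ON N'YourPrimaryReplica'
      WITH (PRIMARY_ROLE (READ_ONLY_ROUTING_LIST = NONE));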

Depending on how much read-intent traffic the applications are sending to the secondary, it can take a while to drain. Applications may fail if they use connection pooling and some of those sticky sessions are still holding, and using, connections to the secondary while it is being patched or rebooted.

If I can roll through the application nodes and recycle them or otherwise flush their connection pools, any re-established sessions should only be connected to the primary. I monitor the active connections on the secondary and make sure they are winding down to almost nothing (ignoring usual suspects like monitoring tools and system sessions).
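A quick way to watch that drain on the secondary (a sketch; extend the filter to exclude your own monitoring tools and other usual suspects):

  SELECT program_name, login_name, sessions = COUNT(*)
    FROM sys.dm_exec_sessions
   WHERE is_user_process = 1
   GROUP BY program_name, login_name
   ORDER BY sessions DESC;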

Warm the secondary

It can be very helpful to warm up the plan cache and buffer pool on the current secondary before it becomes primary. This is especially true if I’ve restarted the secondary before the failover, since its plan cache and buffer pool will both be empty (and because, hopefully, no read-only traffic is going there yet).

I warm the plan cache by executing a battery of known, expensive SELECT queries against the secondary (after it is restarted, if it is being restarted), to avoid a compile storm when it becomes primary and all those queries suddenly hit it all at once.

An even bigger benefit is filling the buffer pool so that users bypass I/O-related waits while those pages get loaded into memory. To do this, I perform a series of queries that cause full scans of our largest tables and most frequently used indexes. I like the COUNT_BIG(*) approach, which forces a full index scan and minimizes per-row CPU overhead. I’ve seen techniques out there like CHECKSUM_AGG(CHECKSUM()) which is likely to punish my CPUs a lot more.
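A sketch of the pattern, using a hypothetical large table and index name (without the hint, the optimizer will typically scan the narrowest index; the hint lets you load a specific one):

  SELECT COUNT_BIG(*) FROM dbo.BigTable;                                -- narrowest index
  SELECT COUNT_BIG(*) FROM dbo.BigTable WITH (INDEX (IX_BigTable_Hot)); -- a specific index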

This does not guarantee identical performance immediately after failover, but significantly reduces the initial shock of cold cache and compile storms.

Note that these options aren’t possible in an upgrade scenario; the secondary won’t be readable after it has been upgraded but before it has become primary. Microsoft says, “Only replicas that are on the same major build of SQL Server will be readable.” In that scenario, it’s still something I can do immediately after it has become primary; it’s certainly a better end user experience if I suffer some of those resource pains instead of them.

Utilize Query Store trace flags

In environments where I use Query Store, there are two trace flags that can help reduce the time spent during failovers, service restarts, and reboots, especially if Query Store is large.

  • TF 7745
    This prevents Query Store from flushing its data to disk during database shutdown (at the cost of losing any data still only in memory, which is acceptable to me). Erin Stellato talks about this trace flag in Query Store Best Practices.

  • TF 7752
    Prior to SQL Server 2019, the default Query Store behavior was to load synchronously, blocking queries until it was fully loaded (you may see QDS_LOADDB waits on startup). Under this trace flag (and by default since SQL Server 2019, where it is no longer necessary), Query Store loads asynchronously and is read-only until it has loaded. Erin also talks about this behavior in both the aforementioned article and in Query Store Trace Flags.

Neither trace flag affects redo, checkpoint behavior, or failover logic… but they can remove delays that contribute to longer perceived failover times.
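For reference, TF 7745 can be enabled globally at runtime, while TF 7752 only matters during startup, so it needs to be a startup parameter (a sketch):

  -- runtime + global; add -T7745 to the startup parameters to persist it
  DBCC TRACEON (7745, -1);
  -- TF 7752 influences Query Store load at startup, so add -T7752 as a
  -- startup parameter via Configuration Manager (unnecessary on SQL Server 2019+)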

Validate configurations

Before the failover, I make sure that server-level objects like logins and linked servers are consistent and functioning correctly. I validate that SQL Server Agent jobs are identical and, when necessary, are primary/secondary aware. If any of these are misconfigured after the failover, they will slow down my ability to get back to 100% functionality, and users may notice.
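Two quick spot checks I can script (sketches; the linked server name is hypothetical): compare login inventories across replicas, and smoke-test each linked server.

  -- run on each replica and diff the results
  SELECT name, type_desc
    FROM sys.server_principals
   WHERE type IN ('S','U','G')      -- SQL logins, Windows logins/groups
     AND name NOT LIKE N'##%'       -- skip certificate-based system logins
   ORDER BY name;

  -- raises an error if the linked server can't connect
  EXEC master.dbo.sp_testlinkedserver N'YourLinkedServer';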

Check potential redo time

You can calculate a rough, rough, really rough estimate of how long a failover might take if initiated right now, based on send and redo queue size, redo rate, and the age of active transactions on the primary.
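A simplified sketch of such a query (queue sizes are reported in KB, redo_rate in KB/sec, so the division gives a crude seconds-remaining figure):

  SELECT ag_name          = ag.name,
         database_name    = d.name,
         drs.log_send_queue_size,   -- KB not yet sent to the secondary
         drs.redo_queue_size,       -- KB received but not yet redone
         drs.redo_rate,             -- KB/sec
         est_redo_seconds = drs.redo_queue_size * 1.0
                            / NULLIF(drs.redo_rate, 0)
    FROM sys.dm_hadr_database_replica_states AS drs
   INNER JOIN sys.availability_groups AS ag ON drs.group_id    = ag.group_id
   INNER JOIN sys.databases           AS d  ON drs.database_id = d.database_id
   WHERE drs.is_local = 0;   -- run on the primary; these are the secondary's rows

  -- age of the oldest active transaction on the primary (undo exposure)
  SELECT oldest_start = MIN(transaction_begin_time)
    FROM sys.dm_tran_active_transactions;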

The results give you, per database, the current queue sizes, the redo rate, and a crude estimate of the redo seconds remaining.

These won’t combine to give you some magical stopwatch or go/no-go Boolean but, if you monitor them leading up to the operation, they should give a good gauge of whether now is “the right time.” There are likely other columns you’ll want to look at, too: last_sent_time, last_hardened_time, last_redone_time, etc. Watching exactly these signals is how I chose the timing of my most recent planned failover.


What to do during a SQL Server failover

Once the failover begins, very little can influence the user experience. I always expect a brief connection interruption; anyone connected as the database changes state will need to re-establish their connection. With correct listener usage, retry logic, and connection strings that include MultiSubnetFailover=True, applications should recover quickly.

Accelerated Database Recovery makes undo negligible, so redo is really the only part of the process that will be volatile. This is why anything that reduces redo volume directly shortens the recovery portion of the failover. The only other thing I can do here is to wait it out.

What happens after a SQL Server failover?

A role change may finish quickly, but the new primary still starts cold. I described above some of the things I perform on the secondary, when possible, before it becomes primary (for example, warming the plan cache and buffer pool). When I can’t do these on the secondary before the failover, I perform them immediately after it has taken over.

If redo backlog is high at the time of failover, applications may see impact even after re-establishing connections – things like blocking, high I/O and latch contention, and WRITELOG waits. They may also experience slower queries due to “cold pages” – pages that are not yet in the buffer pool or are still waiting on redo to complete. If I see that redo backlog is high, or other signs that storage isn’t keeping up, now that we’re in the cloud I can very easily scale the storage up (for example, bumping up provisioned IOPS).

Aside from waiting, the only other things I can do to minimize impact to end users are to validate that everything is still configured the same (compare logins and jobs, re-enable any jobs I disabled before the failover, and re-test all linked servers), and to confirm that monitoring and observability tools show green dashboards and recognize the new primary. While the databases might be happy, a failover that finishes fast will still feel like an outage if I leave anything else broken.

Failover improvements in SQL Server 2025

While these alone may not drive an upgrade decision, there are some built-in failover improvements to look forward to in SQL Server 2025 that do not require any configuration changes, settings, or trace flags: faster redo and worker scheduling, quicker Query Store loading, and lower overhead for Accelerated Database Recovery.

These enhancements are designed to reduce volatility by making buffer pool and plan cache warm-up more effective and providing more predictable catch-up behavior.  

That said, SQL Server 2025 is more aggressive with getting databases online, so I expect the resource and concurrency concerns to be higher in certain environments.

FAQs: How to optimize planned availability group failovers in SQL Server

1. What is a planned AG failover in SQL Server?

A planned AG failover in SQL Server switches the primary and secondary replicas in an Availability Group for maintenance, upgrades, or patching without data loss, minimizing disruption to users.

2. Why do SQL Server failovers sometimes take longer?

SQL Server failover duration depends on redo volume, checkpoint behavior, open transactions, and secondary readiness. High redo or cold caches can extend downtime even for small databases.

3. How can I speed up planned failovers?

Key techniques for speeding up planned failovers in SQL Server include:

  • Indirect checkpoints to reduce redo.

  • Manual checkpoints on busy databases.

  • Disabling log backups and long-running jobs temporarily.

  • Warming plan cache and buffer pool on the secondary.

  • Validating server configs and jobs before failover.

4. What are indirect checkpoints in SQL Server, and why are they important?

Indirect checkpoints in SQL Server provide smoother I/O and faster recovery by reducing dirty pages, which shortens redo time during failovers.

5. Should large SQL Server AGs be split into smaller groups?

Yes. Splitting large SQL Server Availability Groups reduces parallel recovery contention, making each failover faster and more predictable.

6. Does disabling read-only routing or log backups help?

Yes. Temporarily disabling log backups and other heavy jobs reduces log generation and the redo workload, while disabling read-only routing makes pre-failover maintenance on the secondary safer – together helping the SQL Server failover complete more quickly.

7. How does warming the secondary help failovers in SQL Server?

Pre-loading plan cache and buffer pool prevents cold cache slowdowns and query compile storms, improving performance immediately after failover in SQL Server.

8. What SQL Server 2025 features improve failovers?

SQL Server 2025 adds:

  • Faster redo and worker scheduling

  • Quicker Query Store load

  • Lower ADR overhead

These make failovers more predictable and less disruptive.

9. How can I estimate the duration of a SQL Server failover?

Check redo/send queue sizes, redo rate, and active transactions on the primary to predict SQL Server failover timing and pick the optimal window.

10. What should I do after a SQL Server failover?

  • Validate logins, linked servers, and jobs

  • Re-enable backups and jobs

  • Warm plan cache and buffer pool if needed

  • Check monitoring dashboards for new primary status


About the author

Aaron Bertrand


Aaron Bertrand (he/him), a cheery Canadian with industry experience dating back to SQL Server 6.5, is a Staff Database Reliability Engineer for Stack Overflow. He also blogs at sqlblog.org, mssqltips.com, and thebertrandfamily.com.