{"id":108370,"date":"2026-02-20T14:00:00","date_gmt":"2026-02-20T14:00:00","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=108370"},"modified":"2026-02-17T10:42:53","modified_gmt":"2026-02-17T10:42:53","slug":"how-to-optimize-planned-availability-group-failovers-in-sql-server","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/how-to-optimize-planned-availability-group-failovers-in-sql-server\/","title":{"rendered":"How to optimize planned availability group failovers in SQL Server"},"content":{"rendered":"\n<p>We often perform planned availability group failovers in <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL Server<\/a> for <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/database-maintenance-plans-in-ssms-an-overview\/\" target=\"_blank\" rel=\"noreferrer noopener\">maintenance<\/a>, patching, upgrades, and even hardware rotation. Typically, our failovers are fast, but sometimes they take longer \u2013 and it\u2019s not always intuitive why, as there are no obvious ties to time of day, database size, or transaction volume.<\/p>\n\n\n\n<p>Shaving even a handful of seconds from the process can improve the application and end user experience; it can also drastically reduce alert noise or, at least, how long <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/sql-server-alerts-soup-to-nuts\/\" target=\"_blank\" rel=\"noreferrer noopener\">alerts<\/a> have to stay muted. There\u2019s a lot of material out there about <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/learn\/sql-server-high-availability-and-disaster-recovery-plan\/\" target=\"_blank\" rel=\"noreferrer noopener\">performing AG failovers correctly<\/a> (no data loss), but far less that focuses on shortening the disruption window. The difference is usually some combination of redo volume, checkpoint behavior, open transactions, and secondary readiness.<\/p>\n\n\n\n<p>I wanted to share some techniques I use to make planned failovers faster and more predictable. Some of these techniques are well documented, while others come from real-world patterns I\u2019ve observed across many SQL Server environments. I\u2019ll talk about what I do before, during, and after the failover to minimize disruption and increase the chance that end users are oblivious that anything happened.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-to-prepare-for-a-sql-server-failover-a-checklist\">How to prepare for a SQL Server failover: a checklist<\/h2>\n\n\n\n<p>There are several items on my pre-failover checklist that can help with operational times. Some are set-once configuration options, and some are run-once tasks:<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-break-apart-your-availability-groups\">Break apart your availability groups<\/h3>\n\n\n\n<p>Back when we had problems with failovers, one of the contributing factors was too many databases in an AG (nearly 400). When you fail over 400 databases at the same time, you\u2019re putting a lot of stress on the soon-to-be-primary, and it can\u2019t process all of them at once. Databases waiting to transition are stuck in limbo, and only increase the disruption window.<\/p>\n\n\n\n<p>The bottleneck is the set of parallel recovery threads allocated per database; more databases means more concurrent recovery workers competing for CPU and I\/O. You may see errors like this one while attempting to bring many databases online:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">Msg 35217, Severity 16, State 1\nThe thread pool for Always On Availability Groups was unable to start a new worker thread because \nthere are not enough available worker threads.  This may degrade Always On Availability Groups\nperformance. Use the \"max worker threads\" configuration option to increase number of allowable \nthreads.\n<\/pre><\/div>\n\n\n\n<p>That doesn\u2019t mean just increase <a href=\"https:\/\/www.brentozar.com\/archive\/2017\/02\/max-worker-threads-dont-touch\/\" target=\"_blank\" rel=\"noreferrer noopener\">max worker threads<\/a>, which may be counter-productive unless you also increase the number of CPUs able to conquer the work and somehow address bursts of concurrency.<\/p>\n\n\n\n<p>Note that, unless you move AGs to other replicas, breaking your AGs into smaller sets won\u2019t reduce the <em>overall process time<\/em>, since you just have to fail over more AGs. But it sure will make each AG finish its own failover quicker.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Enable indirect checkpoints<\/h3>\n\n\n\n<p>Any <code>CHECKPOINT<\/code> on the primary can reduce redo time, because the secondary will have fewer dirty pages to recover. This does not mean I can eliminate redo altogether, since it depends on the redo backlog and how much log is being generated during the failover window.<\/p>\n\n\n\n<p>Depending on the version of SQL Server, and where the databases were originally created, they may still be set to use old-style checkpoints. The newer approach of using indirect checkpoints can have many performance benefits, most notably smoother I\/O patterns. I talk about some of the benefits in <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/7108\/sql-server-indirect-checkpoints-why-enable\/\" target=\"_blank\" rel=\"noreferrer noopener\">Why Enable SQL Server Indirect Checkpoints<\/a> and <a href=\"https:\/\/sqlperformance.com\/2020\/05\/system-configuration\/0-to-60-switching-to-indirect-checkpoints\" target=\"_blank\" rel=\"noreferrer noopener\">&#8220;0 to 60&#8221; : Switching to indirect checkpoints<\/a>, and you can see Microsoft\u2019s explanation of the changes in <a href=\"https:\/\/techcommunity.microsoft.com\/blog\/sqlserver\/changes-in-sql-server-2016-checkpoint-behavior\/305342\" target=\"_blank\" rel=\"noreferrer noopener\">Changes in SQL Server 2016 Checkpoint Behavior<\/a>.<\/p>\n\n\n\n<p>To find any databases using legacy checkpoints:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">SELECT name \n  FROM sys.databases \n WHERE target_recovery_time_in_seconds = 0;\n<\/pre><\/div>\n\n\n\n<p>To fix:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">ALTER DATABASE &lt;db name&gt; \n  SET TARGET_RECOVERY_TIME = 60 SECONDS;\n<\/pre><\/div>\n\n\n\n<p>On modern versions of SQL Server, I haven\u2019t found any compelling reason to keep the legacy checkpoint behavior.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Run manual checkpoints<\/h3>\n\n\n\n<p>Automatic checkpointing often lags in environments like ours, with more databases than CPU cores, because the engine cycles through databases cooperatively. Indirect checkpoints can help, but may struggle to keep up under extreme write pressure or when a lot of databases are involved.<\/p>\n\n\n\n<p>Before failovers, I\u2019ve been manually running checkpoints and, while this is anecdotal, have observed substantially reduced failover times as a result. Admittedly, I had to spend more time up front waiting for those checkpoints to run, but extra time <em>before<\/em> the failover is better than spending more time on redo <em>during<\/em> the failover.<\/p>\n\n\n\n<p>I can generate the commands to run checkpoints in each database for a given AG, with an overly cautious filter to triple-check that I\u2019m on primary:<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">DECLARE @AGName sysname = N'&lt;ag name&gt;',\n        @sql    nvarchar(max);\nSELECT @sql = STRING_AGG\n              (\n                CONVERT(nvarchar(max), \n                CONCAT(N'EXEC ', QUOTENAME(name), \n                N'.sys.sp_executesql N''CHECKPOINT'';')), \n                char(13)+char(10)\n              )\n  FROM sys.databases AS d\n WHERE d.state = 0 \n   AND EXISTS\n   (\n     SELECT 1\n       FROM sys.dm_hadr_database_replica_states AS rs\n      INNER JOIN sys.availability_groups AS ags\n         ON rs.group_id = ags.group_id\n      WHERE rs.database_id        = d.database_id\n        AND ags.name              = @AGName\n        AND rs.is_local           = 1\n        AND rs.is_primary_replica = 1\n   );\nPRINT @sql;\n\/* EXEC sys.sp_executesql @sql; *\/<\/pre><\/div>\n\n\n\n<p>If this takes a while, I run it again to capture any new bursts of write activity that occurred after the first pass. And I fail over that AG before I start checkpoints on the next AG.<\/p>\n\n\n\n<p>Something I can add to the query to minimize any gaps is an <code>ORDER BY<\/code> to sort the databases so that the <em>busiest<\/em> databases run last (at least when I know or take the time to determine which databases are busiest, or have the highest page churn.) <\/p>\n\n\n\n<p>On our primary public system, there should be no surprise which database is our busiest; I run <code>CHECKPOINT<\/code> in a loop against only that database, every dozen seconds or so, right up to the point I initiate the failover. But this is in a very specific, very busy database; once dirty page volume is already low, repeated checkpoints may provide diminishing returns and are generally likely to just move work instead of reduce work.<\/p>\n\n\n\n<section id=\"my-first-block-block_65d42c418c3b78ea394f1f52dcea2906\" class=\"my-first-block alignwide\">\n    <div class=\"bg-brand-600 text-base-white py-5xl px-4xl rounded-sm bg-gradient-to-r from-brand-600 to-brand-500 red\">\n        <div class=\"gap-4xl items-start md:items-center flex flex-col md:flex-row justify-between\">\n            <div class=\"flex-1 col-span-10 lg:col-span-7\">\n                <h3 class=\"mt-0 font-display mb-2 text-display-sm\">Fast, reliable and consistent SQL Server development&#8230;<\/h3>\n                <div class=\"child:last-of-type:mb-0\">\n                                            &#8230;with SQL Toolbelt Essentials. 10 ingeniously simple tools for accelerating development, reducing risk, and standardizing workflows.                                    <\/div>\n            <\/div>\n                                            <a href=\"https:\/\/www.red-gate.com\/products\/sql-toolbelt-essentials\/\" class=\"btn btn--secondary btn--lg\" aria-label=\"Learn more &amp; try for free: Fast, reliable and consistent SQL Server development...\">Learn more &amp; try for free<\/a>\n                    <\/div>\n    <\/div>\n<\/section>\n\n\n<h3 class=\"wp-block-heading\">Disable long-running processes<\/h3>\n\n\n\n<p>Failover cannot complete until the secondary has applied all log records sent from the primary via the send queue. The send queue size directly influences failover time, as Microsoft documents in <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/database-engine\/availability-groups\/windows\/monitor-performance-for-always-on-availability-groups\" target=\"_blank\" rel=\"noreferrer noopener\">Monitor AG performance<\/a> \u2013 along with a detailed list of operations in the data synchronization process. Even a modest backlog slows recovery; quieting heavy workloads before failover allows the queues to clear.<\/p>\n\n\n\n<p>We disable our <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/sql-server-backup-types\/\" target=\"_blank\" rel=\"noreferrer noopener\">log backup<\/a> jobs shortly before the failover to eliminate additional log generation and movement. We avoid performing failovers anywhere near full or differential backups (if these conflict, we adjust the start date of those jobs to well after the expected end of the maintenance window). We also disable any jobs known to perform heavy writes or long-running transactions \u2013 think ETL-style jobs and <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/do-not-forget-to-maintain-your-indexes\/\" target=\"_blank\" rel=\"noreferrer noopener\">index maintenance<\/a> \u2013 which can extend undo and redo time.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Consider enabling Accelerated Database Recovery (ADR)<\/h3>\n\n\n\n<p>If you\u2019re reading this today and your failover is planned for tonight, it\u2019s almost certainly too late for this, but Accelerated Database Recovery (ADR) is a setting we have enabled everywhere and has helped in many aspects \u2013 including detaching recovery time from transaction length (I talk about how this helps in <a href=\"https:\/\/www.mssqltips.com\/sqlservertip\/5971\/accelerated-database-recovery-in-sql-server-2019\/\" target=\"_blank\" rel=\"noreferrer noopener\">Accelerated Database Recovery in SQL Server 2019<\/a>). <\/p>\n\n\n\n<p>For failovers specifically, this makes undo asynchronous, though it can\u2019t change redo\u2026 so even with ADR, disabling long-running processes and anything else you can do to minimize redo time will still be a win.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Disable read-only routing<\/h3>\n\n\n\n<p>If the primary can handle it, I disable <a href=\"https:\/\/sqlatspeed.com\/2017\/12\/20\/create-a-read-only-routing-list-a-quick-how-to\/\" target=\"_blank\" rel=\"noreferrer noopener\">read-only routing<\/a> several hours before the failover, or before whatever action is leading up to the failover (like patching, upgrading, or rebooting the readable secondary first). This doesn\u2019t make the failover faster, but it makes the pre-failover and secondary maintenance phases safer. <\/p>\n\n\n\n<p>Depending on how much read-intent traffic the applications are sending to the secondary, it can take a while to drain. The application may fail if it is using connection pooling and some of those sticky sessions are still holding and using connections to the secondary while patching or rebooting. <\/p>\n\n\n\n<p>If I can roll through the application nodes and recycle them or otherwise flush their connection pools, any re-established sessions should only be connected to the primary. I monitor the active connections on the secondary and make sure they are winding down to almost nothing (ignoring usual suspects like <a href=\"https:\/\/www.red-gate.com\/products\/redgate-monitor\/\" target=\"_blank\" rel=\"noreferrer noopener\">monitoring tools<\/a> and system sessions).<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Warm the secondary<\/h3>\n\n\n\n<p>It can be very helpful to warm up the <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/performance-sql-server\/sql-server-plan-cache-mining-plan-attributes\/\" target=\"_blank\" rel=\"noreferrer noopener\">plan cache<\/a> and buffer pool on the current secondary <em>before<\/em> it becomes primary. This is especially true if I\u2019ve restarted the secondary before the failover, since its plan cache and buffer pool will both be empty (and because, hopefully, no read-only traffic is going there yet).<\/p>\n\n\n\n<p>I warm the plan cache by executing a battery of known, expensive <code>SELECT<\/code> queries against the secondary (after it is restarted, if it is being restarted), to avoid a compile storm when it becomes primary and all those queries suddenly hit it all at once.<\/p>\n\n\n\n<p>An even bigger benefit is filling the buffer pool so that users bypass I\/O-related waits while those pages get loaded into memory. To do this, I perform a series of queries that cause full scans of our largest tables and most frequently used indexes. I like the <code>COUNT_BIG(*)<\/code> approach, which forces a full index scan and minimizes per-row CPU overhead. I\u2019ve seen techniques out there like <code>CHECKSUM_AGG(CHECKSUM())<\/code> which is likely to punish my CPUs a lot more.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">SELECT COUNT_BIG(*) \n  FROM dbo.&lt;huge table&gt; WITH (INDEX(1));\nSELECT COUNT_BIG(*) \n  FROM &lt;huge table&gt; WITH (INDEX(&lt;heavily used index&gt;));\n<\/pre><\/div>\n\n\n\n<p>This does not guarantee identical performance immediately after failover, but significantly reduces the initial shock of cold cache and compile storms.<\/p>\n\n\n\n<p>Note that these options aren\u2019t possible in an upgrade scenario; the secondary won\u2019t be readable after it has been upgraded but before it has become primary. Microsoft says, \u201c<a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/database-engine\/availability-groups\/windows\/active-secondaries-readable-secondary-replicas-always-on-availability-groups?view=sql-server-ver17#bkmk_Prerequisites\" target=\"_blank\" rel=\"noreferrer noopener\">Only replicas that are on the same major build of SQL Server will be readable.<\/a>\u201d In that scenario, it\u2019s still something I can do immediately <em>after<\/em> it has become primary; it\u2019s certainly a better end user experience if <em>I<\/em> suffer some of those resource pains instead of them.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-utilize-query-store-trace-flags\">Utilize Query Store trace flags<\/h3>\n\n\n\n<p>In environments where I use <a href=\"https:\/\/www.red-gate.com\/simple-talk\/blogs\/t-sql-tuesday-181-query-store-and-its-evolution\/\" target=\"_blank\" rel=\"noreferrer noopener\">Query Store<\/a>, there are two trace flags that can help reduce the time spent during failovers, service restarts, and reboots, especially if Query Store is large.<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>TF 7745<\/strong><br>This prevents Query Store from flushing its data to disk during database shutdown (at the cost of losing any data still only in memory, which is acceptable to me). Erin Stellato talks about this trace flag in <a href=\"https:\/\/www.sqlskills.com\/blogs\/erin\/query-store-best-practices\/\" target=\"_blank\" rel=\"noreferrer noopener\">Query Store Best Practices<\/a>.<br><br><\/li>\n\n\n\n<li><strong>TF 7752<\/strong><br>Prior to SQL Server 2019, default Query Store behavior was to load synchronously, blocking queries until it was fully loaded (you may experience <code>QDS_LOADDB<\/code> waits on startup). Under this trace flag (and by default since SQL Server 2019, where the trace flag becomes no longer necessary), Query Store loads asynchronously and is read-only until it has loaded. Erin also talks about this behavior in both the aforementioned article and in <a href=\"https:\/\/www.sqlskills.com\/blogs\/erin\/query-store-trace-flags\/\" target=\"_blank\" rel=\"noreferrer noopener\">Query Store Trace Flags<\/a>.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>Neither trace flag affects redo, checkpoint behavior, or failover logic\u2026 but they can remove delays that contribute to longer <em>perceived<\/em> failover times.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Validate configurations<\/h3>\n\n\n\n<p>Before the failover, I make sure that server-level objects like logins and linked servers are consistent and functioning correctly. I validate that <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/setting-up-your-sql-server-agent-correctly\/\" target=\"_blank\" rel=\"noreferrer noopener\">SQL Server Agent jobs<\/a> are identical and, when necessary, are primary\/secondary aware. If any of these are misconfigured after the failover, they will slow down my ability to get back to 100% functionality, and users may notice.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Check potential redo time<\/h3>\n\n\n\n<p>You can calculate a rough, rough, <em>really rough<\/em> estimate of how long a failover might take if initiated right now, based on send and redo queue size, redo rate, and the age of active transactions on the primary.<\/p>\n\n\n\n<div class=\"wp-block-urvanov-syntax-highlighter-code-block\"><pre class=\"lang:tsql decode:true \">\/* run on primary *\/\n DECLARE @AGName nvarchar(128) = N'ag name',\n         @DBName nvarchar(128) = NULL;\n;WITH drs AS\n(\n     SELECT ag = ag.name, \n            [replica] = ar.replica_server_name, \n            drs.*\n       FROM sys.dm_hadr_database_replica_states AS drs\n      INNER JOIN sys.availability_replicas AS ar\n        ON drs.replica_id = ar.replica_id\n      INNER JOIN sys.availability_groups AS ag\n         ON ar.group_id = ag.group_id\n      WHERE ag.name = COALESCE(@AGName, ag.name) \n        AND drs.database_id = COALESCE(DB_ID(@DBName), drs.database_id)\n),\npri AS \n(\n  SELECT drs.database_id, \n         drs.group_id, \n         oldest_pri_tran = MIN(t.transaction_begin_time), \n         last_pri_commit = MIN(drs.last_commit_time) \n    FROM drs\n   OUTER APPLY \n   (\n     SELECT db = dt.database_id,\n            t.transaction_begin_time\n       FROM sys.dm_tran_database_transactions AS dt\n      INNER JOIN sys.dm_tran_active_transactions AS t\n         ON dt.transaction_id = t.transaction_id\n      WHERE dt.database_id = drs.database_id\n   ) AS t\n   WHERE drs.is_local = 1\n     AND drs.is_primary_replica = 1\n   GROUP BY drs.database_id, drs.group_id\n)\nSELECT drs.ag, \n       [replica]  = drs.[replica],\n       [database] = DB_NAME(pri.database_id),\n       sync_state = drs.synchronization_state_desc,\n       health     = drs.synchronization_health_desc, \n       [susp?]    = drs.is_suspended,\n       pri.oldest_pri_tran, \n       pri.last_pri_commit, \n       last_sec_commit = drs.last_commit_time,\n       delta = DATEDIFF(MILLISECOND, drs.last_commit_time, pri.last_pri_commit),\n       [lag]      = drs.secondary_lag_seconds,\n       send_kb    = drs.log_send_queue_size,\n       redo_kb    = drs.redo_queue_size,\n       drs.redo_rate,\n       approx_redo_sec = CONVERT(decimal(19,2), \n                         drs.redo_queue_size*1.0 \/ NULLIF(drs.redo_rate,0))\n  FROM pri\n CROSS JOIN drs\n WHERE pri.group_id = drs.group_id\n   AND drs.is_local = 0 \n   AND drs.is_primary_replica = 0\n   AND pri.database_id = drs.database_id\n ORDER BY drs.ag,\n          [replica],\n          [database];<\/pre><\/div>\n\n\n\n<p>Results (with dates removed to squeeze in a little better):<\/p>\n\n\n\n<figure class=\"wp-block-image size-large\"><img loading=\"lazy\" decoding=\"async\" width=\"1024\" height=\"186\" src=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/02\/image-36-1024x186.png\" alt=\"\" class=\"wp-image-108371\" srcset=\"https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/02\/image-36-1024x186.png 1024w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/02\/image-36-300x54.png 300w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/02\/image-36-768x139.png 768w, https:\/\/www.red-gate.com\/simple-talk\/wp-content\/uploads\/2026\/02\/image-36.png 1125w\" sizes=\"auto, (max-width: 1024px) 100vw, 1024px\" \/><\/figure>\n\n\n\n<p>These won\u2019t combine to give you some magical stopwatch or go\/no go boolean but, if you monitor these leading up to the operation, they should give a good gauge about whether now is \u201cthe right time.\u201d The query could possibly benefit from simplification and there are likely other columns you\u2019ll want to look at:<code> last_sent_time<\/code>, <code>last_hardened_time<\/code>, <code>last_redone_time<\/code>, etc. The query above is what I used to determine the timing of my most recent planned failover.<\/p>\n\n\n\n<section id=\"my-first-block-block_3cc8874bbc6e968bdaf674c15ec7c180\" class=\"my-first-block alignwide\">\n    <div class=\"bg-brand-600 text-base-white py-5xl px-4xl rounded-sm bg-gradient-to-r from-brand-600 to-brand-500 red\">\n        <div class=\"gap-4xl items-start md:items-center flex flex-col md:flex-row justify-between\">\n            <div class=\"flex-1 col-span-10 lg:col-span-7\">\n                <h3 class=\"mt-0 font-display mb-2 text-display-sm\">Enjoying this article? Subscribe to the Simple Talk newsletter<\/h3>\n                <div class=\"child:last-of-type:mb-0\">\n                                            Get selected articles, event information, podcasts and other industry content delivered straight to your inbox.                                    <\/div>\n            <\/div>\n                                            <a href=\"https:\/\/www.red-gate.com\/simple-talk\/subscribe\/\" class=\"btn btn--secondary btn--lg\" aria-label=\"Subscribe now: Enjoying this article? Subscribe to the Simple Talk newsletter\">Subscribe now<\/a>\n                    <\/div>\n    <\/div>\n<\/section>\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-to-do-during-a-sql-server-failover\">What to do during a SQL Server failover<\/h2>\n\n\n\n<p>Once the failover begins, very little can influence the user experience. I always expect a brief connection interruption; anyone connected as the database changes state will need to re-establish their connection. With correct listener usage, retry logic, and connection strings that include <code>MultiSubnetFailover=True<\/code>, applications should recover quickly. <\/p>\n\n\n\n<p>Accelerated Database Recovery makes undo negligible, so redo is really the only part of the process that will be volatile. This is why anything that reduces redo volume directly shortens the recovery portion of the failover. The only other thing I can do here is to wait it out.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-happens-after-a-sql-server-failover\">What happens after a SQL Server failover?<\/h2>\n\n\n\n<p>A role change may finish quickly, but the new primary still starts cold. I described above some of the things I perform on the secondary, when possible, before it becomes primary (for example, warming the plan cache and buffer pool). When I can\u2019t do these on the secondary before the failover, I perform them immediately after it has taken over.<\/p>\n\n\n\n<p>If redo backlog is high at the time of failover, applications may see impact even after re-establishing connections \u2013 things like blocking, high I\/O and latch contention, and <code>WRITELOG<\/code> waits. It may also experience slower queries due to \u201ccold pages\u201d &#8211; pages that are not yet in the buffer pool or are still waiting on redo to complete. If I discover that redo backlog is high or other signs that storage isn\u2019t keeping up, now that we\u2019re in the cloud I can very easily scale the storage up (for example, bumping up provisioned IOPs).<\/p>\n\n\n\n<p>Aside from waiting, the only other things I can do to minimize impact to end users is to validate that everything is still configured the same (compare logins and jobs, re-enable any jobs I disabled before the failover, and re-test all linked servers). I also validate that monitoring and observability tools show green dashboards and recognize the new primary. While the databases might be happy, a failover that finishes fast will still <em>feel like<\/em> an outage if I leave anything else broken.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-failover-improvements-in-sql-server-2025\">Failover improvements in SQL Server 2025<\/h2>\n\n\n\n<p>While these alone may not drive an upgrade decision, there are some built-in failover improvements to look forward to in <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/sql-server\/what-s-new-in-sql-server-2025?view=sql-server-ver17\" target=\"_blank\" rel=\"noreferrer noopener\">SQL Server 2025<\/a> that do not require any configuration changes, settings, or trace flags, <a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/sql-server\/what-s-new-in-sql-server-2025?view=sql-server-ver17#availability-and-disaster-recovery\" target=\"_blank\" rel=\"noreferrer noopener\">as documented here<\/a>:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Improved redo performance and worker scheduling<br><\/li>\n\n\n\n<li>Faster Query Store initialization<br><\/li>\n\n\n\n<li>Lower ADR overhead<br><\/li>\n\n\n\n<li><a href=\"https:\/\/learn.microsoft.com\/en-us\/sql\/relational-databases\/performance\/query-store-for-secondary-replicas\" target=\"_blank\" rel=\"noreferrer noopener\">Query Store improvements on readable secondaries<\/a><\/li>\n<\/ul>\n<\/div>\n\n\n<p>These enhancements are designed to reduce volatility by making buffer pool and plan cache warm-up more effective and providing more predictable catch-up behavior. &nbsp;<\/p>\n\n\n\n<p>That said, SQL Server 2025 is more aggressive with getting databases online, so I expect the resource and concurrency concerns to be higher in certain environments.<\/p>\n\n\n\n<section id=\"faq\" class=\"faq-block my-5xl\">\n    <h2>FAQs: How to optimize planned availability group failovers in SQL Server<\/h2>\n\n                        <h3 class=\"mt-4xl\">1. What is a planned AG failover in SQL Server?<\/h3>\n            <div class=\"faq-answer\">\n                <p>A planned AG failover in SQL Server switches the primary and secondary replicas in an Availability Group for maintenance, upgrades, or patching without data loss, minimizing disruption to users.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">2. Why do SQL Server failovers sometimes take longer?<\/h3>\n            <div class=\"faq-answer\">\n                <p>SQL Server failover duration depends on redo volume, checkpoint behavior, open transactions, and secondary readiness. High redo or cold caches can extend downtime even for small databases.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">3. How can I speed up planned failovers?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"790\" data-end=\"815\">Key techniques for speeding up planned failovers in SQL Server include:<\/p>\n<ul data-start=\"816\" data-end=\"1092\">\n<li data-start=\"816\" data-end=\"860\">\n<p data-start=\"818\" data-end=\"860\">Indirect checkpoints to reduce redo.<\/p>\n<\/li>\n<li data-start=\"861\" data-end=\"906\">\n<p data-start=\"863\" data-end=\"906\">Manual checkpoints on busy databases.<\/p>\n<\/li>\n<li data-start=\"907\" data-end=\"971\">\n<p data-start=\"909\" data-end=\"971\">Disabling log backups and long-running jobs temporarily.<\/p>\n<\/li>\n<li data-start=\"972\" data-end=\"1032\">\n<p data-start=\"974\" data-end=\"1032\">Warming plan cache and buffer pool on the secondary.<\/p>\n<\/li>\n<li data-start=\"1033\" data-end=\"1092\">\n<p data-start=\"1035\" data-end=\"1092\">Validating server configs and jobs before failover.<\/p>\n<\/li>\n<\/ul>\n            <\/div>\n                    <h3 class=\"mt-4xl\">4. What are indirect checkpoints in SQL Server, and why are they important?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Indirect checkpoints in SQL Server provide smoother I\/O and faster recovery by reducing dirty pages, which shortens redo time during failovers.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">5. Should large SQL Server AGs be split into smaller groups?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Yes. Splitting large SQL Server Availability Groups reduces parallel recovery contention, making each failover faster and more predictable.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">6. Does disabling read-only routing or log backups help?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Yes. Disabling read-only routing and heavy jobs or backups temporarily reduces redo workload, allowing the SQL Server failover to complete more quickly.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">7. How does warming the secondary help failovers in SQL Server?<\/h3>\n            <div class=\"faq-answer\">\n                <p>Pre-loading plan cache and buffer pool prevents cold cache slowdowns and query compile storms, improving performance immediately after failover in SQL Server.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">8. What SQL Server 2025 features improve failovers?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"1997\" data-end=\"2020\">SQL Server 2025 adds:<\/p>\n<ul data-start=\"2021\" data-end=\"2187\">\n<li data-start=\"2021\" data-end=\"2062\">\n<p data-start=\"2023\" data-end=\"2062\">Faster redo and worker scheduling<\/p>\n<\/li>\n<li data-start=\"2063\" data-end=\"2095\">\n<p data-start=\"2065\" data-end=\"2095\">Quicker Query Store load<\/p>\n<\/li>\n<li data-start=\"2096\" data-end=\"2187\">\n<p data-start=\"2098\" data-end=\"2187\">Lower ADR overhead<br data-start=\"2120\" data-end=\"2123\" \/><br \/>\n<br \/>\nThese make failovers more predictable and less disruptive.<\/p>\n<\/li>\n<\/ul>\n            <\/div>\n                    <h3 class=\"mt-4xl\">9. How can I estimate the duration of a SQL Server failover?<\/h3>\n            <div class=\"faq-answer\">\n                <p data-start=\"2241\" data-end=\"2381\">Check redo\/send queue sizes, redo rate, and active transactions on the primary to predict SQL Server failover timing and pick the optimal window.<\/p>\n            <\/div>\n                    <h3 class=\"mt-4xl\">10. What should I do after a SQL Server failover?<\/h3>\n            <div class=\"faq-answer\">\n                <ul>\n<li data-start=\"2433\" data-end=\"2482\">\n<p data-start=\"2435\" data-end=\"2482\">Validate logins, linked servers, and jobs<\/p>\n<\/li>\n<li data-start=\"2483\" data-end=\"2517\">\n<p data-start=\"2485\" data-end=\"2517\">Re-enable backups and jobs<\/p>\n<\/li>\n<li data-start=\"2518\" data-end=\"2567\">\n<p data-start=\"2520\" data-end=\"2567\">Warm plan cache and buffer pool if needed<\/p>\n<\/li>\n<li data-start=\"2568\" data-end=\"2626\">\n<p data-start=\"2570\" data-end=\"2626\">Check monitoring dashboards for new primary status<\/p>\n<\/li>\n<\/ul>\n            <\/div>\n            <\/section>\n","protected":false},"excerpt":{"rendered":"<p>Learn how to perform fast and predictable SQL Server planned Availability Group (AG) failovers. Discover tips to reduce downtime, optimize redo, warm secondary caches, and improve end-user experience during maintenance and upgrades.&hellip;<\/p>\n","protected":false},"author":341115,"featured_media":108376,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[143523,53,143524],"tags":[4168,4170,4150,4151],"coauthors":[158980],"class_list":["post-108370","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-databases","category-featured","category-sql-server","tag-database","tag-database-administration","tag-sql","tag-sql-server"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108370","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/341115"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=108370"}],"version-history":[{"count":7,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108370\/revisions"}],"predecessor-version":[{"id":108441,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108370\/revisions\/108441"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/108376"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=108370"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=108370"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=108370"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=108370"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}