{"id":108963,"date":"2026-03-13T14:03:00","date_gmt":"2026-03-13T14:03:00","guid":{"rendered":"https:\/\/www.red-gate.com\/simple-talk\/?p=108963"},"modified":"2026-03-05T12:38:54","modified_gmt":"2026-03-05T12:38:54","slug":"why-the-cloud-is-not-a-disaster-recovery-strategy-for-your-critical-databases","status":"publish","type":"post","link":"https:\/\/www.red-gate.com\/simple-talk\/cloud\/why-the-cloud-is-not-a-disaster-recovery-strategy-for-your-critical-databases\/","title":{"rendered":"Why the cloud is not a disaster recovery strategy for your critical databases\u00a0"},"content":{"rendered":"\n<p>When <a href=\"https:\/\/www.bbc.co.uk\/news\/articles\/cev1en9077ro\" target=\"_blank\" rel=\"noreferrer noopener\">AWS stumbled &#8211; twice &#8211; in October 2025<\/a>, many teams discovered that \u201cwe are in the cloud\u201d is not the same as \u201cwe have <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/disaster-recovery-planning-for-data-the-cribsheet\/\" target=\"_blank\" rel=\"noreferrer noopener\">disaster recovery<\/a>\u201d.<\/p>\n\n\n\n<p>Applications went offline, customer-facing portals returned errors, and internal dashboards that teams rely on every morning failed to load.<\/p>\n\n\n\n<p>Most of those systems were already running on managed cloud services. They had multi-AZ databases, auto scaling groups, and health checks. What they did <em>not<\/em> have was a clear answer to three simple questions:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>How much data can we afford to lose?<br><br><\/li>\n\n\n\n<li>How long can we be down?<br><br><\/li>\n\n\n\n<li>Where do we run if this region doesn&#8217;t come back soon?<\/li>\n<\/ul>\n<\/div>\n\n\n<p>That gap between <em>infrastructure<\/em> and <em>intent<\/em> is where outages turn into business incidents. I see this pattern often when I talk to engineering and operations teams. 
The conversation usually goes like this:<\/p>\n\n\n\n<p>\u201cWe are on &lt;insert your favorite cloud provider&gt;. Everything is on managed services, so we are covered for DR.\u201d<\/p>\n\n\n\n<p>Cloud is a platform and disaster recovery is a responsibility. Managed services help, but they do not own your RTO (recovery time objective) and RPO (recovery point objective) &#8211; you do.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-the-cloud-is-an-environment-not-a-plan\">Why the cloud is an environment &#8211; not a plan<\/h2>\n\n\n\n<p>It helps to separate two ideas that often get blurred:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>The cloud<\/strong> is a set of capabilities: regions, availability zones, ease of deployment, snapshots, object storage, APIs, and automation.<br><br><\/li>\n\n\n\n<li><strong>Disaster recovery<\/strong> is a set of decisions: objectives (RTO\/RPO), topologies, runbooks, owners, and regular drills.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>You can be \u201c100% in the cloud\u201d and still have:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>A single-region database with no tested cross-region copy.<br><br><\/li>\n\n\n\n<li>Backups that no one has tried to restore in the last year.<br><br><\/li>\n\n\n\n<li>All critical services, such as identity, DNS, messaging, and even your <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/developing-a-backup-plan\/\" target=\"_blank\" rel=\"noreferrer noopener\">backup catalog or backup server<\/a>, are tied to that same region.<br><br><\/li>\n\n\n\n<li>No shared understanding of what \u201cacceptable downtime\u201d or \u201cacceptable loss of data\u201d actually means for the business.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>All of these are examples of treating the platform as if it were the strategy. From the outside, it looks modern and robust. 
Under stress, however, it behaves like a traditional single-data-center setup &#8211; just with different logos on the status page.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-why-high-availability-is-not-the-same-as-disaster-recovery\">Why high availability is not the same as disaster recovery<\/h2>\n\n\n\n<p>Managed databases and orchestrated clusters make it much easier to keep instances predictable. <a href=\"https:\/\/aws.amazon.com\/rds\/features\/multi-az\/\" target=\"_blank\" rel=\"noreferrer noopener\">Multi-AZ deployments<\/a>, auto-healing, and <a href=\"https:\/\/www.techtarget.com\/searchdisasterrecovery\/definition\/synchronous-replication\" target=\"_blank\" rel=\"noreferrer noopener\">synchronous replication<\/a> are valuable. They reduce the impact of hardware failures and local issues.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-managed-services-improve-availability-but-do-not-design-your-recovery\">Managed services improve availability, but do <em>not<\/em> design your recovery<\/h3>\n\n\n\n<p id=\"h-managed-services-improve-availability-however-they-do-not-design-your-recovery-they-handle-patching-failover-within-a-region-and-some-backups-they-do-not\">They handle patching, failover within a region, and some backups, but do <em>not<\/em>:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Define your RTO\/RPO<br><br><\/li>\n\n\n\n<li>Decide on cross-region or cross-account replicas<br><br><\/li>\n\n\n\n<li>Test restores or run DR drills<br><br><\/li>\n\n\n\n<li>Coordinate application failover and dependencies<br><\/li>\n<\/ul>\n<\/div>\n\n\n<p>They do <em>not<\/em> solve for:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Control-plane or networking problems that affect the whole region.<br><br><\/li>\n\n\n\n<li>A bad deployment that corrupts data and replicates that corruption instantly.<br><br><\/li>\n\n\n\n<li>A compromised account where an attacker drops tables or changes 
configuration.<br><br><\/li>\n\n\n\n<li>A human error that runs a destructive command on the primary.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>In all of these situations, \u201cmy managed database is multi-AZ\u201d gives you very little comfort. You still need a known-good copy in another fault domain, a way to promote it, and a set of steps that people can execute under pressure.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-why-replication-is-not-the-answer\">Why replication is not the answer<\/h3>\n\n\n\n<p>Even <a href=\"https:\/\/docs.aws.amazon.com\/AmazonS3\/latest\/userguide\/replication.html\" target=\"_blank\" rel=\"noreferrer noopener\">cross-region replication<\/a> doesn\u2019t solve this. Replication replays every change, including the bad ones: a wrong <code>DELETE<\/code>, a buggy migration, or corrupted data from an application bug will be copied to every replica as fast and reliably as good data. That\u2019s why your real last line of defense is not \u201cmore replicas\u201d &#8211; it is <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/sql-server\/database-administration-sql-server\/dba-in-training-backups-sla-and-restore-strategies\/\" target=\"_blank\" rel=\"noreferrer noopener\">backups and tested restore procedures<\/a>. Only a backup taken before the damage, and a rehearsed way to bring it back online, can protect you from this class of failure.<\/p>\n\n\n\n<p>Availability keeps the lights on when small things go wrong. Disaster recovery is how you handle the day when something big does. In other words: availability is preemptive in nature, while disaster recovery is reactive.<\/p>\n\n\n\n<p><strong>Important:<\/strong> replication protects you from <em>infrastructure<\/em> failures, and disaster recovery protects you from your own data mistakes. Logical corruption and bad writes are faithfully replicated across all nodes. 
Backups and restore drills are what protect you from those.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-what-does-a-real-disaster-recovery-strategy-look-like\">What does a real disaster recovery strategy look like?<\/h2>\n\n\n\n<p>A proper DR strategy is surprisingly straightforward on paper. The difficulty is implementing it in practice and running repeated drills to ensure it works.<\/p>\n\n\n\n<p>It starts with business objectives rather than tools. 
For each critical system, you sit down with the people who own the outcome and agree on two numbers:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li><strong>RTO (Recovery Time Objective)<\/strong>: How long can this system be down?<br><br><\/li>\n\n\n\n<li><strong>RPO (Recovery Point Objective)<\/strong>: How much data, in time, can we afford to lose?<\/li>\n<\/ul>\n<\/div>\n\n\n<p>In practice, that RPO number is enforced by how you handle backups and <a href=\"https:\/\/www.dremio.com\/wiki\/transaction-log\/\" target=\"_blank\" rel=\"noreferrer noopener\">transaction logs<\/a>. For <a href=\"https:\/\/www.red-gate.com\/simple-talk\/databases\/postgresql\/\" target=\"_blank\" rel=\"noreferrer noopener\">PostgreSQL<\/a>, it comes down to how often you take base backups, how frequently you archive the WAL (write-ahead log), and how reliably you can restore to a specific point in time. If you claim a 15-minute RPO but your backups and WAL archiving only support restoring to within an hour, your real RPO is an hour, no matter what the slide deck says.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\" id=\"h-why-these-numbers-rarely-match-across-workloads\">Why these numbers rarely match across workloads<\/h3>\n\n\n\n<p>A reporting database might tolerate a few hours of downtime and some data loss. 
However, a payments ledger probably cannot.<\/p>\n\n\n\n<p>With those numbers in hand, you can design:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Whether you need a warm standby in another region or another provider.<br><br><\/li>\n\n\n\n<li>How you will move traffic there (DNS, load balancers, application configuration).<br><br><\/li>\n\n\n\n<li>How backups flow: which region they land in, which account owns them, and how long they are retained.<br><br><\/li>\n\n\n\n<li>What the runbook looks like when someone says, \u201cWe are invoking DR now\u201d.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The tools you pick, whether <a href=\"https:\/\/docs.aws.amazon.com\/AmazonRDS\/latest\/UserGuide\/Welcome.html\" target=\"_blank\" rel=\"noreferrer noopener\">RDS<\/a> or FlexiServer, self-managed PostgreSQL with <a href=\"https:\/\/patroni.readthedocs.io\/en\/latest\/#\" target=\"_blank\" rel=\"noreferrer noopener\">Patroni<\/a>, <a href=\"https:\/\/www.red-gate.com\/simple-talk\/devops\/containers-and-virtualization\/kubernetes-for-complete-beginners\/\" target=\"_blank\" rel=\"noreferrer noopener\">Kubernetes<\/a> operators, or third-party backup software, are an implementation detail. Strategy is independent of brand names.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-the-uncomfortable-question-when-did-you-last-restore\">The uncomfortable question: when did you last restore?<\/h2>\n\n\n\n<p>The surest way to expose the gap between \u201cwe have backups\u201d and \u201cwe have disaster recovery\u201d is to ask one question: \u201cwhen did we last perform a full restore and cut a real application over to it?\u201d<\/p>\n\n\n\n<p>Not a theoretical walkthrough. Not a developer restoring a subset of data on their laptop. 
A timed, documented exercise that goes from \u201cassume region A has failed\u201d to \u201cusers are now served from region B\u201d.<\/p>\n\n\n\n<p>When teams run this exercise for the first time, a few problems often surface:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Restores take longer than expected, pushing the real RTO far beyond the number written in the slides.<br><br><\/li>\n\n\n\n<li>Application configuration is hard-coded to a single region or endpoint.<br><br><\/li>\n\n\n\n<li>Some dependencies (identity provider, message broker, payment gateway integration, etc.) were never included in the DR thinking.<br><br><\/li>\n\n\n\n<li>Ownership is fuzzy: it is not clear who can make the call to fail over, or who coordinates the transition.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>None of this is a criticism of the teams. This is a common pattern: it simply happens when we assume the cloud will take care of everything and never rehearse the opposite.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-how-to-use-the-cloud-properly-for-disaster-recovery\">How to use the cloud properly for disaster recovery<\/h2>\n\n\n\n<p>The irony is that cloud platforms are excellent foundations for disaster recovery when used intentionally. You can:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Spin up parallel environments in another region using infrastructure-as-code.<br><br><\/li>\n\n\n\n<li>Create cross-region replicas for databases and storage with a few configuration changes.<br><br><\/li>\n\n\n\n<li>Store backups in a separate account and region, reducing the blast radius of a compromise.<br><br><\/li>\n\n\n\n<li>Use central logging and observability to monitor both primary and DR sites with the same tooling.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>The important shift is mental: instead of saying \u201cwe are on &lt;insert your favorite cloud provider&gt;. Everything is on managed services. 
So we are covered for DR.\u201d, you say \u201cwe use &lt;insert your favorite cloud provider&gt; to implement our DR strategy, which looks like this\u201d.<\/p>\n\n\n\n<p>That strategy has names, diagrams, and runbooks. It is reviewed when systems change and, a few times a year, someone actively tests it by pushing the buttons and measuring what happens.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-bringing-it-back-to-your-own-systems\">Bringing it back to your own systems<\/h2>\n\n\n\n<p>If you want a quick sense check of where you stand today, you can do a simple exercise with your team:<\/p>\n\n\n<div class=\"block-core-list\">\n<ul class=\"wp-block-list\">\n<li>Pick one system that really matters to the business.<br><br><\/li>\n\n\n\n<li>Write down its RTO and RPO in plain language.<br><br><\/li>\n\n\n\n<li>Draw the current architecture on a single page, including regions, accounts, databases, storage, and key dependencies.<br><br><\/li>\n\n\n\n<li>On that diagram, mark where replicas live and where backups and WAL actually land (list out region, account, and service).<br><br><\/li>\n\n\n\n<li>Next to your RPO, write down which backups and WAL streams you would use to meet it, and how you would restore them.<br><br><\/li>\n\n\n\n<li>Describe, step by step, what you would do if the primary region were unavailable for 12 hours or if you discovered that the data in that system was corrupted.<\/li>\n<\/ul>\n<\/div>\n\n\n<p>If any of those steps are vague or rely on \u201cthe managed service will sort it out\u201d or \u201cthe replica will save us\u201d instead of a clear, tested restore path, you\u2019ve just found the places where cloud and disaster recovery have been quietly conflated. Should you discover gaps while mapping your DR, write them down, as they are the starting point of a real strategy.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\" id=\"h-summary-and-next-steps\">Summary and next steps<\/h2>\n\n\n\n<p>Cloud is a powerful platform. 
Disaster Recovery is a promise you make to the business about how much it will hurt when things go wrong: how long systems can be down and how much data can be lost. You keep that promise with architecture: replicas, backups and WAL archiving, cross-region copies, and rehearsed runbooks. Treat them as two separate things, and then deliberately combine them.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Cloud isn\u2019t a disaster recovery plan. Learn why multi-AZ and managed services aren\u2019t enough, and how RTO, RPO, backups, and tested restores define real DR.&hellip;<\/p>\n","protected":false},"author":346719,"featured_media":105342,"comment_status":"open","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[10,53,46],"tags":[5336,4168,158978,5765],"coauthors":[159376],"class_list":["post-108963","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-cloud","category-featured","category-data-security-privacy-compliance","tag-cloud","tag-database","tag-postgresql","tag-security-and-compliance"],"acf":[],"_links":{"self":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108963","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/users\/346719"}],"replies":[{"embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/comments?post=108963"}],"version-history":[{"count":5,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108963\/revisions"}],"predecessor-version":[{"id":108972,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/posts\/108963\/revisions\/108972"}],"wp:featuredmedia":[{"embeddable":true,"hre
f":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media\/105342"}],"wp:attachment":[{"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/media?parent=108963"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/categories?post=108963"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/tags?post=108963"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/www.red-gate.com\/simple-talk\/wp-json\/wp\/v2\/coauthors?post=108963"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}