When AWS stumbled – twice – in October 2025, many teams discovered that “we are in the cloud” is not the same as “we have disaster recovery”.
Applications went offline, customer-facing portals returned errors, and internal dashboards that teams rely on every morning failed to load.
Most of those systems were already running on managed cloud services. They had multi-AZ databases, auto scaling groups, and health checks. What they did not have was a clear answer to three simple questions:
- How much data can we afford to lose?
- How long can we be down?
- Where do we run if this region doesn’t come back soon?
That gap between infrastructure and intent is where outages turn into business incidents. I see this pattern often when I talk to engineering and operations teams. The conversation usually goes like this:
“We are on <insert your favorite cloud provider>. Everything is on managed services, so we are covered for DR.”
Cloud is a platform and disaster recovery is a responsibility. Managed services help, but they do not own your RTO (recovery time objective) and RPO (recovery point objective) – you do.
Why the cloud is an environment – not a plan
It helps to separate two ideas that often get blurred:
- The cloud is a set of capabilities: regions, availability zones, ease of deployment, snapshots, object storage, APIs, and automation.
- Disaster recovery is a set of decisions: objectives (RTO/RPO), topologies, runbooks, owners, and regular drills.
You can be “100% in the cloud” and still have:
- A single-region database with no tested cross-region copy.
- Backups that no one has tried to restore in the last year.
- All critical services (identity, DNS, messaging, and even the backup catalog or backup server) tied to that same region.
- No shared understanding of what “acceptable downtime” or “acceptable loss of data” actually means for the business.
All of these are examples of treating the platform as if it were the strategy. From the outside, it looks modern and robust. Under stress, however, it behaves like a traditional single-data-center setup – just with different logos on the status page.
Why high availability is not the same as disaster recovery
Managed databases and orchestrated clusters make it much easier to keep instances predictable. Multi-AZ deployments, auto-healing, and synchronous replication are valuable. They reduce the impact of hardware failures and local issues.
Managed services improve availability, but do not design your recovery
They handle patching, failover within a region, and some backups, but do not:
- Define your RTO/RPO
- Decide cross-region or cross-account replicas
- Test restores or run DR drills
- Coordinate application failover and dependencies
They do not solve for:
- Control-plane or networking problems that affect the whole region.
- A bad deployment that corrupts data and replicates that corruption instantly.
- A compromised account where an attacker drops tables or changes configuration.
- A human error that runs a destructive command on the primary.
In all of these situations, “my managed database is multi-AZ” gives you very little comfort. You still need a known-good copy in another fault domain, a way to promote it, and a set of steps that people can execute under pressure.
Why replication is not the answer
Cross-region replication does not solve these problems either. Replication replays every change, including the bad ones: a wrong DELETE, a buggy migration, or corrupted data from an application bug will be copied to every replica as fast and as reliably as good data. That is why your real last line of defense is not “more replicas”; it is backups and tested restore procedures. Only a backup taken before the damage, and a rehearsed way to bring it back online, can protect you from this class of failure.
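To make that failure mode concrete, here is a deliberately tiny sketch, plain Python with dicts standing in for tables and no real database involved, that models a primary, a replica replaying every change, and a snapshot backup taken before a bad DELETE:

```python
# Toy model: replication faithfully copies mistakes; a backup taken
# before the damage does not.

def apply_change(change, table):
    """Apply a single change ('put' or 'delete') to a table."""
    op, key, value = change
    if op == "put":
        table[key] = value
    elif op == "delete":
        table.pop(key, None)

primary = {}
replica = {}

def write(change):
    apply_change(change, primary)
    apply_change(change, replica)  # replication replays the change immediately

write(("put", "order-1", "paid"))
write(("put", "order-2", "pending"))

backup = dict(primary)  # snapshot taken BEFORE the mistake

# A buggy migration deletes a row it should not have touched...
write(("delete", "order-1", None))

# ...and replication has faithfully copied the damage to the replica:
assert "order-1" not in replica

# Only the backup still holds the known-good state:
restored = dict(backup)
assert restored["order-1"] == "paid"
print("replica rows:", sorted(replica))
print("restored rows:", sorted(restored))
```

The point of the toy is not the data structure but the asymmetry: the replica is exactly as wrong as the primary, while the snapshot taken before the mistake is the only copy that can bring “order-1” back.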
Availability keeps the lights on when small things go wrong. Disaster recovery is how you handle the day when something big does. In other words: availability is preemptive in nature; disaster recovery is reactive.
Important: replication protects you from infrastructure failures; disaster recovery protects you from your own data mistakes. Logical corruption and bad writes are faithfully replicated across all nodes, so backups and restore drills are what protect you from those.
What does a real disaster recovery strategy look like?
A proper DR strategy is surprisingly straightforward on paper. The difficulty is implementing it in practice and running repeated drills to ensure it works.
It starts with business objectives rather than tools. For each critical system, you sit down with the people who own the outcome and agree on two numbers:
- RTO (Recovery Time Objective): How long can this system be down?
- RPO (Recovery Point Objective): How much data, in time, can we afford to lose?
In practice, that RPO number is enforced by how you handle backups and transaction logs. For PostgreSQL, it comes down to how often you take base backups, how frequently you archive WAL (the write-ahead log), and how reliably you can restore to a specific point in time. If you claim a 15-minute RPO but your backups and WAL archiving only support restoring to within an hour, your real RPO is an hour, no matter what the slide deck says.
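As a rough illustration (the archive path and interval here are hypothetical, and in production `archive_command` would ship segments off-host rather than `cp` them locally), the PostgreSQL settings that bound your real RPO look like this in `postgresql.conf`, paired with a regular `pg_basebackup` schedule:

```ini
# postgresql.conf -- continuous WAL archiving (illustrative values)
wal_level = replica                           # enough WAL detail for PITR
archive_mode = on
archive_command = 'cp %p /mnt/wal_archive/%f' # ship each completed WAL segment to the archive
archive_timeout = 300                         # force a segment switch at least every 5 minutes,
                                              # so at most ~5 minutes of WAL can be lost

# On the restore side, point-in-time recovery uses settings such as:
# restore_command = 'cp /mnt/wal_archive/%f %p'
# recovery_target_time = '2025-10-20 03:15:00 UTC'  # stop replay just before the damage
```

The worst-case gap between the last archived WAL segment and the failure is what your RPO actually is, whatever the slide deck claims, which is why `archive_timeout` (or streaming the WAL continuously) matters as much as the base backup cadence.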
Why these numbers rarely match across workloads
A reporting database might tolerate a few hours of downtime and some data loss. However, a payments ledger probably cannot.
With those numbers in hand, you can design:
- Whether you need a warm standby in another region or another provider.
- How you will move traffic there (DNS, load balancers, application configuration).
- How backups flow: which region they land in, which account owns them, and how long they are retained.
- What the runbook looks like when someone says, “We are invoking DR now”.
The tools you pick, whether a managed offering such as Amazon RDS or Azure Database for PostgreSQL Flexible Server, self-managed PostgreSQL with Patroni, Kubernetes operators, or third-party backup software, are an implementation detail. Strategy is independent of brand names.
The uncomfortable question: when did you last restore?
The surest way to expose the gap between “we have backups” and “we have disaster recovery” is to ask one question: “When did we last perform a full restore and cut a real application over to it?”
Not a theoretical walkthrough. Not a developer restoring a subset of data on their laptop. A timed, documented exercise that goes from “assume region A has failed” to “users are now served from region B”.
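A drill of this kind is easy to instrument. Below is a minimal timing harness sketch; the restore step is a stand-in callable (in a real drill it would restore the base backup, replay WAL, repoint the application, and run a smoke test), and the RTO value is whatever you agreed with the business:

```python
import time

def run_drill(restore, rto_seconds):
    """Time a restore procedure and report whether it meets the stated RTO.

    `restore` is any callable that performs the recovery steps and
    returns True once the application is serving users again.
    """
    start = time.monotonic()
    ok = restore()
    elapsed = time.monotonic() - start
    return {
        "restored": ok,
        "elapsed_seconds": round(elapsed, 1),
        "meets_rto": ok and elapsed <= rto_seconds,
    }

# Stand-in restore for illustration only: sleeps briefly instead of
# actually restoring a backup and cutting traffic over.
def fake_restore():
    time.sleep(0.2)
    return True

result = run_drill(fake_restore, rto_seconds=3600)
print(result)
```

The useful output is not the boolean but the measured `elapsed_seconds`: writing that number down after each drill, next to the RTO on the slide, is what turns “we think we can recover in an hour” into a tested claim.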
When teams run this exercise for the first time, a few things often appear:
- Restores take longer than expected, pushing the real RTO far beyond the number written in slides.
- Application configuration is hard-coded to a single region or endpoint.
- Some dependencies (identity provider, message broker, payment gateway integration, and so on) were never included in the DR thinking.
- Ownership is fuzzy: it is not clear who can make the call to fail over, or who coordinates the transition.
None of this is a criticism of the teams. This is a common pattern: it simply happens when we assume the cloud will take care of everything and never rehearse the opposite.
How to use the cloud properly for disaster recovery
The irony is that cloud platforms are excellent foundations for disaster recovery when used intentionally. You can:
- Spin up parallel environments in another region using infrastructure-as-code.
- Create cross-region replicas for databases and storage with a few configuration changes.
- Store backups in a separate account and region, reducing the blast radius of a compromise.
- Use central logging and observability to monitor both primary and DR sites with the same tooling.
The important shift is mental: instead of saying “we are on <insert your favorite cloud provider>; everything is on managed services, so we are covered for DR”, you say “we use <insert your favorite cloud provider> to implement our DR strategy, which looks like this”.
That strategy has names, diagrams, and runbooks. It is reviewed when systems change and, a few times a year, someone actively tests it by pushing the buttons and measuring what happens.
Bringing it back to your own systems
If you want a quick sense check of where you stand today, you can do a simple exercise with your team:
- Pick one system that really matters to the business.
- Write down its RTO and RPO in plain language.
- Draw the current architecture on a single page, including regions, accounts, databases, storage, and key dependencies.
- On that diagram, mark where replicas live and where backups and WAL actually land (list out region, account, and service).
- Next to your RPO, write down which backups and WAL streams you would use to meet it, and how you would restore them.
- Describe, step by step, what you would do if the primary region were unavailable for 12 hours or if you discovered that the data in that system was corrupted.
If any of those steps are vague or rely on “the managed service will sort it out” or “the replica will save us” instead of a clear, tested restore path, you’ve just found the places where cloud and disaster recovery have been quietly conflated. Should you discover gaps while mapping your DR, write them down, as they are the starting point of a real strategy.
Summary and next steps
Cloud is a powerful platform. Disaster recovery is a promise you make to the business about how much it will hurt when things go wrong: how long systems can be down and how much data can be lost. You keep that promise with architecture: replicas, backups and WAL archiving, cross-region copies, and rehearsed runbooks. Treat them as two separate things, and then deliberately combine them.