Zero Downtime Database Deployments are a Lie
Many people assume that building proficiency at database development and operations leads to the ability to attain "zero downtime deployments." In this post I share why "zero downtime" is a problematic goal, and what a better approach looks like.
Over the course of my career, I’ve often been asked to help people attain “zero downtime” for relational database deployments.
However, the more I work with databases and with people, the more I am convinced not only that zero downtime deployments are unattainable without some serious skewing of the meaning of “zero,” but also that “zero downtime” is not a helpful goal to chase at all.
Yes, you should deploy while the system is live
I want to start out by making it clear that I’m NOT suggesting you only do database deployments during planned or scheduled downtimes.
On the contrary, I am a fan of frequently deploying changes to databases while the system is live for users. With modern applications that serve users globally around the clock, every day of the week, this is simply a requirement to deliver value to customers at a good tempo.
I also believe that it’s very desirable to make deployments as boring as possible, both for your team and for your users. Ideally, nobody notices a deployment.
The goal of frequent, boring deployments is not the same as a goal of “zero downtime” deployments, for some very human reasons:
1. Failure is inevitable
Failures will occur. Your goal is to identify failures as early as possible, and to reduce the stress of failures as much as possible for your team and your customers. You may still want to count those incidents as contributing “downtime,” however.
2. You do not benefit from demonizing “downtime”
The goal of “frequent boring deployments” is, in a funny way, a positive goal. It’s a goal to be able to do something, and to become better and better at it.
The goal of “zero downtime,” by contrast, is an anti-goal to eliminate something. Humans are political animals, and this type of goal easily encourages us to play games to avoid coming into contact with the dreaded, blame-attached thing called “downtime.”
Reality check: some changes require downtime
There are some technical reasons why zero downtime is impossible to attain. Some of these reasons are related to concurrency and data access: for certain types of changes, there are limits to what you can do to the structure of the database while reads and writes are being performed against it.
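As a minimal illustration, here’s the kind of single-step change that’s hard to make safely on a busy system. The SQL is Postgres-flavored, the table and column names are hypothetical, and exact locking behavior varies by platform and version:

```sql
-- Illustrative only: changing a column's type in place typically takes an
-- exclusive lock and may rewrite the whole table, blocking reads and writes
-- for the duration. Table and column names are hypothetical.
ALTER TABLE orders
    ALTER COLUMN order_total TYPE numeric(19, 4);  -- was numeric(10, 2)
```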
We can certainly work within these constraints. The most popular way to work towards “boring” deployments is to use an “expand / contract” model of changes, also known as a “parallel change” model.
In the “expand/contract” model, we use three phases:
- Phase 1 – Expand: Introduce changes to existing structures in a backwards compatible fashion.
- Phase 2 – Transition: Modify existing code to use the new structures. This is often combined with feature toggles in applications with the idea that if anything goes wrong, the feature toggle may be easily disabled by a business user.
- Phase 3 – Contract: Remove / clean up older structures.
The “expand/contract” model is an old friend of mine: I got to know this pattern well before I ever knew it had a name. Teams I have worked on have found great success in embracing this model. (Note: for an engineer’s take on implementing this well, I highly recommend Michael J Swart’s series of articles. While the series focuses on the Microsoft Data Platform, many of the points can be adapted for other database platforms.)
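To make the phases concrete, here is a minimal sketch of the same kind of type change done the expand/contract way, again in Postgres-flavored SQL. The table, columns, and batch size are illustrative, not a prescription:

```sql
-- A minimal expand/contract sketch, split into backwards compatible steps.

-- Phase 1 – Expand: add the new column alongside the old one.
-- Adding a nullable column with no default is typically a quick metadata change.
ALTER TABLE orders ADD COLUMN order_total_v2 numeric(19, 4);

-- Backfill in small batches so no single statement holds locks for long.
-- Repeat until no rows remain; assumes order_total is NOT NULL.
UPDATE orders
SET    order_total_v2 = order_total
WHERE  id IN (
    SELECT id
    FROM   orders
    WHERE  order_total_v2 IS NULL
    LIMIT  1000
);

-- Phase 2 – Transition: deploy application code that writes both columns and
-- reads order_total_v2, ideally behind a feature toggle that can be flipped back.

-- Phase 3 – Contract: once nothing reads the old column, remove it.
ALTER TABLE orders DROP COLUMN order_total;
```

The batched backfill is the heart of the trick: the old and new structures coexist for a while, and no single step blocks readers or writers long enough for customers to notice.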
However, teams I worked on consistently found that some changes require downtime, because…
1. Downtime sometimes makes sense, even in organizations obsessed with customer experience
At times the risk of downtime is more acceptable to the business than the cost of reworking the change into smaller steps with transitions. This is really a cost-benefit analysis, and discussions in this area tend to be more productive when the concept of “downtime” is not heavily politicized.
As I’ll cover in a bit, the word “downtime” itself lacks nuance, and some types of “downtime” may be acceptable, or the customer experience of the downtime may not be problematic.
2. The real world is unpredictable: you don’t get to plan every change in advance
I don’t think Samuel Beckett was thinking of database operations when he wrote, “Try again, fail again, fail better,” but I’ve long felt that this is the perfect quote to describe what I know about working with databases. Failure is inevitable, but we can continuously learn to fail better.
Another way to say this is that databases are interesting systems. No matter how clever teams are at planning deployments, production environments are not fully predictable:
- Changes in data sizes and distribution may dramatically impact performance and even availability of databases
- Users of applications may use features in ways that weren’t expected, thereby skewing database workloads in unanticipated ways and causing resource or concurrency problems
- Data replication technologies used for scaling out workloads, high availability, and disaster recovery add significant complexity, both to managing performance and to ensuring that users read consistent (aka “correct”) data and that user writes are successful and durable
This unpredictability means that some deployments will need to be planned and executed quickly. In these situations, you don’t have time for anyone to argue about what “downtime” means. If “downtime” is already occurring it doesn’t help if that word causes panic within the hearts of your team.
Instead, we want people to already understand which aspects of the customer experience are the most critical, and to also know or be able to quickly identify how to prioritize the effort to restore customer experience.
Many attain “zero downtime” by redefining “zero” or “downtime.” This doesn’t do you any favors in the long run.
I believe that we still have discussions about “zero downtime” for database deployments simply because many people have diluted the phrase into something closer to “zero downtime… for EXTREMELY peculiar values of downtime.”
This came about because downtime is hard to define when you look at it closely:
- If users can read, but not write, does that count as downtime?
- If applications are up and performing verrrrryyyyy slowly, but not failing, is that downtime?
- If the downtime occurred when the system was minimally used, does that count as downtime?
- If less than N% of customers were impacted, does that count as downtime?
- If [Critical Features] are performing fine and [Non-Critical features] are impacted, does that count as downtime?
- If we plan downtime windows in advance, can we claim to be “zero downtime” because we now say that downtime implicitly means “unplanned downtime”?
These are all great questions. It’s worth discussing all of them, defining classes of service, and defining what the requirements for customer experience actually are.
However, I don’t think we do our teams or our customers any favors by answering these questions and tying them all to the tired, overburdened term, “zero downtime,” and then having to train every new customer or new hire on our super-boutique, crafted meaning of “zero downtime.”
It’s much more productive to find a new language, and better goals.
Instead of chasing “zero downtime,” what’s the best approach?
There are three parts to a better approach.
Solution 1. Define your Service Level Agreements clearly
I suspect that some of you are shaking your heads in frustration and thinking that your management chain demands a certain number of 9’s for uptime, and you can’t change that.
I get it. For a variety of reasons, we often need to:
- Simplify availability metrics down to a simple number (such as a “number of 9’s” of uptime)
- Share this number with organizational stakeholders and external customers
To do this successfully, you will need to work out the answers to a load of questions like those I listed above, so that you can clearly determine what counts as “uptime” and what does not.
You are well served by making SLA commitments visible to the team, along with how the team is doing at meeting those commitments over a variety of time periods.
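As one concrete illustration, here’s a rough sketch (Postgres-flavored SQL) of how a team might report monthly uptime against an SLA once it has agreed on what counts as downtime. The incidents table, its columns, and the fixed 30-day month are all simplifying assumptions:

```sql
-- Sketch only: monthly uptime % from a hypothetical incident log.
-- Assumes the team has already decided which incidents count as downtime
-- (counts_as_downtime) and uses a simplified 30-day month as the denominator.
SELECT
    date_trunc('month', started_at)                         AS month,
    100.0 * (1 - SUM(EXTRACT(EPOCH FROM (resolved_at - started_at)))
                 / (30 * 24 * 60 * 60.0))                   AS uptime_pct
FROM incidents
WHERE counts_as_downtime
GROUP BY date_trunc('month', started_at)
ORDER BY month;
```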
But note: all of this can be done while still focusing the team on positive goals that are more easily conceptualized and remembered, rather than on a goal framed against a nebulous concept which has been defined in a very specific way with a lot of hard-to-remember details.
Solution 2. Focus on global outcomes that map to both speed and stability
The 2019 Accelerate State of DevOps Report recommends tracking not only Availability, but also four key metrics. The first two of these metrics relate to deployment tempo, and the second two relate to production stability:
- Deployment Frequency – How often do you deploy changes to end users / production?
- Lead Time – How long does it take to go from code complete to successfully running in production?
- Time to Restore Service – How long does it take to restore service when an unplanned incident occurs?
- Change Failure Rate – What percentage of changes released to production result in degraded service or a flaw that requires remediation?
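To make the last two of these concrete, here’s a hedged sketch (Postgres-flavored SQL, with a hypothetical deployments log) of how deployment frequency and change failure rate might be reported; lead time and time to restore service would come from similar timestamps:

```sql
-- Sketch only: deployment frequency and change failure rate per month,
-- from a hypothetical deployments log (column names are illustrative).
SELECT
    date_trunc('month', deployed_at)                        AS month,
    COUNT(*)                                                AS deployments,
    ROUND(100.0 * AVG(CASE WHEN caused_incident THEN 1 ELSE 0 END), 1)
                                                            AS change_failure_rate_pct
FROM deployments
GROUP BY date_trunc('month', deployed_at)
ORDER BY month;
```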
A key thing to notice is that the report finds that tempo and stability reinforce one another.
In other words, a deployment that fixes a mistake caused by the prior deployment doesn’t simply “add 1” to the deployment frequency column and negate the “-1” in the change failure rate column: the failure still counts.
Another key thing to notice is that these goals work best when development and operations teams share the same goals. Separating out the “stability” goals and assigning them to one group while another group is in charge of the “tempo” goals simply reinforces the classic silos between development and operations and pits these two groups against one another.
Solution 3. Move towards small, backwards compatible deployments and the “parallel change” model
As I mentioned above, I’m a fan of deploying to live systems, and of making small, incremental changes that aren’t even noticeable to customers.
In my experience, designing these deployments, building test scenarios, and creating pipelines and automation that ensure the right folks review the right changes takes a bit of time. A team doesn’t go from quarterly “big bang” database deployments to daily “parallel change” deployments overnight.
The good news is that many of the patterns and practices used for application code are directly transferable to database code. Your team doesn’t need to reinvent the Agile wheel in order to make progress with database development and deployment.