Developing Low-Maintenance Databases

Alex's team of developers is geared to rapid development of database applications in a busy corporate setting, yet takes considerable time over meticulous database design, extensive constraints, automated tests, error logs, and defensive coding. Why? Because it cuts down on the subsequent need for maintenance.

The Maintenance task

It can cost more to maintain a mature software application than it did to develop it in the first place, and this is particularly the case for systems built on top of an RDBMS. To keep an application in use, both developers and DBAs may be forced to spend considerable time on maintenance tasks such as:

  • Fixing bugs and deploying the fixes
  • Changing the functionality because the requirements have changed
  • Cleaning up the data
  • Dealing with concurrency issues; troubleshooting deadlocks, for example
  • Speeding up slow queries

If the system that we develop experiences many of these problems in production, it erodes both the users’ confidence and the team’s morale, on top of the raw cost. When our users encounter issues or have to deal with inconsistent behavior, it prevents them from completing their tasks effectively. When we who are tasked with maintaining the system have to fix so many problems on top of developing new features, it is bad for our work-life balance. Also, the time spent troubleshooting and fixing is time not spent on other, usually more interesting and productive, work. As a result, unreliable systems may eventually lose both their users and their developers.

This is avoidable.

Writing Maintainable Code

If an application is intended to provide a long-term solution, it must be maintainable. To achieve this, it usually makes sense to use development practices that ensure that our system is robust and easy to change, thereby reducing some of the maintenance burden. This is especially true in Agile environments: Here, we are always just a few days away from the next release. It is difficult to justify spending a day on the interesting challenge of tuning queries or troubleshooting deadlocks when all the pressure is to concentrate on tomorrow’s deployment.

Roughly four years ago, my team and I began using Agile methodology, and it soon became clear that, in our environment, it usually pays to develop highly robust solutions that “just work” with little in the way of maintenance, and are easy to modify when required. Over these years of intensive Agile development, I have come up with a set of approaches that allow us to do this.

In this article, I’m continuing my usual practice of sharing my own experience and describing only the approaches that work in real life. Although I read and learn a lot, and much of the content in this article has been learned from many people and many sources, everything I describe here has worked over the long term in our system.

Steps towards making databases more maintainable

I am not going to make any blanket recommendations or lay down hard-and-fast rules – the solutions and approaches that make sense in our environment might be overkill, or just plain wrong, in other circumstances. Yet I hope that some of the advice in this article will be useful to somebody else, so let me share the steps we have taken to reduce maintenance costs by using better development practices.

Insulate the database behind a well-established interface

We do not expose tables and views to our users. Instead, we provide an API of stored procedures. This allows us to change the procedures, or the underlying tables, without the risk of breaking applications. The little bit of extra work needed to wrap a DML statement in a procedure is highly useful insurance against possible changes later on, making changes down the road much easier.
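
For illustration only, a wrapper along these lines is all it takes (the table, column and procedure names here are invented for the example):

    CREATE PROCEDURE dbo.SaveCustomerPhone
        @CustomerId INT,
        @PhoneNumber VARCHAR(20)
    AS
    BEGIN
        SET NOCOUNT ON;

        -- The application only ever calls this procedure. If we later split or
        -- rename the underlying table, only this procedure needs to change.
        UPDATE dbo.Customers
        SET PhoneNumber = @PhoneNumber
        WHERE CustomerId = @CustomerId;
    END;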

Later in this article, we shall discuss many examples where this insulation proves to be extremely useful.

Note: We decided against exposing views because we want to be able to break a complex query into smaller and simpler parts when needed, storing intermediate results in table variables or temporary tables.

Automate database testing, and maintain good test coverage

If longevity is required of a system built on top of an RDBMS, it is essential to have a rigorous harness of automated tests. Our automated tests allow us to dramatically reduce the number of bugs. They also enable us to change our system much more easily, further reducing maintenance costs. For more detail, refer to this article: http://www.simple-talk.com/content/print.aspx?article=1419.
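
As a bare-bones illustration (not a description of our actual harness), even a plain T-SQL script can act as an automated test; the procedure being tested and the expected value below are the invented ones from the previous sketch:

    -- A minimal test: exercise the procedure under test and fail loudly on a mismatch.
    -- A real test would also set up its own test data first.
    DECLARE @PhoneNumber VARCHAR(20);

    EXEC dbo.SaveCustomerPhone @CustomerId = 1, @PhoneNumber = '555-0100';

    SELECT @PhoneNumber = PhoneNumber
    FROM dbo.Customers
    WHERE CustomerId = 1;

    IF ISNULL(@PhoneNumber, '') <> '555-0100'
        RAISERROR('SaveCustomerPhone did not persist the phone number', 16, 1);

A dedicated unit-testing framework such as tSQLt makes it much easier to organize and run hundreds of tests like this one.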

According to Darwin’s theory, it is not necessarily the ones who are currently the strongest and the fastest who survive in the long term. It is the most adaptable species that eventually wins because they retain the ability to evolve and adapt to changing requirements.

Without automated test harnesses, it is difficult for RDBMS systems to survive by adapting to changing environments.

Keep database design up-to-date

Database design is the foundation on top of which we build the rest of our systems. If we manage to get it right, and keep it right when the requirements change, we can save a lot of development and maintenance costs.

If we fail to keep the database design up-to-date as we change the system, we must use workarounds, a.k.a. kludges, a.k.a. duct-tape solutions, to get the job done. Typically, such workarounds slow down the process of changing the system, increase its complexity, hurt performance and, as a result, drive up maintenance costs.

In other words, developing against the wrong database schema is as slow and painful as running a marathon in dress shoes that are two sizes too small. As our feet and our requirements change, the shoes and the database design should adjust accordingly.

So, to keep maintenance costs low, it is essential that we have high quality database design that correctly reflects our current requirements, and adjusts as needed when the requirements change.

In theory, developing databases is no different from other kinds of development: if we want a long-term solution that is easy to maintain, then the design must be done properly, without taking shortcuts or cutting corners. In practice, however, refactoring databases, especially large ones, is difficult, so it is very tempting to go for an easy workaround rather than spend more effort doing it right. Unfortunately, easy workarounds in database design cause systems to become slow and brittle, so they usually cost a lot more in maintenance.

Use proven techniques in database refactoring

When we need to refactor tables, it makes a lot of sense to learn from others’ experience. The following book is essential reading on the subject: Refactoring Databases: Evolutionary Database Design (Addison-Wesley Signature Series) by Scott W. Ambler and Pramod J. Sadalage, http://www.amazon.com/Refactoring-Databases-Evolutionary-Database-Design/dp/0321293533.

Design tables defensively

In some cases, it is possible to design tables defensively, so that they are less likely to need to change, and are easier to change when the need arises. Suppose, for example, that right now the following table meets our needs perfectly well (for brevity, we have skipped the check constraint that verifies that the number is in the proper range, as well as other columns not relevant to our example):
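
Something along the following lines would fit the description (the exact definition, including the column names, is an assumption made for this example):

    CREATE TABLE dbo.Accounts
    (
        SSN CHAR(9) NOT NULL
            CONSTRAINT PK_Accounts PRIMARY KEY,
        FirstName VARCHAR(50) NOT NULL,
        LastName VARCHAR(50) NOT NULL
        -- check constraint on SSN and other columns omitted for brevity
    );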

In the short term, we can use this table as-is, and add child tables that refer to this natural primary key with foreign key constraints. However, there is some doubt as to whether the US Social Security Number will remain our primary key for the long term.

We might, for example, have to expunge the SSN from our system altogether for privacy reasons, triggering a very expensive and risky refactoring. This has actually happened. Additionally, we might want to expand to other countries, and we have no idea whether citizens of all other countries have unique identifiers issued by their governments, nor do we know whether all such identifiers are numeric, or even unique.

To minimize the risk of an expensive refactoring in the event of such a change, we might want to add an integer identity column named AccountId to our Accounts table, create the primary key on it, and have all child tables refer to it. Admittedly, this identity column is not strictly necessary right now, and having it incurs some overhead: the table uses more storage and DML against it gets slower. However, if the situation changes as discussed above and we need to refactor our database, the refactoring is going to be much easier.
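
Sketched out, the defensive version might look like this (again, the names and the child table are invented for illustration):

    CREATE TABLE dbo.Accounts
    (
        AccountId INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Accounts PRIMARY KEY,
        SSN CHAR(9) NOT NULL
            CONSTRAINT UNQ_Accounts_SSN UNIQUE,
        FirstName VARCHAR(50) NOT NULL,
        LastName VARCHAR(50) NOT NULL
    );

    -- Child tables refer to the surrogate key, not to the SSN,
    -- so removing or widening the SSN later does not touch them.
    CREATE TABLE dbo.AccountNotes
    (
        NoteId INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_AccountNotes PRIMARY KEY,
        AccountId INT NOT NULL
            CONSTRAINT FK_AccountNotes_Accounts
            FOREIGN KEY REFERENCES dbo.Accounts(AccountId),
        Note VARCHAR(500) NOT NULL
    );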

This is a typical defensive approach – we are not squeezing every last bit of performance out of our system right now. Instead, we are sacrificing some performance to avoid an expensive refactoring, or at least to make it cheaper. Similarly, we drive cars equipped with seat belts, air bags, anti-lock brakes and so on, even though without all these safety features our cars would be cheaper, would weigh less, and would use less fuel per mile.

Let us discuss another short example of defensive table design. Suppose that we need to store confirmations issued by some third party in the following table:
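
A sketch of such a table might be (the details are assumed for the example):

    CREATE TABLE dbo.Confirmations
    (
        ConfirmationNumber CHAR(6) NOT NULL
            CONSTRAINT PK_Confirmations PRIMARY KEY,
        ConfirmationDate DATETIME NOT NULL
        -- other columns omitted for brevity
    );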

Again, for brevity we have skipped a check constraint that is obviously necessary.

Let us assume that, right now, all confirmation numbers are exactly six characters long; so, in the short term, this table design fits our needs perfectly well. We can just go ahead and create child tables with foreign keys referring to this table, and develop a lot of stored procedures with the following parameter: @ConfirmationNumber CHAR(6).

If, however, we think that the requirements might change, we might go for a wider VARCHAR column, such as VARCHAR(10), and add a check constraint to enforce that confirmation numbers are exactly six characters, as follows:
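
For instance (still a sketch, with the constraint name assumed):

    CREATE TABLE dbo.Confirmations
    (
        ConfirmationNumber VARCHAR(10) NOT NULL
            CONSTRAINT PK_Confirmations PRIMARY KEY,
        ConfirmationDate DATETIME NOT NULL,
        -- for the time being, confirmation numbers are exactly six characters long
        CONSTRAINT CHK_Confirmations_NumberLength
            CHECK (LEN(ConfirmationNumber) = 6)
    );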

If, later on, we need to support a wider range of confirmation numbers, all we need to do is drop the check constraint and possibly create another one. We do not have to change the column type in this table and all its child tables, and we do not need to change parameter types in stored procedures either.
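
The change then amounts to a couple of ALTER TABLE statements along these lines (using the constraint name assumed above):

    ALTER TABLE dbo.Confirmations
        DROP CONSTRAINT CHK_Confirmations_NumberLength;

    ALTER TABLE dbo.Confirmations
        ADD CONSTRAINT CHK_Confirmations_NumberLength
            CHECK (LEN(ConfirmationNumber) BETWEEN 6 AND 10);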

Of course, this convenience does not come free – we pay with additional storage for our variable width column, and with slower DML as a result.

Refactoring live tables without downtime makes sense

Even if we do our best to avoid database refactoring, we cannot avoid it entirely. Sometimes the requirements change, and we absolutely have to change our database design accordingly. Traditionally, we are supposed to shut the system down, migrate the data to a new structure, and deploy the changed applications. This traditional way of refactoring tables can take considerable time, and it can be risky.

Our team does not always shut down the system to refactor tables. In our environment, it sometimes makes more sense to refactor tables while they are being written to and read from, without disrupting any production operations.

The following reasons explain why we may want to refactor tables gradually, without shutting the system down:

  • Making one big change all at once is risky; to lower the risk, we want to replace the requirement for one huge change with a series of small incremental ones.
  • Not all users may want the change – some may be perfectly happy with the existing functionality. We want to accommodate our users as much as possible, and for a small team it is definitely not a good idea to force an unwanted upgrade on unwilling users.
  • Refactoring large tables takes time, and is usually done during a maintenance window at the weekend. Working on weekends erodes work-life balance, making it more difficult to retain the best talent, and making developers and DBAs less productive during normal business hours.

This is why we have learned to refactor tables without any downtime. Over the last few years, we have used this approach many times without issues, so I can confidently recommend it to anyone in a similar situation. Let us discuss in more detail how we do it.

How to refactor a table without downtime

Suppose, for example, that we have a system that stores one of the most common things in IT: tickets. Suppose that tickets are stored in a table with the following design:
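
The column names below are invented for the example, but the essential point is the single AssignedTo column, which can hold at most one person per ticket (the Persons table is also assumed):

    CREATE TABLE dbo.Tickets
    (
        TicketId INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_Tickets PRIMARY KEY,
        Subject VARCHAR(100) NOT NULL,
        Status VARCHAR(20) NOT NULL,
        AssignedTo INT NULL
            CONSTRAINT FK_Tickets_Persons
            FOREIGN KEY REFERENCES dbo.Persons(PersonId)
        -- other columns omitted for brevity
    );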

Clearly, a ticket can be assigned to only one person at a time. Suppose also that this was the requirement when we originally built the system – our intent was to make sure that a ticket could never be assigned to more than one person at a time. Let us also assume that the Tickets table is quite big, that it is being modified all the time, and that many reports run off it.

Note: I am entirely making this scenario up – I have never worked on such a system. If I am not getting some details specific to such systems right, please let me know – I’ll be happy to learn.

Obviously, we would rather not change the structure of this table – lots of things can break if we do. Unfortunately, the requirements have evolved over time, and in the new version of our system we need to keep most of the existing functionality intact, but allow one ticket to be assigned to more than one person. However, quite a few users are perfectly happy with the existing functionality and do not want any changes – and we are willing to let them keep using the old system for as long as they wish.

So, in a perfect world, we would love to change the database structure as follows:
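
Sketched with the same invented names, the assignment would move out of dbo.Tickets into a separate table, so that one ticket can be assigned to any number of people:

    -- The AssignedTo column disappears from dbo.Tickets;
    -- assignments live in their own table instead.
    CREATE TABLE dbo.TicketAssignments
    (
        TicketId INT NOT NULL
            CONSTRAINT FK_TicketAssignments_Tickets
            FOREIGN KEY REFERENCES dbo.Tickets(TicketId),
        PersonId INT NOT NULL
            CONSTRAINT FK_TicketAssignments_Persons
            FOREIGN KEY REFERENCES dbo.Persons(PersonId),
        CONSTRAINT PK_TicketAssignments PRIMARY KEY (TicketId, PersonId)
    );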

The traditional way of deploying such a change consists of the following steps:

  • Shut down the system
  • Migrate all the data to the new tables
  • Change all the database modules accessing this table
  • Possibly change some client applications and reports
  • Start the system and keep an eye on it for a few days

Unfortunately, with so many changes being done at once, the chances of having at least some issues after the change are quite high. Besides, typically there is no quick and easy way to fix such issues. For example, if one of our reports is slow after the migration, we cannot just run a rollback script and have it fast again. This is why such migrations may turn out to be major disruptions, causing frustration and burnout for users, developers, and DBAs.

Instead of making all the changes at once, we make multiple incremental changes, as follows:

  • Make sure that all the modules accessing the table are well covered with automated tests.
  • Create new tables. Alter all procedures that modify the old table, so that they modify both old and new tables.
  • Migrate the existing data into the new structure. Do it in smallish batches, so that it does not seriously impact the overall performance of the server (see the sketch after this list).
  • Verify that the migration of data succeeded.
  • Redirect some of the selecting procedures from the old table to the new ones. Use automated tests to make sure that the changed modules are still correct. Make sure their performance is acceptable. Deploy the altered procedures.
  • Repeat the previous step until all the reports use the new tables.
  • Change the procedures that modify the tickets, so that they only access the new tables.
  • Archive the old table and remove it from the system.
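
As an example of the migration step, the data can be copied in small batches along these lines (a sketch only; the batch size and the table names are the assumed ones from the previous sketches):

    -- Copy assignments from the old column into the new table, 1000 rows at a time,
    -- so that the migration never holds long locks or saturates the server.
    DECLARE @RowsCopied INT = 1;

    WHILE @RowsCopied > 0
    BEGIN
        INSERT INTO dbo.TicketAssignments (TicketId, PersonId)
        SELECT TOP (1000) t.TicketId, t.AssignedTo
        FROM dbo.Tickets AS t
        WHERE t.AssignedTo IS NOT NULL
          AND NOT EXISTS
              (
                  SELECT 1
                  FROM dbo.TicketAssignments AS ta
                  WHERE ta.TicketId = t.TicketId
                    AND ta.PersonId = t.AssignedTo
              );

        SET @RowsCopied = @@ROWCOUNT;
    END;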

It is true that this approach has many steps and is not very simple; yet we love it – it is far better than the alternatives. Each step incurs very little risk and, if needed, can be rolled back in a split second. We can set the migration aside and switch to another task, or enjoy our weekend, at any time – the system keeps working in any of these intermediate states for as long as needed. Our users do not experience any disruptions.

Enforce data integrity in the database

If we need the data to be clean, but we do not enforce data integrity, cleaning up dirty data becomes a very time-consuming, recurring task – a task that is never truly finished, because sooner or later we will have to come back and do it again.

This is why it is usually very profitable to enforce data integrity: we invest some effort up front, and immediately save on much tedious maintenance. In my experience, the most efficient way to enforce data integrity is to do it in the database using trusted constraints.

Even the simplest constraints can make a huge difference if they are applied consistently. Whenever we design a table, we should be asking many simple questions, such as:

  • Can box length be zero or negative?
  • Can a movie title begin with a space?
  • Can a movie title have two consecutive spaces?

If the answer to all these questions is “no”, we should use check constraints. Otherwise, we are exposed to problems and will have to clean up the data from time to time. Besides, we might not know where the dirty data comes from, and will therefore spend precious time troubleshooting to find out. If, however, data integrity is enforced in the database, we get an explicit error message right away, allowing for much easier troubleshooting.
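
For example, the answers to the questions above translate directly into check constraints like these (the table and column names are invented for the example):

    ALTER TABLE dbo.Boxes
        ADD CONSTRAINT CHK_Boxes_PositiveLength
            CHECK ([Length] > 0);

    ALTER TABLE dbo.Movies
        ADD CONSTRAINT CHK_Movies_Title_NoLeadingSpace
            CHECK (Title NOT LIKE ' %'),
            CONSTRAINT CHK_Movies_Title_NoDoubleSpaces
            CHECK (Title NOT LIKE '%  %');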

For example, a table where movie titles have no constraints may store the following different rows referring to the same movie:

  • “ True lies”, beginning with a space
  • “True  lies”, having two spaces in the middle
  • “True lies”, the correct title with one space in the middle

Even though we have a unique constraint on movie title and year, it will not prevent duplicates like these.

Besides simple uses like the ones we have just discussed, over the years I have gradually learned and used some very advanced solutions involving constraints. Some of these solutions seem to be completely new, such as the one described in this article by Joe Celko: http://www.simple-talk.com/sql/t-sql-programming/contiguous-time-periods/. We have been using such solutions for several years already, enforcing complex data integrity rules in the database via trusted constraints. This works really well for us – once we have set up our constraints, they just work, saving us a lot of data cleaning.

As we discussed before, we are not making any blanket statements here. We are not recommending that all systems enforce all data integrity rules in the database – far from it. For some databases, it is not really worth investing considerable effort to ensure that all data is 100% accurate.

For example, we expect our bank statements and our medical records to be completely accurate. On the other hand, if someone is doing statistical research and one data point in a million is wrong, that is probably OK. In most cases we would probably invest less effort in enforcing data integrity for the research database.

Log all error messages coming from the database

Well-designed and well-implemented databases should not raise too many errors. When they do, however, the errors must be reliably logged in a convenient location – a complete and easily accessible error log enables us to troubleshoot faster. While I admit there could be exceptions to this rule, off the top of my head I cannot come up with any – by default, we should make sure that all errors coming from the database are logged.
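
One simple way to accomplish this, sketched here with an assumed ErrorLog table, is to wrap procedure bodies in TRY...CATCH blocks, log the error details, and re-raise the error for the caller:

    CREATE TABLE dbo.ErrorLog
    (
        ErrorLogId INT IDENTITY(1,1) NOT NULL
            CONSTRAINT PK_ErrorLog PRIMARY KEY,
        LoggedAt DATETIME NOT NULL
            CONSTRAINT DF_ErrorLog_LoggedAt DEFAULT (GETUTCDATE()),
        ErrorNumber INT NOT NULL,
        ErrorProcedure SYSNAME NULL,
        ErrorMessage NVARCHAR(4000) NOT NULL
    );
    GO
    CREATE PROCEDURE dbo.SomeWriteProcedure
    AS
    BEGIN
        SET NOCOUNT ON;
        BEGIN TRY
            -- ... the actual modification goes here ...
            PRINT 'doing some work';
        END TRY
        BEGIN CATCH
            INSERT INTO dbo.ErrorLog (ErrorNumber, ErrorProcedure, ErrorMessage)
            VALUES (ERROR_NUMBER(), ERROR_PROCEDURE(), ERROR_MESSAGE());

            THROW; -- re-raise the original error (SQL Server 2012 and later)
        END CATCH;
    END;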

Develop code defensively

This is a huge topic, and I have written a whole book describing various useful defensive techniques: http://www.red-gate.com/community/books/defensive-database-programming. However, that was a while ago, and we have come up with a few more useful techniques since then. The most important one, the one that saves us the most time in avoided maintenance, is the set of practices that keeps our mixed-load, OLTP-ish system almost entirely free of deadlocks. This is a huge time-saver and, as such, it deserves its own section in this article.

Avoid deadlocks (if it makes sense to you)

Clearly the advice in this section might not be useful for everyone, but here it is anyway: our system could have lots of deadlocks and, several years ago, it did, just like many other mixed-load, OLTP-ish systems. To save on maintenance costs, we started eliminating deadlocks, and eventually ended up with a system that has almost no deadlocks at all.

Last year, in 2011, our system experienced only five or six deadlocks, all within just a few hours and all following the same scenario. We quickly fixed the root cause of those deadlocks, and there were no deadlocks for the rest of the year.

Having no deadlocks was not the goal of our activity – our motivation was to deliver a rock solid system and earn users’ respect and confidence, and to reduce maintenance cost.

Of course, another alternative would be to make sure that deadlock victims are automatically retried, so that they are not visible to our users. After careful consideration, we estimated that in our particular situation, it was cheaper and easier to eliminate deadlocks. Naturally, in other situations it might be better to retry after deadlocks. As stated before, there are no hard-and-fast rules in this article.

The following steps ensured that deadlocks do not occur in our system:

  • All readers use snapshot isolation, so that they do not get caught up in deadlocks with writers. Of course, tempdb must be able to handle this workload.
  • All writers serialize on a shared resource before actually modifying data, as described in the article “Developing modifications that survive concurrency” and sketched after this list. Naturally, the need to serialize slows down modifications.
  • We strongly prefer to modify multiple rows at a time, using TVPs and staging tables: This speeds up modifications a lot, and compensates for the slowdown caused by serializing.
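
As a rough sketch of the serialization idea only (the linked article covers the real details; the procedure and the resource name here are assumptions), every writer can take the same exclusive application lock at the start of its transaction:

    CREATE PROCEDURE dbo.SaveTicketAssignment
        @TicketId INT,
        @PersonId INT
    AS
    BEGIN
        SET NOCOUNT ON;
        SET XACT_ABORT ON;

        BEGIN TRANSACTION;

        -- All writers acquire the same application lock, so modifications run
        -- one at a time; the lock is released automatically at COMMIT or ROLLBACK.
        -- Production code should also check the return code of sp_getapplock.
        EXEC sp_getapplock
            @Resource = 'TicketModifications',
            @LockMode = 'Exclusive',
            @LockOwner = 'Transaction';

        INSERT INTO dbo.TicketAssignments (TicketId, PersonId)
        VALUES (@TicketId, @PersonId);

        COMMIT;
    END;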

Because our users are only exposed to stored procedures, we were able to implement all these steps without affecting client applications.

Anyway, we made this rather controversial decision to eliminate deadlocks based on our own benchmarks and other considerations valid in our environment. I encourage you not to follow this example without careful consideration. Make your own decision – it might make more sense for you to retry after deadlocks in your stored procedures, or to let the client handle them. However, I do encourage you to make an informed decision, carefully comparing and benchmarking different approaches.

Ensure predictable performance of your queries whenever possible

In general, speeding up queries is a very interesting challenge. However, it is preferable to enjoy that challenge on our own terms, at a time of our choosing, and ideally before we release into production.

Obviously, if we fail to maintain acceptable query performance, it can ruin our users’ day, erode their confidence, and ruin our day as well.

So, when we deliver queries that perform well right now and their performance stays acceptable in the long term, this lowers maintenance costs – we do not have to spend precious time troubleshooting.

There are many useful approaches and techniques that help us to accomplish this. The topic, however, is huge, and deserves at least an article of its own; as such, it is beyond the scope of this already quite large article.

Conclusion

As we have seen, there are many useful approaches that may help us in developing low maintenance databases. Whenever we are working on systems that are supposed to last, we should consider using such approaches, because maintenance is the single most expensive activity in the life of a mature system. However, we should never ever blindly accept any recommendations – decisions that make sense in certain circumstances might not be best in other situations.

None of the scenarios and examples described in this article are blanket recommendations – they are examples; they demonstrate that when we are proactive we can save a lot on maintenance. I do not encourage you to copy my examples – I do encourage you to make your own informed decisions that fit your requirements, based on your own careful analysis. Good luck!