The CHECK() Constraint

Comments 0

Share to social media

One of the many ways a relational table differs from the file structures used by pre-relational storage systems is that the tables, rows and columns can have constraints on them. This allows you to reduce the types of bad data that can be loaded into your tables. This lets the database do a lot of the work that you would’ve had to do an application code in the old days (and current times when using file-based storage).

This is an incredible advantage. With COBOL or FORTRAN programs, the data storage platform did nothing to take care of itself. Everything was in the application code. If you talk to an old timer, will find stories about mounting the wrong tape on a tape drive, and then running a program on it that made no sense. This occurred because a lot of times, files were fixed width 80 column card images, so the application program would simply take those images and partition them into fields. Please remember that in SQL, each column has a known data type in and of itself; this datatype is not dependent on the query using that table.

One of my favorite horror stories involves programmers mounting a tape from a medical machine and running the data through the hospital payroll program. The program got remarkably far before failing. Of course, by that time, the printer had destroyed a lot of blank checks.

The most basic constraint: PRIMARY KEYs

The problem is that newer SQL programmers (and even some more experienced ones) don’t take advantage of these constraints. Consider this slight rewrite of an actual posting on an SQL forum. In the original, the poster thought that identifiers should be integers and that ISO–11179 naming rules do not apply, so I cleaned it up a little, but this will make my point.

What’s the first thing you notice about this table? It ought to be that this is not a table by definition! Remember your very first classes on RDBMS? There’s no key. A key is not an option; this is the definition. Since all the columns in this first posting can be NULL, that DDL could never be correct. Most relational products allow this, which has some utility when moving data around, but this is not how a proper table ought to be designed.

Let’s assume a patient can have more than one appointment, just not in the same day. That means the key for this table must be made up of at least two columns, a patient and a date. First, we must make those two columns NOT NULL. There’s also an assumption of only whole days being used for appointments, let’s ignore that for now and then worry about time slices later.

All it takes is a simple “cut and paste” to get the column names into a PRIMARY KEY () constraint declaration. Yet if you go on virtually any SQL forum, you’ll find that most of the posters have not done this.

Checking for more than uniqueness

There’s nothing in this DDL to keep an appointment from ending before it begins. We can fix that with a CHECK() constraint.

Now the user cannot put in an appointment with a start date greater than the end date. However, the appointment_end_date column allows NULL values. How will this affect the use of the CHECK constraint?

I am always surprised by how many SQL programmers don’t know the answer to this question about UNKNOWN results in the DDL. If the Boolean expression of a CHECK constraint’s search condition comes back UNKNOWN as result of having a NULL in a test, then it’s treated as if it were TRUE, and the search condition accepts the row. Think of it as a “benefit of the doubt” feature in SQL. This means that when I have an appointment and I don’t know when it’s going to end, I can use (SET appointment_end_date = NULL) to model an open-ended appointment.

If you wish to limit some data based on NULL, then you have to specifically include that in your CHECK constraints. So if you want to say that appointments after ‘2024-01-01’ are required to have an appointment end date, you can add another CHECK like: (appointment_end_date is not NULL AND appointment_start_date >= '2024-01-01'). This leaves open the appointment_start_date to be NULL or not NULL when appointment_start_date is earlier than '2024-01-01')

Performance benefits of CHECK constraints

I am also surprised how many SQL programmers don’t know that the search conditions in the CHECK() clauses are used by the optimizers. The more you can put into the DDL, the better the DML will run.

Of course, CHECK constraints do have a slightly negative effect on INSERT statements but the positive effect they have when queries are executed that contradict them is worth it, speaking only about performance.

But even with no performance gains, it means if we’re applying our business rules in one place and one way. There is no need to worry about the same rule having two or more definitions in the system. The fact that the data matches the conditions is worth a few micro-seconds checking the data.

Defaults can help you too

Another feature which is very handy but is way underused is the DEFAULT clause. This feature says to use the default value in insertions when no explicit value is given. Normally, the “default default” is NULL without an explicit default value, but this assumes that the column involved is NULL-able.

A lot of the time, this is not the case in your data model. The default can be a zero, a one, the current timestamp, the current user or some special code The most common conventions are “NA” for not available, “TBD” for to be determined, “NMI” for no middle initial and so forth. In some cases it can be difficult to choose a value to mean none, because these values could be confused with initials of Neal Anderson, Texas Belly Dancers, and No More Ice. Having the default set makes it easier for people to meet some CHECK constraints that require a certain set of values.

These conventions were derived in the days of paper forms and punch cards because simply leaving a blank space might tell you that you simply had not filled in the form completely or that there really was a missing value.

Don’t forget the moniker

Finally, a CHECK() constraint can (and should) be named. This name will show up in error messages and can be used to reference the constraints in the DDL. Again, this might not be a big deal when you’re posting in a forum or writing test code, but it’s an incredible help when writing production code, ideally not right in your production database. Having the same name in each environment helps you know when something is the same constraint, and when it is not.

The CHECK() constraint can be attached to the end of a column declaration if the constraint applies only to that column. An example of that might be CHECK (order_qty > 0) or CHECK(order_qry >= 0) on the order quantity column in a table that models commercial orders.

In fact, in general , you should frequently have CHECK() clauses that define the domain for a value. For example, for quantities equal to or greater than zero like this because most SQL implementations do not have an integer data type for natural numbers. There may be reasons to allow negative values for quantity, so be sure and check your requirements first.

Another horror story from the early days of SQL was a schema that incorrectly allowed for negative order quantities. Such things were processed as refunds. One of the programmers found this design flaw, refunded themselves thousands of dollars, and then left the company. A very crooked early retirement plan! These simple CHECK() and the NOT NULL constraints should be automatic if you’re going to be a good SQL programmer.

You can have more than one CHECK

When I taught SQL classes, there was a misunderstanding that the table could have only one CHECK() constraint. This is not true, and what you want to have is one constraint for every condition you wish to CHECK(). Little harder to see is that a single column can be referenced in more than one CHECK() constraint. It is all right to write something like:

The named hot and cold porridge constraints overlap the general porridge temperature constraint. This is not a problem, and it simply means that more than one constraint can be violated in a time. Whether or not you want to use two constraints as shown, or one with an AND like: (porridge_temperature <= 55 AND porridge_temperature >= 45) can come down to error messages. Was the value too hot or cold? Combined the name of this constraint might be “improper_temperature_porridge”, which works, but doesn’t indicate it as well as two constraints.

At this point you probably need to know something about your particular SQL product and how it’s going to present this information to you. Many products will give you only the first error that occurs, so if you violate 20 of them, it can be interesting to work through them one at a time.

An argument against constraints

One argument against using the database to protect data is that it feels like it should be part of an application and not a table declaration. This article did not, nor will it, broach the subject of application code and it’s needed to protect data.

Some logic legitimately needs the user’s feedback. Some are very time based. Some need a manager to allow overrides to the basic business rules. Most of the rules you put in the database ought to be a part of the application as well. For example, if you do have 20 constraints on a table, you will very likely cover most of them in the UI. Username required, Last Name required, etc. You may even have optional rules that you want to ask the user about “You did not include a First Name, was that on purpose?”

But every bit of data protection you put in the database will work in the first, second and one hundredth UI. Also when the new DBA starts and imports a load of data… but has the punch cards upside down.

Give it a try

A final consideration is that SQL has a lot of useful predicates. I’ve shown you sample code with the BETWEEN, but we didn’t get into using string and mathematical functions; nor did we delve too deep into complex Boolean expressions to construct elaborate pieces of logic. In fact, pretty much any expression that returns a Boolean can be put into a CHECK constraint, depending on your SQL engine’s implementation.

As an exercise, why don’t you try to write a simple check digit constraint for a UPC code in the DDL of the table? It will be a little messy, but once it’s in place you’ll never have to do it again.

 

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.