Constraints and declarative referential integrity

Database systems have powerful features built right in that can ensure the integrity of data. In this article, Joe Celko discusses keys and constraints.

When people post examples on SQL forums, they don’t post tables. By definition, a table must have a key; look at how often it’s left off. Likewise, look at how often the columns are NULL-able so that it’s not physically possible for their posting to ever have a key. They are not posting tables but punchcard file definitions from the 1960s written in SQL. One of the most neglected features, as well as the most useful, is declarative referential integrity (DRI).

Before we had RDBMS and SQL, we had network or navigational databases. These were based on the work of Charles Bachman, who received an ACM Turing award for his work. Conceptually, the navigational model imagined that the program was traveling along pointer chains to nodes in a graph, where it would pick up a piece of data in sequence. This model is still stuck in people’s minds today, even though it’s not relational.

Keys

By definition, a table must have at least one key. A key is defined as a subset of columns in the table that is unique for every row. When someone posts an example on a newsgroup and tries to tell you that they want to remove “duplicate keys”, you know they have no idea what they’re talking about.

The concept of a PRIMARY KEY is actually a little weird. The reason it exists at all is that the early versions of SQL were built on top of existing filesystems. A file has a physical ordering and a preferred access method in Bachmann’s model. In the SQL Server world, we inherited the IDENTITY table property from the UNIX file system that Sybase used. It is not a column since it does not model an attribute of the entities in the table. Because of that, it can never be a key. When I’m trying to teach RDBMS to people, I tell them that the identity property is like the parking space number in the garage. You can use it to locate your car at the current moment in the current garage, but it’s not an attribute of the automobile itself; that’s why we have VIN numbers.

CREATE TABLE statements

The CREATE TABLE statement does all the hard work. The basic syntax for base table declarations looks like this:

A base table is a table that has an actual physical existence in some kind of persistent storage. The virtual table is just as real to SQL as a base table. I’m not going to bother with discussing temp tables and other ways the table can come into existence.

Column definitions

Beginning SQL programmers often fail to take full advantage of the options available to them, and they pay for it with errors or extra work in their applications. A column is not like a simple passive field in a file system. It has more than just a data type associated with it.

Let’s take a quick look at the column constrain options.

DEFAULT clause

The default clause is an underused feature whose syntax is

Whenever the system does not have an explicit value to insert into this column, it will look for its DEFAULT clause and use that value. The default option can be a literal value of the relevant data type or something provided by the system, such as the current timestamp, current date, exit value from a SEQUENCE, etc. For this article, don’t worry about how sequences work. If you do not provide a DEFAULT clause and the column is NULL-able, the system will provide a NULL as the default. If all that fails, you will get an error message about missing data.

This is a good way to make the database do a lot of work that you would otherwise have to code into all the application programs. The most common tricks are to use a zero in numeric columns, a string to encode a missing value ('{{unknown}}') or a true default ('same address') in character columns, and the system timestamp to mark transactions.

NOT NULL Constraint

The most important column constraint is the NOT NULL, which forbids using NULLs in a column. Use this constraint automatically, then remove it only when you have a good reason. It will help you avoid the complications of NULL values when you make queries against the data.

The NULL is a special value in SQL that belongs to all data types. SQL is the only language that has such a creature; if you can understand how it works, you will have a good grasp of SQL. In Relational Theory, the NULL has no data type, but in SQL, we have to allocate storage for a column that has a data type. This means we can write CAST (NULL as <datatype>) in our code.

CHECK() constraint

The CHECK() constraint tests the rows of the table against a logical expression, which SQL calls a search condition, and rejects rows whose search condition returns FALSE. However, the constraint accepts rows when the search condition returns TRUE or UNKNOWN. This is not the same rule as the WHERE clause, which rejects rows that test UNKNOWN. The reason for this “benefit-of-the-doubt” feature is so that it will be easy to write constraints on NULL-able columns.

The usual technique is to do simple range checking, such as CHECK(rating BETWEEN 1 AND 10), or to verify that a column’s value is in an enumerated set, such as CHECK(sex_code IN (0, 1, 2, 9)). Search conditions should be limited to the table, and all search conditions are always true on an empty table. You can also only write search conditions that can be tracked and tested consistently; that means you can’t grab the current system date in a CHECK() clause since it’s dynamic. CHECK() constraints are usually attached to one column, but you can use the named constraint to CHECK() multicolumn conditions.

While it is optional, it is a really good idea to use a constraint name. Without it, most SQL implementations will create a huge, ugly, unreadable random string for the name since they need to have one in the schema tables. If you provide your own, you can drop the constraint more easily and understand the error messages when the constraint is violated.

The real power of the CHECK() clause comes from writing complex expressions that verify relationships one way, one time, and in one place. To give a common example, consider two programmers get specs, and one of them reads the requirement as “we must have at least one foobar in inventory”, while the second programmer reads the same requirement as “zero or more foobar’s in inventory” Instead. These translate into CHECK (foobar_qty > 0) and CHECK (foobar_qty >= 0) in code. Back when this code was Fortran, BASIC or COBOL, both constraints could appear the same system, in different procedures. In SQL, however, we can quickly check and see which constraint is used in which table, regardless of how many hundreds or thousands of code modules touch that table.

UNIQUE and PRIMARY KEY Constraints

The UNIQUE constraint says that no duplicate values are allowed in the column. It comes in two forms. The syntax is:

<UNIQUE specification> ::= UNIQUE | PRIMARY KEY

There are some subtle differences between UNIQUE and PRIMARY KEY. There can be only one PRIMARY KEY per table but many UNIQUE columns. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can have a NULL in a column unless you explicitly add a NOT NULL constraint. A UNIQUE column can have one and only one NULL and it. But now things get tricky.

Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to the PRIMARY KEY to document the table and to be sure it stays there when the key changes.

File system programmers understand the concept of a PRIMARY KEY, but for the wrong reasons. Their mindset is a sequential file, which can have only one key because that key is used to determine the physical order of the records within the file. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which we will discuss later.

There is also a multiple-column form of the UNIQUE and PRIMARY KEY, which is usually written at the end of the column declarations. It is a list of columns in parentheses after the proper keyword; it means that the combination of those columns is UNIQUE. For example, I might declare PRIMARY KEY (city_name, department_name) to be sure that though I have offices in many cities and many identical departments in those offices, there is only one personnel department in Chicago.

Each of the columns in a compound uniqueness constraint can have its own constraints. Then it’s possible to have more uniqueness constraints that overlap. In fact, you can build a pretty complicated web of such things. The advantage of this besides the “one way, one time, one place” Is that the search conditions are passed to the optimizer. Optimizers are basically logic engines that try to give you the best execution plan, so your queries and other statements will only get better.

REFERENCES Clause

The <references specification> is the simplest version of a referential constraint definition, which can be quite tricky. For now, let us just consider the simplest case:

This clause relates two tables together, so it is different from the other options we have discussed so far. This says that the value in this column of the referencing table must appear somewhere in the referenced table’s column that is named in the constraint. Notice the terms referencing and referenced. These terms are not the same as the parent and child terms used in network databases. Those terms were based on pointer chains that were traversed in one direction; that is, you cannot find a path back to the parent from a child node in the network. Another difference is that the referencing and referenced tables can be the same table. Self-references can be a useful trick.

Furthermore, the referenced column must have UNIQUE constraint. For example, you can set up a rule that the Orders table will have orders only for goods that appear in the Inventory table. You will not need procedural code for this, but you will need code to any constraint violations and handle them as exceptions. This is where switching from a procedural mindset to a declarative mindset comes into play.

If no <reference column> is given, then the PRIMARY KEY column of the referenced table is assumed to be the target. This is one of those places where the PRIMARY KEY is important, but you should always play it safe and explicitly name a column. There is no rule to prevent several columns from referencing the same target column. For example, we might have a table of flight crews with pilot and copilot columns that both reference a table of certified pilots.

A circular reference is a relationship in which one table references a second table, which, in turn, references the first table. The old gag about “you cannot get a job until you have experience, and you cannot get experience until you have a job!” is the classic version of this.

Now I’m going to get philosophical. In the implementation, many products repeat the referenced value as a column in the referencing table. In the old days, with network databases, a pointer would have pointed to the referenced data element and not duplicated it. There are SQL products that build these pointer networks, in particular WATCOM SQL. In these products, the referencing table uses a pointer and not a duplicated value. We are actually looking at a true reference and not a copy. This is the mindset you should have when designing a schema, regardless of the implementation.

Referential Actions

The basic REFERENCES clause can have two sub-clauses that take actions when a database event changes the referenced table. The two database events are updates and deletes and the sub-clauses look like this:

When the referenced table is changed, one of the referential actions is set in motion by the SQL engine.

1) The CASCADE option will change the values in the referencing table to the new value in the referenced table. This is a very common method of DDL programming that allows you to set up a single table as the trusted source for the value of a data element. This way the system can propagate changes automatically.

2) The SET NULL option will change the values in the referencing table to a NULL. Obviously, the referencing column needs to be NULL-able.

3) The SET DEFAULT option will change the values in the referencing table to the default value of that column. Obviously, the referencing column needs to have some DEFAULT declared for it, but each referencing column can have its own default in its own table.

4) The NO ACTION option explains itself. Nothing is changed in the referencing table, and it is possible that some error message about reference violation will be raised. If a referential constraint does not specify any ON UPDATE or ON DELETE rule, update rule, then NO ACTION is implicit.

Full ANSI/ISO Standard SQL has more options about how matching is done between the referenced and referencing tables. Those two simple options will cover 80 – 90% of your work in the real world. That’s why we chose them for the language.

Full SQL also has deferrable constraints. This lets the programmer turn a constraint off during a session so that the table can be put into a state that would otherwise be illegal. However, at the end of a session, all the constraints are enforced. Many SQL products have implemented these options, and they can be quite handy.

.

In SQL Server, we can turn a CHECK() constraint on or off explicitly, Instead of declaratively. In particular, if a table has to reference itself, you can then turn off CHECK() constraint, insert an initial row, turn on the CHECK() constraint and begin processing. If you’d like to see how handy this can be, look at this old article I wrote on contiguous time periods. The trick is from Kuznetsov. It guarantees that as rows are added to a table of events (event begin time, event end time), there are no gaps or overlaps. It is all done declaratively. Here’s an article with an example.

Conclusion

In a declarative language like SQL, most of the work for your schema should be done in the DDL, and not in the DML. Don’t get lazy and keep writing your SQL as if you were still using COBOL or Fortran 50 years ago. You might want to consider going through some of your old code and adding some more clauses to it to see if things improve.