Unique Experiences!

You'd have thought that a unique constraint was an easy concept. Not a bit of it: it can cause a lot of subtle problems in database designs. Joe Celko goes over the ground of unique keys, primary keys, foreign keys and constraints.

This is my second article on DDL tricks. The first article (‘Constraint Yourself!’) dealt with DDL constraints for one or more columns within the same table. But the “unit of work” in SQL is not the table! It is the entire schema. In file systems, each magnetic tape or disk pack was physically and logically separate from the other files in the enterprise. It was the restrictions of those “silos of data” that provided the motivation for developing databases. Enterprises were like a man with a thousand mechanical watches and clocks who then suddenly needs to have them all synchronized and accurate to a micro-second. Oops! Impossible.


Data modeling is related to, but not actually part of, RDBMS. In an ideal world (especially if you are a data modeler or an enterprise-level DBA who wants full employment) your entire enterprise would have a single, honking big data model, most of which could be programmed into a database. It leads to the idea that data elements have one, and only one, name and one definition over the entire enterprise or (better) the entire industry (e.g. the VIN applies to all automobiles in the world).

At a basic level, we divide tables into those that either model entities or model relationships among those entities. A table cannot do both at once, please. SQL does not have explicit syntax for these concepts, so you have to do it yourself. The way you do this is with a relational key. Yes, you can declare a “table” in SQL without a key at all, in violation of the relational model. This means that you have a file system that happens to be written in SQL. This misses the whole point of RDBMS, and costs you all the advantages we gained from replacing file systems.

What is a Key?

Going back to foundations, every table has to have one or more keys. A key is defined as a subset of the non-NULL attributes (columns) in the table which are unique for every row in the table. A key can be one or more columns. Please notice this is a logical concept. A key has nothing to do with how the data is physically stored. A key is not the physical location of the data like a record number on a tape or a track and sector number on a disk drive. They have nothing to do with the order of insertions into physical storage, so the SQL Server IDENTITY and any other auto-numbering device cannot, by definition, be a relational key.

Tables have no concept of ordering like a magnetic tape file or other sequential storage media. Rows are not inserted, updated or deleted from tables. Entire sets of rows are inserted, updated or deleted from tables; if the set involved in the operation has zero, one or more rows in it, this is just fine.

However, go to newsgroups and see how many posters refer to the “next” or “previous” row in a table. This is a major conceptual problem left over from sequential files. Early programming languages (that means FORTRAN and COBOL) defined their data in terms of contiguous physical storage in their standards (Algol was the first abstract language standard). The split of logical versus physical is very hard for most people. In good SQL products, the physical location can be changed by the system on the fly, while the logical key is always the same. Bytes did not move around on their own in file systems!

In RDBMS, a key is how you find an instance of an entity (the table of people, like John and Mary) or a relationship (John and Mary’s marriage). When Dr. Codd first invented the Relational Model, the sequential file model was the only mindset we had. You do have a “next” or “previous” record in a sequential file. Locating data depends on a sorted order, so you can find records without having to do a random search. This was especially true for punch card decks, which could be read in one and only one direction.


While we make fun of magnetic tapes today, they were the next advance in sequential file systems. If nothing else, when you dropped a tape on the floor, you did not scream and beat your head into a wall. Magnetic tapes depend on sorted ordering — you do not merge tapes randomly. Trust me, if all the tapes are not sorted the same way, then you are dead in the water. Please do not ask how I learned this simple truth; it is embarrassing (forget one field in the sort key and it almost works; almost is a deadly word, but I am a dinosaur and I digress).

That mindset continued when Dr. Codd invented the Relational Model, so he required that a table have a PRIMARY KEY. This mimicked the sort-order key of a magnetic tape. That became the PRIMARY KEY concept in SQL engines, which worked well with the sequential-access and indexed file systems that early SQL products were built upon. Later, Dr. Codd realized that a key is a key is a key. Oops! By then it was too late, and PRIMARY KEY was part of SQL. The file systems upon which early SQL products were built could use that information for optimizing data access, so we left it in the language.

The ANSI X3H2 Database Standards Committee added the UNIQUE constraint and gave us some help. This approach is called Declarative Referential Integrity (DRI), in contrast to enforcing data integrity via procedural code such as triggers, stored procedures and application code. A UNIQUE constraint can be one or more columns. You can have as many UNIQUE constraints on one table as you wish, but only one PRIMARY KEY. A PRIMARY KEY acts like a UNIQUE NOT NULL constraint, and it is the default for foreign key references. But this does not imply that UNIQUE carries a NOT NULL constraint with it. We debated this point in the committee for a while.

There are two kinds of equivalence relations in SQL. The first one is simply a binary “equal” or = as defined for each data type. The only surprises are the rules for VARCHAR(n) data types. Standard SQL is case-sensitive and has various collations which can get elaborate, but the rule for equality is that the shorter string is padded with blanks then compared position by position with the longer string — that means (‘Smith’ = ‘Smith ‘) is TRUE. Not all programming languages work this way.

The second kind of equivalence relation is grouping, as in GROUP BY. This clause reduces subsets to single values based on equality; it additionally puts all the NULLs into one group, which makes sense with the way GROUP BY is used. Do you really want to have each NULL in its own group? A NULL should have one and only one meaning in the context of the attribute to which it belongs. The UNIQUE constraint is defined with grouping and not equality, so you can have one and only one NULL in its columns. Once you are in the SQL mindset, this makes perfect sense.
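To illustrate, here is a hypothetical table (the names are my invention, not from the original listing) in which SQL Server's UNIQUE constraint will accept one NULL and reject a second:

```sql
-- Illustrative sketch: a nullable column under a UNIQUE constraint.
CREATE TABLE Employees
(emp_name VARCHAR(25) NOT NULL PRIMARY KEY,
 badge_nbr INTEGER,        -- nullable: some employees have no badge yet
 UNIQUE (badge_nbr));      -- SQL Server treats NULL as one "group"

INSERT INTO Employees (emp_name, badge_nbr) VALUES ('John', NULL);  -- accepted
INSERT INTO Employees (emp_name, badge_nbr) VALUES ('Mary', NULL);  -- rejected: second NULL
```

All the NULLs fall into the same group, so only one of them can exist under the constraint.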

The bad news is that SQL Server can get around this model with indexes instead of constraints and allow multiple NULLs. This also leads to thinking that uniqueness is somehow related to indexes. In fact, traditional tree-structured indexes are far from the best approach with large amounts of data. Perfect hashing algorithms are a hot research topic in RDBMS, and require only one probe to determine uniqueness.

REFERENCES clauses

The real power of PRIMARY KEY and UNIQUE constraints is using the REFERENCES clause to enforce relationships among tables in the schema. But here is where you need to change your mindset again. A reference in RDBMS is not a pointer, a link or anything physical; it is an abstract concept. The idea is that a unique value appears once and only once in the schema, but it can be referenced in many places. Unfortunately, SQL Server repeats the physical data in the referencing (foreign key) tables, so it is hard to think of it correctly.

In fact, most new SQL programmers do not know that a UNIQUE constraint can be referenced just like a PRIMARY KEY. The only difference is the possibility of a NULL in the referenced table; this is one reason we allowed only one NULL in a UNIQUE constraint. A referencing row should reference one and only one row in the referenced table (say that three times fast!).

Another feature that beginners overlook is the multi-column foreign key. In particular, anyone who does not know what a key is will use IDENTITY to mimic a pointer, as if he were still in a 1970’s network database. Pointers are scalars, so the people who cling to them never develop an abstract view of data. Pointers are of one type, while keys can be all kinds of things; too much thinking, apparently! A classic example of this is a table like:
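The original listing is not reproduced here, so what follows is a sketch of the anti-pattern being described; every table and column name is an illustrative assumption:

```sql
-- Anti-pattern sketch: an auto-number mimicking a pointer instead of a key.
CREATE TABLE Cities
(city_id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY, -- not a relational key!
 city_name VARCHAR(25) NOT NULL,
 city_longitude DECIMAL(9,6) NOT NULL,
 city_latitude DECIMAL(9,6) NOT NULL);
```

The IDENTITY value records the order of physical insertion and says nothing about the entity itself.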

The city name might be unique, but the (city_longitude, city_latitude) pair certainly is. But that would mean having to think of the co-ordinate pair as a single atomic unit of unique data. So much easier to type two letters than to have to cut and paste long names, I guess. The proper design would be:
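A sketch of that design, using the same assumed names, with the co-ordinate pair declared as the key:

```sql
-- Sketch: the co-ordinate pair is a single atomic unit of unique data.
CREATE TABLE Cities
(city_longitude DECIMAL(9,6) NOT NULL,
 city_latitude DECIMAL(9,6) NOT NULL,
 city_name VARCHAR(25) NOT NULL,
 PRIMARY KEY (city_longitude, city_latitude)); -- a two-column relational key
```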

Now other tables can reference my list of cities. My GPS system which I use to locate bars might have this structure:
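A sketch of such a referencing table, again with assumed names; note that the whole two-column key is referenced as a unit:

```sql
-- Sketch: a multi-column foreign key back to the Cities table.
CREATE TABLE Bars
(bar_name VARCHAR(35) NOT NULL PRIMARY KEY,
 city_longitude DECIMAL(9,6) NOT NULL,
 city_latitude DECIMAL(9,6) NOT NULL,
 FOREIGN KEY (city_longitude, city_latitude)           -- both columns together
   REFERENCES Cities (city_longitude, city_latitude)); -- reference the pair
```

A bar cannot be recorded in a city that does not exist in the Cities table.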

DRI Actions

So far, violating a reference, such as trying to add a bar whose co-ordinates are not in my list, has returned an exception and done a ROLLBACK. But DRI can also take actions. When a change is made to the referenced table, the SQL engine can automatically make changes to the referencing tables for you. The two most useful options are ON DELETE CASCADE and ON UPDATE CASCADE.

The ON DELETE CASCADE option will remove all the rows that reference the deleted rows in the referenced table. For example, if I delete an order, I would want all of the order details to disappear, too. If I do not do this clean-up, the order items will be orphans, floating around and messing up my reports, eating up storage space and in general making a mess.

The ON UPDATE CASCADE option will update all the rows that reference the updated rows in the referenced table. For example, if I change an item number in the Inventory, I would want all the order details to also reflect that change. If I do not make the change, then reports are wrong, and I have a non-existent item in my orders.
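Both options can be sketched on a hypothetical order-entry schema (all names here are assumptions, not from the original article):

```sql
-- Sketch: both DRI actions on one referencing table.
CREATE TABLE Inventory
(item_nbr INTEGER NOT NULL PRIMARY KEY,
 item_descr VARCHAR(50) NOT NULL);

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_date DATE NOT NULL);

CREATE TABLE OrderDetails
(order_nbr INTEGER NOT NULL
   REFERENCES Orders (order_nbr)
   ON DELETE CASCADE,     -- deleting an order removes its details
 item_nbr INTEGER NOT NULL
   REFERENCES Inventory (item_nbr)
   ON UPDATE CASCADE,     -- renumbering an item renumbers the details
 order_qty INTEGER NOT NULL CHECK (order_qty > 0),
 PRIMARY KEY (order_nbr, item_nbr));
```

The engine does the clean-up for you; no triggers or procedural code are needed.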

A possible problem is that I can have a cycle of cascaded actions if I am not careful. Imagine that Table A changes Tables B and C via a DRI action. Furthermore, the changes in Table B cascade to Table C via a DRI action. How does a row in Table C look after all of this? The first attempts at putting this feature into SQL were not good; the answer was non-deterministic.

At the other extreme, a table can reference itself, either on the same columns or on different ones. This can result in elaborate cycles of cascades that are difficult to understand. But it is useful for certain integrity checks. Let me use a history table for price changes. The fact is that a price has a duration; this is the nature of time and other continuums. So a basic history table looks like this:
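Since the original listing is missing, here is one version of the gap-free history idiom, sketched in T-SQL with assumed names; the self-reference is the FOREIGN KEY back into the same table:

```sql
-- Sketch: each price period must hook onto the end of its predecessor.
CREATE TABLE PriceHistory
(upc CHAR(13) NOT NULL,
 price_start_date DATE NOT NULL,
 price_end_date DATE,                 -- NULL means the current price
 prev_end_date DATE,                  -- NULL only for the first row of an item
 item_price DECIMAL(12,4) NOT NULL,
 PRIMARY KEY (upc, price_start_date),
 UNIQUE (upc, price_end_date),
 CHECK (price_start_date <= price_end_date),
 CHECK (price_start_date = DATEADD(DAY, 1, prev_end_date)), -- no gap allowed
 FOREIGN KEY (upc, prev_end_date)               -- the self-reference:
   REFERENCES PriceHistory (upc, price_end_date));
```

A new period can only start the day after an existing period ends, so the history stays contiguous.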

The self-reference will prevent gaps in the history. You just have to handle it in a stored procedure that updates the current data and inserts a new row when a price changes.

Overlapping Uniqueness Constraints

Uniqueness constraints can overlap. A common technique is to model class hierarchies with overlapping uniqueness constraints. As an example, let’s take the class of Vehicles, find an industry standard identifier (VIN), and add two mutually exclusive sub-classes, Sport utility vehicles and sedans (‘SUV’, ‘SED’).
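A sketch of the base table, with illustrative names; the point to notice is the UNIQUE (vin, vehicle_type) constraint that overlaps the PRIMARY KEY on vin:

```sql
-- Sketch: the super-class table with an overlapping candidate key.
CREATE TABLE Vehicles
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL
   CHECK (vehicle_type IN ('SUV', 'SED')),
 UNIQUE (vin, vehicle_type));  -- overlaps the PRIMARY KEY; referenced by sub-classes
```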

Notice the overlapping candidate keys. I then use a compound key (vin, vehicle_type) and a constraint in each sub-class table to assure that the vehicle_type is locked and agrees with the Vehicles table. Add some DRI actions and you are done:
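A sketch of the two sub-class tables under the same assumed names:

```sql
-- Sketch: each sub-class locks its type and references the super-class pair.
CREATE TABLE SUVs
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL DEFAULT 'SUV'
   CHECK (vehicle_type = 'SUV'),  -- locked to agree with Vehicles
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles (vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE);

CREATE TABLE Sedans
(vin CHAR(17) NOT NULL PRIMARY KEY,
 vehicle_type CHAR(3) NOT NULL DEFAULT 'SED'
   CHECK (vehicle_type = 'SED'),  -- locked to agree with Vehicles
 UNIQUE (vin, vehicle_type),
 FOREIGN KEY (vin, vehicle_type)
   REFERENCES Vehicles (vin, vehicle_type)
   ON UPDATE CASCADE
   ON DELETE CASCADE);
```

A vehicle cannot appear in the SUVs table unless the Vehicles table already says it is an ‘SUV’, and vice versa.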

Yes, there is redundancy between the two keys, but many SQL products will construct separate access structures that give the optimizer options. I can continue to build a hierarchy as deep as I wish by following this pattern. If a vehicle can be of more than one type, you can use “CHECK (vehicle_type IN (<list>))” instead. A better example of overlapping unique constraints would involve a typical class schedule for a school.

The rules we want to enforce are:

1) A teacher is in only one room each period.
2) A teacher teaches only one class each period.
3) A room has only one class each period.
4) A room has only one teacher in it each period.

The ideas are simple and clearly deal with uniqueness. One answer is to have two overlapping UNIQUE() constraints that share the period.
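A sketch of such a table (names are assumptions); the first constraint enforces rules 1 and 2, the second enforces rules 3 and 4, and the two overlap on the period column:

```sql
-- Sketch: two overlapping UNIQUE constraints sharing period_nbr.
CREATE TABLE ClassSchedule
(teacher_name VARCHAR(25) NOT NULL,
 class_name VARCHAR(25) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, period_nbr),  -- rules 1 and 2: one row per teacher per period
 UNIQUE (room_nbr, period_nbr));     -- rules 3 and 4: one row per room per period
```

Each row carries one teacher, one class and one room, so one row per (teacher, period) and one row per (room, period) covers all four rules declaratively.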

In closing, I invite you to look at a puzzle posted on this website by Lionel Clarke. You are given two tables which have two columns each. The first table has a unique constraint on each column. The second table has one unique constraint on the pair of columns. The goal is to insert rows from one table to the other in one statement.

There are six possible subsets of five rows each in the Source, so this is trickier than it looks. The answers there will give you some ideas for using the power of uniqueness for data integrity.