UNIQUE Constraints in SQL

Here is an in-depth look at an underused constraint, UNIQUE, that can increase the performance of queries and protect data integrity.

The Relational Model did not pop out of Dr. Codd’s head all at once. And SQL has been changing since 1986, when the first ANSI/ISO Standard was issued. My keynote speech at an SQL conference in Prague in 2012 was on “Our Enemy, The Punch Card” and the theme was that in order to do good SQL, we had to overcome a mindset based on finding data by a physical location in a sequential file. This is where keys came in, but let me get to that in a minute.

Dr. Codd was just as trapped in the world of punch cards, tape files and ISAM as the rest of us. The term “key” meant the sort key of a file that was required to do any reasonable processing with the original sequential file systems. A deck of punch cards is read in one and only one order; you cannot “back space” a card reader. The first magnetic tape drives mimicked that design and did not allow bi-directional access. In fact, the original Sybase SQL Server required that cursors had to “re-wind” a table to the start and move forward to read a prior row. Yes, this was much faster on a disk drive than a tape drive, but it was the same mindset.

The model for cursors in ANSI/ISO Standard SQL was directly copied from the IBM magnetic tape file system of the day. The idea was simple; make it possible to replace COBOL and FORTRAN tape routines with SQL commands, one for one. The IDENTITY table property (not column!) was also taken from the IBM tape file model. You cannot write to a tape in parallel, so each tape record gets its record (position) number from a physical queue. When you wanted to delete a tape record, you would set a bit flag at the start of the record. The system knew to skip over it until eventually a utility program could do some garbage collection and make a next-generation copy of the data without the deleted records. Whether or not to re-number the new sequential file depended on how the record number had been used by the application.

This is why we have PRIMARY KEY in SQL and why a table must have one to be properly constructed. Dr. Codd had introduced the idea of a “candidate key”, but it was a while before someone asked why we should have to anoint one of the candidates as the primary one. The physical reason was the sequential file systems that early RDBMS products used under the covers. Our enemy, the punch card, shows up again! But there was no mathematical reason, and Dr. Codd later dropped the concept.

But SQL did not drop it and we still have it today. It is one of many legacies that punch cards have given us. At one point, ANSI X3H2 considered making it a requirement in SQL, but did not. This lets us declare a “staging table” inside a schema without a key. Instead, we picked up the UNIQUE constraint. The bad news is that UNIQUE is not well-understood.

Equivalence Relations, UNIQUEs and NULLs

The definition of a PRIMARY KEY constraint is that it is logically equivalent to “UNIQUE NOT NULL” on the column or columns involved. But a table can have only one PRIMARY KEY, as per our “punch card legacy” tradition. While a little redundant, you can write “NOT NULL PRIMARY KEY” for documentation and safety. If the primary key constraint is dropped, then the NOT NULL is left behind and still holds true.

In mathematics, an equivalence relation takes a set of values and partitions them into disjoint subsets or equivalence classes. The most common example is, of course, plain old vanilla equality, where each number is its own class. A bit fancier are the MOD(x, n) functions, which partition the infinitely many integers into n classes identified by the remainders 0 through (n-1).

SQL has “plain old vanilla equality”, but with a twist: NULLs do not compare under equality, so each NULL is a class by itself. By this time, you have done a query with a GROUP BY clause and seen that all the NULLs are put into the same grouping. What is happening here? Grouping is not equality; it is another equivalence relation. Some parts of SQL use equality and some use grouping. The current ANSI/ISO Standard added the verbose “IS [NOT] DISTINCT FROM” comparison operator to test for equal values or for a pair of NULLs — a test based on grouping rather than equality.

UNIQUE is based on grouping. Here is a quick test:
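
(Since the original code is not shown, this is a minimal sketch with invented names, using SQL Server’s reading of UNIQUE.)

CREATE TABLE Quick_Test
(x INTEGER UNIQUE);

INSERT INTO Quick_Test VALUES (1);     -- succeeds
INSERT INTO Quick_Test VALUES (2);     -- succeeds
INSERT INTO Quick_Test VALUES (NULL);  -- succeeds
INSERT INTO Quick_Test VALUES (NULL);  -- fails in SQL Server; the NULLs group together

Note that other products follow the Standard more literally here and allow multiple NULLs under a UNIQUE constraint, so check your own platform.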

Remember that while NULLs cannot compare to anything, even themselves, they do group together. Let’s change this around a bit and use a two-column key. This means that the combination of the columns is unique, which is not quite the same thing.
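
A sketch of the two-column test, again with invented names; the last insertion fails because its (a, b) pair duplicates the earlier (2, 1):

CREATE TABLE Pair_Test
(a INTEGER,
 b INTEGER,
 UNIQUE (a, b));

INSERT INTO Pair_Test VALUES (NULL, 1);
INSERT INTO Pair_Test VALUES (2, 1);
INSERT INTO Pair_Test VALUES (1, 1);
INSERT INTO Pair_Test VALUES (2, 1);  -- rejected as a duplicate pair

The attempted rows are: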

a      b
-----  -----
NULL   1
2      1
1      1
2      1

Uniqueness is a handy thing, but its real power comes from Declarative Referential Integrity constraints. 

REFERENCES Constraints

Now for a quick lesson on data modeling. A strong entity is one that exists on its own merit. A weak entity is one that exists because of a strong entity. The classic example is a sales order; the order header is strong and the order details are weak. If the order is dropped, then all the order details should disappear. Referential integrity is a relational property that assures that each weak entity (a row in the referencing table) has a strong entity (a row in the referenced table).

In the original SQL implementations, we had to use TRIGGERs to assure that referential integrity was maintained. This did not work so well, but it was what we had; it was the way we had always done it in the world of punch cards and tape files. In fact, we could load a lot of code into a trigger and have it fire when a “database event” (insertion, update or deletion) occurred. There is no way for an optimizer to read procedural code and do anything with it. Yet, when we looked at the way we were using them, over 90% of our triggers existed just to maintain referential integrity.

When you forgot to write the triggers, the weak entities without a strong entity were called “orphans”, and they could accumulate. This wasted space, obviously, but orphans also have to be cleaned out before you can trust the data. An improperly designed schema can choke as it tries to get back its integrity.

By now, every SQL programmer has seen a REFERENCES constraint. The REFERENCES clause can be optimized, unlike the triggers! My analogy is that TRIGGERs were like controlling a vehicle with a different, independent steering device on every wheel, but the best way to drive is to lock the back wheels and synchronize the two front wheels with a single control. Less flexibility can mean more power and control; if it does 90% or more of your work, then life is good.

REFERENCES is under-used and two of its little-known features are:

  1. REFERENCES can refer to the same table in which it appears. This is a little tricky, but useful, and it is why we use the terms “referenced” and “referencing” tables. In the old days of network databases we had “parent” and “child” files, which had to be different, used pointers that linked in only one direction, and had some other differences.
  2. REFERENCES can refer to a UNIQUE constraint, not just a PRIMARY KEY. The columns have to be “union compatible” (i.e. columns in both tables are of the same data types in the same positions in the list).

Perhaps some dummy code will help:
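
(The article’s code is not reproduced here, so what follows is a sketch; the table and column names are my inventions, reused for the rest of this section.)

CREATE TABLE Sales_Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 customer_name VARCHAR(35) NOT NULL);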

Now the weak entity table.
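
(Continuing the sketch, with the cascade clauses discussed below:)

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
   REFERENCES Sales_Orders (order_nbr)
   ON UPDATE CASCADE
   ON DELETE CASCADE,
 sku INTEGER NOT NULL,
 order_qty INTEGER NOT NULL,
 PRIMARY KEY (order_nbr, sku));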

If I reference a table without naming columns, the primary key of that table is assumed by default, but it is better programming practice to state the column names explicitly. This example will watch for changes in Sales_Orders and apply the matching updates or deletes to the Order_Details table. If I did not have these clauses, an UPDATE or DELETE might fail. When I try:
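
(Invented values; the first statement puts order #1 in place, and the second is the attempt in question.)

INSERT INTO Sales_Orders VALUES (1, 'Fred Flintstone');
INSERT INTO Sales_Orders VALUES (1, 'Barney Rubble');  -- same order_nbr again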

…it fails because of the PRIMARY KEY violation. But I can do this without any trouble:
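
INSERT INTO Order_Details VALUES (1, 2675, 3);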

… and when I try to follow it with …
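
INSERT INTO Order_Details VALUES (9999, 2675, 3);  -- there is no order #9999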

…it fails the Declarative Referential Integrity (DRI) check. You can update the strong entity, Sales_Orders, and, invisibly to you, the Order_Details rows will get the change to the order_nbr. By invisible, I mean that you do not get a message about the cascade.
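
(For example:)

UPDATE Sales_Orders
   SET order_nbr = 100
 WHERE order_nbr = 1;
-- the matching Order_Details rows now silently read order #100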

This delete will also perform invisibly to you, flushing the matching Order_Details rows for that order_nbr:
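
DELETE FROM Sales_Orders
 WHERE order_nbr = 100;
-- the matching Order_Details rows are silently removed as well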

Now let’s try the schema with UNIQUE and NULL-able columns.
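
Since the original code is not shown, here is one possible recasting, with invented names; the point is a FOREIGN KEY aimed at a UNIQUE constraint that contains a NULL-able column:

CREATE TABLE Sales_Orders_2
(order_nbr INTEGER NOT NULL,
 item_color VARCHAR(10),  -- NULL-able on purpose
 UNIQUE (order_nbr, item_color));

CREATE TABLE Order_Details_2
(order_nbr INTEGER NOT NULL,
 item_color VARCHAR(10),
 sku INTEGER NOT NULL,
 FOREIGN KEY (order_nbr, item_color)
   REFERENCES Sales_Orders_2 (order_nbr, item_color)
   ON UPDATE CASCADE
   ON DELETE CASCADE);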

While this would not make sense in this example (what color NULL do you want to buy?), you can load the Details table with:
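
INSERT INTO Sales_Orders_2 VALUES (1, NULL);
INSERT INTO Order_Details_2 VALUES (1, NULL, 2675);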

And if you try this, the NULL is treated just as a value would have been:
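
(One possible statement; I am taking the text’s word for the grouping behavior here, so verify it on your own product.)

UPDATE Sales_Orders_2
   SET item_color = 'red'
 WHERE order_nbr = 1
   AND item_color IS NULL;
-- per the text, the cascade carries the change to the matching
-- detail row, grouping the NULLs as if they were equal values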

Since a PRIMARY KEY has to be NOT NULL by definition, this is a little weird. Among the options in the ON [DELETE | UPDATE] clause are SET NULL and SET DEFAULT, which set the referencing columns to NULL or to their default values.

This model of referential integrity is deliberately simple. We rejected some elaborations that would have used a “quorum” model (you must have at least (n) out of a list of (m) referencing tables) or special syntax for self-references. This does not mean that we cannot get fancy.

In full ANSI/ISO Standard SQL, we have the CREATE ASSERTION statement, which is a CHECK() constraint applied to the whole schema. Unlike a simple CHECK() clause, an assertion can reference multiple tables in its CHECK() predicate.
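
For example, an assertion could require every order to have at least one detail row. This is Standard SQL, not T-SQL, and it reuses the invented names from the earlier sketch:

CREATE ASSERTION Orders_Have_Details
CHECK (NOT EXISTS
       (SELECT *
          FROM Sales_Orders AS O
         WHERE NOT EXISTS
               (SELECT *
                  FROM Order_Details AS D
                 WHERE D.order_nbr = O.order_nbr)));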

We do not have this feature in T-SQL yet. The UNIQUE constraint is the only multi-table logical tool we have. Luckily, we can do a few things with it.

Multiple UNIQUEs

While you can have only one PRIMARY KEY, there is no limit on the number of UNIQUE constraints per table. Again, it is easier to show with an example than with a narrative. You are given a table of Students and a table of Faculty, each with an associated email address. The immediate thought is a skeleton like this:
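
(A sketch; the names and sizes are my inventions.)

CREATE TABLE Students
(student_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL UNIQUE);

CREATE TABLE Faculty
(teacher_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL UNIQUE);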

But this allows an email to belong to both a student and a teacher. Let’s assume that we want to keep the emails disjoint. You might want to put the emails into one table keyed on the email:
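
A sketch of that design; the skeleton tables are recut to reference the new Emails table:

CREATE TABLE Emails
(email VARCHAR(30) NOT NULL PRIMARY KEY);

CREATE TABLE Students
(student_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL
   REFERENCES Emails (email));

CREATE TABLE Faculty
(teacher_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL
   REFERENCES Emails (email));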

These four insertions work just fine:
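
INSERT INTO Emails VALUES ('able@school.edu');
INSERT INTO Emails VALUES ('baker@school.edu');
INSERT INTO Students VALUES ('Able', 'able@school.edu');
INSERT INTO Faculty VALUES ('Baker', 'baker@school.edu');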

But when I try to add myself and my email as either a student or a faculty member, I get failures:
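
(Invented values again; the email has not been registered in Emails.)

INSERT INTO Students VALUES ('Celko', 'celko@school.edu');  -- fails; email is not in Emails
INSERT INTO Faculty VALUES ('Celko', 'celko@school.edu');   -- fails the same way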

But if I do these two insertions, we have no problems: I have no trouble becoming both a student and a faculty member with the same email. Not what I meant to do.
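
(Assuming the email is first registered in Emails:)

INSERT INTO Emails VALUES ('celko@school.edu');
-- then the two insertions both go through
INSERT INTO Students VALUES ('Celko', 'celko@school.edu');
INSERT INTO Faculty VALUES ('Celko', 'celko@school.edu');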

Here is a second shot at it. First, we need some skeleton tables for the students and faculty; they are obvious entities. Emails are clearly entities, too. We have decided that we have two types of emails, S = student and F = faculty. If we decide that an email is required, then we can show the relationship of faculty to an email, and student to an email.
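
Here is a sketch of the email table, with the UNIQUE constraint that the foreign keys will aim at:

CREATE TABLE Emails
(email VARCHAR(30) NOT NULL PRIMARY KEY,
 email_type CHAR(1) NOT NULL
   CHECK (email_type IN ('S', 'F')),
 UNIQUE (email, email_type));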

We need to load the email directory first.
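
INSERT INTO Emails VALUES ('able@school.edu', 'S');
INSERT INTO Emails VALUES ('baker@school.edu', 'F');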

The Students table has a hard-wired email type of ‘S’ in its rows and Faculty uses ‘F’, keeping the emails in two disjoint subsets:
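
(Continuing the sketch:)

CREATE TABLE Students
(student_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL,
 email_type CHAR(1) NOT NULL DEFAULT 'S'
   CHECK (email_type = 'S'),
 FOREIGN KEY (email, email_type)
   REFERENCES Emails (email, email_type));

CREATE TABLE Faculty
(teacher_name VARCHAR(20) NOT NULL PRIMARY KEY,
 email VARCHAR(30) NOT NULL,
 email_type CHAR(1) NOT NULL DEFAULT 'F'
   CHECK (email_type = 'F'),
 FOREIGN KEY (email, email_type)
   REFERENCES Emails (email, email_type));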

Here are some insertions. The second fails from a bad email; when we correct it, it can be inserted:
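
INSERT INTO Students (student_name, email)
VALUES ('Able', 'able@school.edu');        -- works; email_type defaults to 'S'

INSERT INTO Students (student_name, email)
VALUES ('Baker Jr.', 'baker@school.edu');  -- fails; that email is registered as 'F'

-- corrected: register a student email first, then insert
INSERT INTO Emails VALUES ('baker.jr@school.edu', 'S');
INSERT INTO Students (student_name, email)
VALUES ('Baker Jr.', 'baker.jr@school.edu');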

This same pattern will hold for Faculty emails, so I will not show it here. In practice, you would probably create a view with the available email addresses for the type of user trying to get an account. For multiple emails, we just add another 1:M relationship table.

Self-References

It is possible for the referenced table and the referencing table to be one and the same. This is not how the “parent-child” model from the old network databases worked, with its pointer chain structures.

It is again easier to give an example than to talk theory. Let’s model a history of some generic events, where we want to be sure that there are no gaps in the history. We will use the ISO half-open interval model of time, in which an interval includes its starting point but not its ending point. An on-going event uses a NULL as its ending time.
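
Here is a sketch of such a table; the names are my invention:

CREATE TABLE Event_History
(event_start_date DATE NOT NULL PRIMARY KEY,
 event_end_date DATE UNIQUE,  -- NULL means the event is still on-going
 CONSTRAINT started_before_ended
   CHECK (event_start_date < event_end_date),
 CONSTRAINT no_gaps_in_history
   FOREIGN KEY (event_start_date)
   REFERENCES Event_History (event_end_date));

Because the intervals are half-open, each event’s start date must be exactly the end date of some other event, which is what the self-referencing FOREIGN KEY enforces. In SQL Server, the UNIQUE constraint on event_end_date also limits you to one on-going event, since only one NULL is allowed.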

Create this table and try to add a new event. You will fail! The previous date has to come into existence as part of a whole prior row, and there is no such row yet; this is part of the set-oriented nature of SQL. I need to disable that constraint so I can put a “starter row” in the table. You can hide this in a stored procedure:
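
A T-SQL sketch, toggling the constraint with ALTER TABLE (the procedure name is my invention):

CREATE PROCEDURE Start_Event_History
  @new_start_date DATE,
  @new_end_date DATE
AS
BEGIN
  -- disable the self-reference long enough to insert the starter row
  ALTER TABLE Event_History NOCHECK CONSTRAINT no_gaps_in_history;
  INSERT INTO Event_History (event_start_date, event_end_date)
  VALUES (@new_start_date, @new_end_date);
  -- re-enable it; a plain CHECK does not re-validate the starter row,
  -- so the constraint is left untrusted for that one row
  ALTER TABLE Event_History CHECK CONSTRAINT no_gaps_in_history;
END;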

Overlapping Keys

One of the basic tricks in SQL is representing a many-to-many relationship. You create a third table that references the two (or more) tables involved by their primary keys. This third table has quite a few popular names taken from older network databases, such as “junction table” or “link table”, but they are incorrect; it is a relationship. For example, here are two skeleton tables:
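
(A sketch with invented sizes:)

CREATE TABLE Boys
(boy_name VARCHAR(20) NOT NULL PRIMARY KEY);

CREATE TABLE Girls
(girl_name VARCHAR(20) NOT NULL PRIMARY KEY);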

Yes, I know using names for a key is a bad practice, but doing so will make my examples easier to read. A relationship has a degree, which is the number of sets involved; participation by a set can be mandatory or optional; the cardinality can be one to one, one to many, or many to many. The relationship table is a table and therefore needs a key of its own.

The “Dance” table is a proper table, with a two-column key.
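
CREATE TABLE Dance
(boy_name VARCHAR(20) NOT NULL
   REFERENCES Boys (boy_name),
 girl_name VARCHAR(20) NOT NULL
   REFERENCES Girls (girl_name),
 PRIMARY KEY (boy_name, girl_name));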

The only restriction on the rows is that a pairing appears only once. Now, I want to make a rule that guys can dance with as many gals as they want, but the gals have to stick to one guy. The way I do this is to use a NOT NULL UNIQUE constraint on the girl_name column, which makes it a key. It’s a simple key because it is only one column, but it is also a nested key because it appears as a subset of the compound PRIMARY KEY.
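
A sketch of that version; the table name Danceboys is my invention:

CREATE TABLE Danceboys
(boy_name VARCHAR(20) NOT NULL
   REFERENCES Boys (boy_name),
 girl_name VARCHAR(20) NOT NULL UNIQUE
   REFERENCES Girls (girl_name),
 PRIMARY KEY (boy_name, girl_name));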

The ladies might want the same freedom and dance with a series of men:
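
(The same sketch, with the UNIQUE constraint moved to the boys:)

CREATE TABLE Dancegirls
(boy_name VARCHAR(20) NOT NULL UNIQUE
   REFERENCES Boys (boy_name),
 girl_name VARCHAR(20) NOT NULL
   REFERENCES Girls (girl_name),
 PRIMARY KEY (boy_name, girl_name));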

The Dancegirls table would permit these rows from our original set:
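
(Assuming an original set of couples with invented names:)

('Able', 'Carol')
('Baker', 'Carol')
('Charlie', 'Dora')

Each boy appears only once, but ‘Carol’ gets to dance with two different men.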

If we want to stop them changing dancing partners, we can use three keys.
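
(A sketch, with an invented table name:)

CREATE TABLE Dance_Couples
(boy_name VARCHAR(20) NOT NULL UNIQUE
   REFERENCES Boys (boy_name),
 girl_name VARCHAR(20) NOT NULL UNIQUE
   REFERENCES Girls (girl_name),
 PRIMARY KEY (boy_name, girl_name));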

Think about this table for a minute. The PRIMARY KEY is now redundant. If each boy appears only once in the table and each girl appears only once in the table, then each (boy_name, girl_name) pair can appear only once. From a theoretical viewpoint, I could drop the compound key and make either boy_name or girl_name the new primary key, or I could just leave them as UNIQUE NOT NULL keys. From a practical viewpoint, each of these keys will be an index. You could be trading storage space for access speed, but you should be careful about this.

SQL products and theory do not always match. Many products make the assumption that the PRIMARY KEY is in some way special in the data model and will be the way to access the table most of the time.

Many programmers are not aware that a FOREIGN KEY constraint can also reference any UNIQUE constraint in the same table or in another table. The following nightmare will give you an idea of the possibilities. The multiple column versions follow the same syntax.
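
Something like this deliberately tangled sketch, with all names invented:

CREATE TABLE Nightmare
(a INTEGER NOT NULL PRIMARY KEY,
 b INTEGER NOT NULL UNIQUE
   REFERENCES Nightmare (a),  -- a UNIQUE column referencing the PRIMARY KEY
 c INTEGER NOT NULL UNIQUE
   REFERENCES Nightmare (b),  -- another UNIQUE column, referencing the first one
 d INTEGER
   REFERENCES Nightmare (c)); -- a plain column referencing a UNIQUE column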

My favorite example of overlapping UNIQUE constraints is a teacher’s schedule kept in a table like this:
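
(A sketch; the column names are my guesses from the discussion that follows.)

CREATE TABLE Class_Schedule
(teacher_name VARCHAR(20) NOT NULL,
 class_name VARCHAR(20) NOT NULL,
 room_nbr INTEGER NOT NULL,
 class_period INTEGER NOT NULL,
 PRIMARY KEY (teacher_name, class_name, room_nbr, class_period));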

That choice of a primary key is the most obvious one — use all the columns. Typical rows would look like this:
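
(Invented sample data:)

('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 7)
('Ms. Shields', 'Calculus 101', 331, 6)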

The rules you want to enforce are:

  • A teacher is in only one room each period
  • A teacher teaches only one class each period
  • A room has only one class each period
  • A room has only one teacher in it each period.

Stop reading and see what you come up with for an answer.

You know that you have 24 ways to pick three objects from a set of four in an ordered sequence (permutation). If order does not matter, then you have a combination and only four subsets, all of which I have used in the UNIQUE constraints. Although column order is important in creating an index, you can ignore it for now and then worry about index tuning later.
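
Here is a sketch of that version, with one UNIQUE constraint per three-column subset:

CREATE TABLE Class_Schedule
(teacher_name VARCHAR(20) NOT NULL,
 class_name VARCHAR(20) NOT NULL,
 room_nbr INTEGER NOT NULL,
 class_period INTEGER NOT NULL,
 UNIQUE (teacher_name, class_name, room_nbr),
 UNIQUE (teacher_name, class_name, class_period),
 UNIQUE (teacher_name, room_nbr, class_period),
 UNIQUE (class_name, room_nbr, class_period),
 PRIMARY KEY (teacher_name, class_name, room_nbr, class_period));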

I probably want to drop the first PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?
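
Suppose the constraint on (teacher_name, class_name, class_period) is the one that goes. A sketch of the second version:

CREATE TABLE Class_Schedule_2
(teacher_name VARCHAR(20) NOT NULL,
 class_name VARCHAR(20) NOT NULL,
 room_nbr INTEGER NOT NULL,
 class_period INTEGER NOT NULL,
 UNIQUE (teacher_name, class_name, room_nbr),
 UNIQUE (teacher_name, room_nbr, class_period),
 UNIQUE (class_name, room_nbr, class_period));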

I can now insert these rows in the second version of the table:
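
INSERT INTO Class_Schedule_2 VALUES ('Mr. Celko', 'Database 101', 222, 6);
INSERT INTO Class_Schedule_2 VALUES ('Mr. Celko', 'Database 101', 223, 6);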

This gives me a very tough sixth period class load because I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:
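
INSERT INTO Class_Schedule_2 VALUES ('Ms. Cannon', 'Database 102', 223, 6);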

Ms. Cannon and I are both in room 223, trying to teach different classes at the same time. I think you get the idea that a relationship table is not simple. But if I know a class_period and teacher_name, then I can find a unique room_nbr. The same logic holds for room_nbr and class_period to determine the class_name.

The good news / bad news is that the UNIQUE constraints will have to create an index. We will gain both the extra data integrity and improved access speed. We will lose storage space and slow down inserts, updates and deletes.