Declarative SQL: Using UNIQUE Constraints

In SQL, you can express the logic of what you want to accomplish without spelling out the details of how the database should do it. Nowhere is this more powerful than in constraints. SQL is declarative, and Joe Celko demonstrates, in his introduction to Declarative SQL, how you can write portable code that performs well and executes some complex logic, merely by creating unique constraints.

Edited: 21 December 2015

In this series, there are …

A Bit of History

Back in the days of punch cards and magnetic tapes, the only possible ‘file’ structure was a physically sequential one. Usually the records (not rows!) in a ‘file’ were sorted on a subset of the fields (not columns!). In RDBMS, tables (not ‘files’!) have no ordering. You fetch a row using a key and then access the columns within each row by name, not by a relative position in a record. RDBMS is much more abstract.

But let’s go back to the early days of SQL. Nobody had a product built for a relational database from scratch. We had to build on the existing ‘file’ systems, and the IBM products in particular. Sequential ‘files’ (punch cards, tapes or disk ‘files’) depend on knowing the sorted order of the ‘file’, they use it to locate records; a physical read/write head in the hardware has to be moved in a physical ‘file’ by counting records. Think about trying to do random access on a mag tape. The tape drives cannot spin fast enough for this approach, but if you know if a record is ahead or behind your read/write head’s current position you can make this work. I can merge two or more tapes easily. Even disk ‘files’, back then, were sorted so that we could use ISAM (Indexed Sequential Access Method).

The PRIMARY KEY construct came directly from the ‘file’ sort key. This keyword told the early SQL engines about how to lay the data on the physical storage. This is why original Sybase/SQL Server used a clustered index by default on a PRIMARY KEY . Welcome to the ISAM model!

Even Dr. Codd fell into this mindset when SQL was being created. He initially did not realize that a key is a key by its nature and none of them are ‘more key-ish’ than any other key. Then we added hashing and other access methods to disk system;. By the time that Dr. Codd had dropped the PRIMARY KEY from the Relational Model, SQL was already set in its ways.

To be a valid relational table, you must have a key. If the table has only one key, then we like to declare it as the PRIMARY KEY for documentation: But the underlying ‘file’ systems had no requirement for sorting or for UNIQUE ness of records. In fact, a lot of machine time back then was spent on sorting and dropping duplicates records from the ‘files’. So SQL followed its roots and allows an SQL table to be a pile (slang for an SQL table without a key, but unlike a simple flat ‘file’, it has strong typing and perhaps constraints).

A key is a subset of columns of a table such that it is UNIQUE for every row in the table. This is straight out of a textbook. But a UNIQUE value does not make a data element subset a key. To be a PRIMARY KEY , subset cannot have any NULL values. We’ll get into what happens with a UNIQUE constraint when there are NULLs in some of the columns.

What is NOT a Key

You cannot use an IDENTITY table property as a key. It is not a column. It is a count of the physical insertion attempts to one table, on one disc in one schema. Not even successful inserts! A rollback does not re-set the internal counter on that table. In English this is like trying to identify your car by giving the insurance company parking space number you happen to be using today. They want the VIN number instead. Why? Because a VIN identifies the automobile itself no matter where is it is in the Universe.

The GUID and UUID also do not work. The ‘G’ stands for global and the ‘U’ stands for universal. They are intended to locate things that are external to the schema, somewhere in cyberspace. They are not by definition, attributes inside the table or the schema.

Programmers who use GUID, UUID, identity or other auto increment machine generated UNIQUE values are usually faking the old pointer chains and record numbers they had in ‘file’ systems from the 1970s or the early network databases. I call these things exposed physical locators or fake pointers.

UNIQUE and PRIMARY KEY Constraints

There are some important differences between UNIQUE and PRIMARY KEY constraints. There can be only one PRIMARY KEY per table but many UNIQUE columns. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can have a NULL in a column unless you explicitly add a NOT NULL constraint. Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to PRIMARY KEY to document the table and to be sure it stays there when the key changes.

‘File’ system programmers understand the concept of a PRIMARY KEY , but for the wrong reasons. Their mindset is a sequential ‘file’, which can have only one key because that key is used to determine the physical order of the records within the ‘file’. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which we will discuss later.

Super Keys

A super key is a key with too many columns in it. Somewhere in it, you can find a proper subset of columns that is already a key. So why would you want to waste resources on such a thing? In theory, it is a bad idea, but in practice it can be handy. If your SQL engine allows you to include non-indexed columns in an index, but carry them in the leaf nodes of the tree, you can get what is called a “covering index” for the queries. Instead of having to get the non-indexing columns from the base table, you already have it covered.

For example, we know that the model year of something is not going to be a key and serial numbers are keys. But if we almost always want to have the model year for other thing, we might use:

This has to be done with an index and not a UNIQUE constraint, which does not have the extra syntax. Do not confuse this with a multiple-column key.

NULLs in UNIQUE Constraints

SQL has two kinds of equivalence operation.

The first is a version of the usual equal (=) for numeric, string and temporal values. Well, not quite the usual operator because all data types are defined as NULL-able in SQL. The rule is that any theta operator (=, >, <, =>, =<., etc) returns UNKNOWN when applied to a NULL. This is why a NULL is not equal to anything, even itself.

The second equivalence operator is grouping, as in GROUP BY. We debated this in the original X3H2 Database Standards; should each NULL form its own group or not? We decided to put all the NULLs into one group. This principle appears is several places in SQL constructs, including the UNIQUE constraint.

Multiple-Column Keys

There is also a multiple-column form of the <UNIQUE specification> <column list>; it means that the combination of those columns is UNIQUE . It says nothing about each column separately. For the math majors, we can write some rules. Notice which way the implications go.

The multiple-column uniqueness has some basic flavors. Here is classification of them:

Co-ordinates

The most familiar coordinate system are (longitude, latitude) geographical data or spatial data with (x, y) or (x, y, z) grids. The defining characteristic of a coordinate system is that you need to have all the columns in subset to get any information. Each column is independent of the others.

Please do not think that the dimensions have to be spatial. In fact, any physical entity has to exist in time, as well as space.

Trees

The most familiar trees system are organizational charts, parts explosions and hierarchies The defining characteristic of a tree system is that you do not need to have all the columns in subset to get information. Consider the pair (branch_office , department_name ) as the key of a table. So we can have (‘Chicago’, ‘Accounting’), (‘Atlanta’, ‘Accounting’), (‘Chicago’, ‘HR’), etc. The branch_office column is information about where the company does business in itself.

Quorums

This is a weird situation. Given a set of (n) columns, we can make a key from a subset of any (k < n) columns. This is not quite the same as a super key, which has key and non-key members in its subset. There is a famous programming problem called the eight queens problem. Given a chessboard and eight queens, you are supposed to place the queens on the board in such a way that none of them capture any other queen. This means that each “board_rank ‘ (a thru h) has one and only one queen, and each ‘file’ (1 thru 8) also has one and only one queen. To locate a particular, I can use the usual (“board_rank “, ‘file’) pair, but because of the restrictions we could use (“board_rank ‘) or “board_file “) as a key.

One solution for the Eight Queens:

Any of the following statements will remove the same row. I only need 1 of 3 columns to locate a unique row.

Once you have the row out, you can try to replace it with these statements and results:

The constraints allow only one value for the board_file in this simple table. But the insertion cannot figure out missing value for “board_file ” automatically, nor can it be done in more general cases.

Reducing Redundant & Incomplete Keys

My favorite example is a teacher’s schedule kept in a skeleton table like this:

That choice of a primary key is the most obvious one — use all the columns. The uniqueness rules we want to enforce are simple to understand.

  1. A teacher is in only one room each period.
  2. A teacher teaches only one class each period.
  3. A room has only one class each period.
  4. A room has only one teacher in it each period.

A sample row might look like this. You get senior citizen bonus points if you remember “Room 222” from the last 1960’s.

Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list, thus.

We know that there are four ways to pick three things from a set of four things; it is called a combination.

I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?

I can now insert these rows in the second version of the table:

This gives me a very tough sixth period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:

Ms. Shields and I are both in room 223, trying to teach different classes at the same time. The trick is to combine the rules

If a teacher is in only one room each period, then given a period and a teacher I should be able to determine only one room, i.e. room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.

With the constraints that were provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:

These rows violate rule #1 and rule #2.

However, the UNIQUE constraints first provided in Schedule_2 do not capture this violation and will allow the rows to be entered.

The constraint …

… is checking the complete combination of teacher, room, and period, and since (‘Mr. Celko’, 222, 6) is different from (‘Mr. Celko’, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.

does not catch its associated rule either since (‘Mr. Celko’, ‘Database 101’, 6) is different from (‘Mr. Celko’, ‘Database 102’, 6), and so, Mr. Celko is able to teach more than one class during the same period, thus violating rule two. It seems that we would also be able to add the following row:

… which violates rules #3 and #4.

Try to imagine the task of enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL.

Conclusions

UNIQUE constraints can provide a great deal of conditional logic to maintain rules. These patterns are completely portable across relational databases: The indexes they create can be used by the optimizer to improve performance. But the real advantage is that they eliminate the need for procedural code in the database and the application layers.

In the second part of this article, we will discuss using the REFERENCES clause to further eliminate the need for procedural code in the database and the application layers.