If you look at the postings on SQL forums, you will notice that people are sloppy. They expect that the reader can read their minds and figure out everything that’s not been posted. They seldom post DDL, and when they do, it is also wrong. I don’t mean a little bit wrong. I mean fundamentally wrong. RDBMS has the concepts of (Relational) keys, so DDL without at least one key means you don’t have a table at all. If all the columns are NULL
-able, they cannot ever be part of any key. This non-table is really a deck of 1960s punch cards hidden as an SQL table.
Dr. Codd defined a key as being a subset of columns in a table. A column is defined as an attribute of the entity that the table models. The IDENTITY
table property is left over from sequential file days is not an attribute of an entity; it is a metadata attribute of the table. The GUID suffers the same problem of not being a proper attribute of the entities within the table in which they appear. Yes, they are UNIQUE
, but they’re not attributes. Just sticking the constraint of PRIMARY
KEY
on them doesn’t avoid the fundamental problem. Novices forget that the “G” in GUID
stands for “global,” and they are not local. Such attributes are specifically not in the table! These things cannot be keys by definition.
What is a key?
The definition of the key:
1) A key must be a non-empty subset of columns in the table (attributes of the entity). This means it cannot be an external pointer or physical locator to rows in the table. As an analogy, when you identify an automobile, you don’t use a parking space number in some particular garage; you use the VIN.
2) A key must be unique within the table. This is necessary but not sufficient. In fact, the best keys are unique across the whole schema and then some. This is another reason to use VIN for automobiles; it’s useful to the DMV, your insurance company, and any other use you should have for that automobile. It can be validated by using a really ugly regular expression or verified by physically going to your car and looking at the chassis or the etchings on your windshield. Validation and verification of data is a slightly different topic, so I’ll just skip over them for now.
3) A key must be precise, not approximate or vague. You don’t think about it, but this automatically eliminates approximate numeric types as keys. Two floating-point numbers are considered equal if they differ by a small value called the epsilon. This means it two different floating-point numbers can be treated as equal. This need for precision is also the reason that NULLs
are not allowed in a key. Remember that a NULL
is not a value. I’ll get into some of the problems with NULLs
shortly.
A bit of history
When Dr. Codd created the relational model, data was still in file systems and a few early networks and hierarchical databases. Our mindset was still stuck in these pre-relational tools, so the good doctor came up with the idea that every table should have a PRIMARY
KEY
. It was defined as a key that we anointed as somehow special. A table could only have one of these (Wow! Just like the sort order in sequential files!).
In the early days of the SQL Standards, we considered requiring a PRIMARY
KEY
on a table but decided against it. The idea was that legacy data without keys (think mag tapes and punch cards) would be easier to move into SQL products without that constraint.
Shortly after that, Dr. Codd realized that a key is a key. There was no need to mark one as special. This is what led to the UNIQUE
constraint.
UNIQUE and PRIMARY KEY constraints
The UNIQUE
and PRIMARY
KEY
constraints say that no duplicate values are allowed in the column. It comes in two forms. The syntax is:
1 |
<UNIQUE specification> ::= UNIQUE | PRIMARY KEY |
There are some subtle differences between UNIQUE
and PRIMARY
KEY. There still can be only one PRIMARY
KEY
per table but many UNIQUE
columns. A PRIMARY
KEY
is automatically declared to have a NOT
NULL
constraint on it, but a UNIQUE
column can have a single NULL
in it unless you explicitly add a NOT
NULL
constraint. Adding the NOT
NULL
whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT
NULL
to PRIMARY
KEY
declarations to document the table and to be sure it stays there when the key changes.
File system programmers understand the concept of a PRIMARY
KEY, but for the wrong reasons. Their mindset is a sequential file, which can have only one key, because that key is used to determine the physical sort order of the records within the file. There is no ordering in a table; the term PRIMARY
KEY
in SQL has to do with defaults in referential actions, which is another topic.
A single NULL
is allowed in the UNIQUE
constraint to be able to match a NULL
as if it were a value. The same type of logic is applied to the GROUP
BY
clause so that each NULL
doesn’t become its own group. There’s a cute story about this decision that has to do with traffic tickets. Before SQL, someone got a prestige tag reading missing and was inundated by hundreds of traffic tickets because that’s what the police officers wrote for the missing auto tag everywhere in California.
Multi-column UNIQUE constraints
There is also a multiple-column form of the <UNIQUE specification>
, which is usually written at the end of the column declarations. It is a list of columns in parentheses after the appropriate keyword; it means that the combination of those columns is unique.
Broadly speaking, there are two patterns for multi-column keys. One pattern is like coordinates, where each attribute is strong and has meaning in its own right. For example, I might declare PRIMARY
KEY
(city, department) so I can be sure that although I have offices in many cities and many identical departments in those offices, there is only one personnel department in Chicago.
The other pattern is hierarchical: one attribute is subordinate to another attribute. A classic example is an order number and an order detail number subordinated to that particular order. This has to do with weak and strong entities, and that’s another topic.
Super keys
A super key is a key that is too big. It has more columns in it than are needed to identify an entity in a table. In practice, you generally don’t like these things. The indexes or hash tables or whatever your SQL engine uses to implement uniqueness will have to do extra work carrying the redundant attributes. The problem is it isn’t always obvious which attributes are redundant. Imagine a super key UNIQUE
(a, b, c, d)
In which any three of the four columns is also unique. Which possible combination(s) do you want to keep?
Overlapping keys
You can also have overlapping keys, as in UNIQUE (a, b, c)
and UNIQUE (b, c, d)
, which share columns (b, c)
.
Nested unique constraints
One of the basic tricks in SQL is representing a one-to-one or many-to-many relationship with a table that references the two (or more) entity tables related by their primary keys. This third table has several popular names such as “junction table,” “Associative Entity,” or “join table,” but we know that it is a relationship. The term “junction table” is a pointer structure from network databases, not part of an RDBMS. For example, given two tables for a Cafeteria database:
1 2 3 4 5 6 |
CREATE TABLE Entrees (entree_name VARCHAR(30) NOT NULL PRIMARY KEY ...); CREATE TABLE Sidedishes (sidedish_name VARCHAR(30) NOT NULL PRIMARY KEY, ... ); |
Yes, I know using names for a key is a bad practice, but it will make my examples easier to read. The simplest relationship table looks like this:
1 2 3 4 5 |
CREATE TABLE Meals (entree_name VARCHAR(30) NOT NULL REFERENCES Entrees (entree_name), sidedish_name VARCHAR(30) NOT NULL, REFERENCES Sidedishes(sidedish_name)); |
The Meals table allows creating simple one entree/one side dish meals insert rows like this:
1 2 3 4 5 6 |
INSERT INTO Meals VALUES ('Meat Loaf', 'Mashed Potatoes'), ('Meat Loaf', 'Green Beans'), ('Fried Chicken', 'Green Beans'), ('Meat Loaf', 'Mashed Potatoes'); |
Oops! ‘Meat Loaf’ is shown twice with ‘Mashed Potatoes’ because the Meals table does not have its own key. This is an easy mistake to make, but fixing it is not obvious.
1 2 3 4 5 6 |
CREATE TABLE Menu (entree_name VARCHAR(30) NOT NULL REFERENCES Entrees (entree_name), sidedish_name VARCHAR(30) NOT NULL, REFERENCES Sidedishes(sidedish_name), PRIMARY KEY (entree_name, sidedish_name)); -- compound key |
The Menu table gets rid of the duplicated rows and makes this a proper table. The PRIMARY
KEY
for the table is made up of two or more columns and is called a compound key because of that fact. These are valid rows now.
1 2 3 |
('Meat Loaf', 'Mashed Potatoes') ('Meat Loaf', 'Green Beans') ('Fried Chicken', 'Green Beans') |
But the only restriction on the Meals is that they appear only once. Every entree can be paired with every side dish. I want a rule that entrees can have as many side dishes as they want, but the side dishes have to stick to one entree.
The way I do this is to use a NOT
NULL
UNIQUE
constraint on the sidedish_name column
, which makes it a key. It is a simple key since it is only one column, but it is also a nested key because it appears as a subset of the compound PRIMARY
KEY
.
1 2 3 4 5 6 |
CREATE TABLE Plate_Combos (entree_name VARCHAR(30) NOT NULL REFERENCES Entrees (entree_name), sidedish_name VARCHAR(30) NOT NULL UNIQUE, -- nested key REFERENCES Sidedishes(sidedish_name), PRIMARY KEY (entree_name, sidedish_name)); -- compound key |
Plate_Combos is a proper table, without duplicated rows, but it also enforces the condition that each entree can have one or more side dishes.
1 2 |
('Meat Loaf', 'Mashed Potatoes') ('Meat Loaf', 'Green Beans') |
Instead, you might want the side dishes to be combined with more than one entree.
1 2 3 4 5 6 |
CREATE TABLE Single_Side_Plates (entree_name VARCHAR(30) NOT NULL UNIQUE -- nested key REFERENCES Entrees (entree_name), sidedish_name VARCHAR(30) NOT NULL, REFERENCES Sidedishes(sidedish_name), PRIMARY KEY (entree_name, sidedish_name)); -- compound key |
The Single_Side_Plates table would permit these rows from the original set.
1 2 |
('Meat Loaf', 'Green Beans') ('Fried Chicken', 'Green Beans') |
Think about all of these possible keys for a minute. The compound PRIMARY
KEY
is now redundant. If each entree appears only once in the table or each side dish appears only once in the table, then each (entree_name
, sidedish_name
) pair can appear only once. A rather dull option of a single entree with a single side dish can be enforced with this DDL:
1 2 3 4 5 6 |
CREATE TABLE Lunch_Plates (entree_name VARCHAR(30) NOT NULL UNIQUE -- nested key REFERENCES Entrees (entree_name), sidedish_name VARCHAR(30) NOT NULL UNIQUE -- nested key, REFERENCES Sidedishes(sidedish_name), PRIMARY KEY(entree_name, sidedish_name)); -- redundant compound key |
The Meals table allows inserting these rows from the original set.
1 2 |
('Meat Loaf', 'Mashed Potatoes') ('Fried Chicken', 'Green Beans') |
Making special provisions for the PRIMARY
KEY
in the SQL engine is not a bad assumption because the REFERENCES
clause uses the PRIMARY
KEY
of the referenced table as the default. Many new SQL programmers are unaware that a FOREIGN
KEY
constraint can also reference any UNIQUE
constraint in the same table or another table. Such things can get very complicated and require that you defer constraints. Again, that’s another topic.
Overlapping keys
Getting back to the nested keys, just how far can you go with them? My favorite example is a teacher’s schedule kept in a table like this (I am leaving off REFERENCES
clauses and CHECK()
constraints):
1 2 3 4 5 6 |
CREATE TABLE Class_Schedule -- skeleton table. WRONG! (teacher_name VARCHAR(15) NOT NULL, class_title CHAR(15) NOT NULL, room_nbr INTEGER NOT NULL, period_nbr INTEGER NOT NULL, PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr)); |
That choice of a PRIMARY
KEY
is the most obvious one — use all the columns. Typical rows would look like this:
1 |
('Mr. Celko', 'Database 101', 222, 6) |
The rules to be enforced are:
1) A teacher is in only one room each period.
2) A teacher teaches only one class each period.
3) A room has only one class each period.
4) A room has only one teacher in it each period.
Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list.
1 2 3 4 5 6 7 8 9 10 |
CREATE TABLE Class_Schedule_1 -- version one, still WRONG! (teacher_name VARCHAR(15) NOT NULL, class_title CHAR(15) NOT NULL, room_nbr INTEGER NOT NULL, period_nbr INTEGER NOT NULL, UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2 UNIQUE (class_title, room_nbr, period_nbr), -- rule #3 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #4 PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr)); |
There are four ways to pick three things from a set of four things; it is called a combination. I could drop the PRIMARY
KEY
as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY
KEY
and then one of the constraints?
1 2 3 4 5 6 7 8 |
CREATE TABLE Class_Schedule_2 -- still wrong (teacher_name VARCHAR(15) NOT NULL, class_title CHAR(15) NOT NULL, room_nbr INTEGER NOT NULL, period_nbr INTEGER NOT NULL, UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2 UNIQUE (class_title, room_nbr, period_nbr)); -- rule #3 |
I can now insert these rows in the second version of the table:
1 2 |
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6) |
This gives me a very tough sixth-period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:
1 2 3 |
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6) ('Ms. Shields', 'Database 101', 223, 6) |
Ms. Shields and I are both in room 223, trying to teach different classes at the same time. Matthew Burr looked at the constraints and rules and came up with this analysis.
1 2 3 4 5 6 7 |
CREATE TABLE Class_Schedule_3 -- corrected version (teacher_name VARCHAR(15) NOT NULL, class_title CHAR(15) NOT NULL, room_nbr INTEGER NOT NULL, period_nbr INTEGER NOT NULL, UNIQUE (teacher_name, period_nbr), -- rules #1 and #2 UNIQUE (room_nbr, period_nbr)); -- rules #3 and #4 |
If a teacher is in only one room each period, then given a period and a teacher, I should be able to determine only one room, i.e., the room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.
With the constraints provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:
1 2 |
('Mr. Celko', 'Database 101', 222, 6) ('Mr. Celko', 'Database 102', 223, 6) |
These rows violate rule #1 and rule #2.
However, the UNIQUE
constraints first provided in Class_Schedule_2 do not capture this violation and will allow the rows to be entered.
The constraint
1 |
UNIQUE (teacher_name, room_nbr, period_nbr) |
is checking the complete combination of teacher, room, and period. Since (‘Mr. Celko‘, 222, 6) is different from (‘Mr. Celko‘, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.
1 |
UNIQUE (teacher_name, class_title, period_nbr) |
doesn’t catch its associated rule either since (‘Mr. Celko‘, ‘Database 101‘, 6) is different from (‘Mr. Celko‘, ‘Database 102‘, 6). Mr. Celko can teach more than one class during the same period, thus violating rule two. It seems that it’s also possible to add the following row:
1 |
('Ms. Shields', 'Database 103', 222, 6) |
which violates rules #3 and #4.
Conclusion
Try to imagine enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL. You will not always need complicated, nested, overlapping uniqueness constraints, but they can be incredibly powerful when you do.
Load comments