Edited: 21 December 2015
In this series, there are …
- 1. Declarative SQL: Using UNIQUE Constraints
- 2. Declarative SQL: Using References
- 3. Declarative SQL: Using CHECK() & DEFAULT
A Bit of History
Back in the days of punch cards and magnetic tapes, the only possible ‘file’ structure was a physically sequential one. Usually the records (not rows!) in a ‘file’ were sorted on a subset of the fields (not columns!). In RDBMS, tables (not ‘files’!) have no ordering. You fetch a row using a key and then access the columns within each row by name, not by a relative position in a record. RDBMS is much more abstract.
But let’s go back to the early days of SQL. Nobody had a product built for a relational database from scratch. We had to build on the existing ‘file’ systems, and the IBM products in particular. Sequential ‘files’ (punch cards, tapes or disk ‘files’) depend on knowing the sorted order of the ‘file’ to locate records; a physical read/write head has to be moved through a physical ‘file’ by counting records. Think about trying to do random access on a mag tape: the tape drives cannot spin fast enough for this approach, but if you know whether a record is ahead of or behind your read/write head’s current position, you can make it work. I can merge two or more sorted tapes easily. Even disk ‘files’, back then, were sorted so that we could use ISAM (Indexed Sequential Access Method).
The PRIMARY KEY construct came directly from the ‘file’ sort key. This keyword told the early SQL engines how to lay the data out on physical storage. This is why the original Sybase/SQL Server used a clustered index by default on a PRIMARY KEY. Welcome to the ISAM model!
Even Dr. Codd fell into this mindset when SQL was being created. He initially did not realize that a key is a key by its nature, and that none of them is ‘more key-ish’ than any other. Then we added hashing and other access methods to disk systems. By the time Dr. Codd had dropped the PRIMARY KEY from the Relational Model, SQL was already set in its ways.
To be a valid relational table, you must have a key. If the table has only one key, then we like to declare it as the PRIMARY KEY for documentation. But the underlying ‘file’ systems had no requirement for sorting or for uniqueness of records. In fact, a lot of machine time back then was spent on sorting and dropping duplicate records from the ‘files’. So SQL followed its roots and allows an SQL table to be a pile (slang for an SQL table without a key; unlike a simple flat ‘file’, though, it has strong typing and perhaps constraints).
A key is a subset of columns of a table such that it is UNIQUE for every row in the table. This is straight out of a textbook. But a UNIQUE value does not make a data element subset a key. To be a PRIMARY KEY, the subset cannot have any NULL values. We’ll get into what happens with a UNIQUE constraint when there are NULLs in some of the columns.
What is NOT a Key
You cannot use an IDENTITY table property as a key. It is not a column. It is a count of the physical insertion attempts on one table, on one disk, in one schema. Not even successful inserts! A rollback does not reset the internal counter on that table. In English, this is like trying to identify your car by giving the insurance company the parking-space number you happen to be using today. They want the VIN instead. Why? Because a VIN identifies the automobile itself, no matter where it is in the Universe.
The GUID and UUID also do not work. The ‘G’ stands for global and the ‘U’ stands for universal. They are intended to locate things that are external to the schema, somewhere in cyberspace. They are not, by definition, attributes inside the table or the schema.
Programmers who use GUID, UUID, IDENTITY or other auto-incremented, machine-generated UNIQUE values are usually faking the old pointer chains and record numbers they had in ‘file’ systems from the 1970s or the early network databases. I call these things exposed physical locators or fake pointers.
UNIQUE and PRIMARY KEY Constraints
There are some important differences between UNIQUE and PRIMARY KEY constraints. There can be only one PRIMARY KEY per table, but many UNIQUE constraints. A PRIMARY KEY is automatically declared to have a NOT NULL constraint on it, but a UNIQUE column can contain NULLs unless you explicitly add a NOT NULL constraint. Adding the NOT NULL whenever possible is a good idea, as it makes the column into a proper relational key. I also add NOT NULL to a PRIMARY KEY to document the table and to be sure it stays there when the key changes.
‘File’ system programmers understand the concept of a PRIMARY KEY, but for the wrong reasons. Their mindset is a sequential ‘file’, which can have only one key because that key is used to determine the physical order of the records within the ‘file’. There is no ordering in a table; the term PRIMARY KEY in SQL has to do with defaults in referential actions, which we will discuss later.
Super Keys
A super key is a key with too many columns in it: somewhere in it, you can find a proper subset of columns that is already a key. So why would you want to waste resources on such a thing? In theory, it is a bad idea, but in practice it can be handy. If your SQL engine allows you to carry extra, non-key columns in the leaf nodes of an index, you can get what is called a “covering index” for your queries. Instead of having to fetch the non-key columns from the base table, you already have them covered.
For example, we know that the model year of something is not going to be a key, and that serial numbers are keys. But if we almost always want the model year along with the serial number, we might use:
CREATE UNIQUE NONCLUSTERED INDEX Serial_With_Year_Index
    ON Products (serial_nbr)   -- table name is illustrative
    INCLUDE (model_year);
This has to be done with an index and not a UNIQUE constraint, which does not have the extra syntax. Do not confuse this with a multiple-column key.
NULLs in UNIQUE Constraints
SQL has two kinds of equivalence operations.
The first is a version of the usual equality (=) for numeric, string and temporal values. Well, not quite the usual operator, because all data types in SQL are defined as NULL-able. The rule is that any theta operator (=, >, <, >=, <=, etc.) returns UNKNOWN when applied to a NULL. This is why a NULL is not equal to anything, even itself.
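This sketch, using SQLite from Python, shows the UNKNOWN result leaking out to the host language as a NULL:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# In SQL's three-valued logic, a comparison with NULL yields UNKNOWN;
# SQLite surfaces UNKNOWN to Python as None.
null_eq_null = conn.execute("SELECT NULL = NULL").fetchone()[0]
one_eq_null = conn.execute("SELECT 1 = NULL").fetchone()[0]
print(null_eq_null, one_eq_null)        # None None

# A WHERE clause keeps only the rows where its predicate is TRUE,
# so an UNKNOWN comparison quietly filters the row out.
rows = conn.execute("SELECT 'kept' WHERE NULL = NULL").fetchall()
print(rows)                             # []
```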
The second equivalence operation is grouping, as in GROUP BY. We debated this on the original X3H2 Database Standards Committee: should each NULL form its own group or not? We decided to put all the NULLs into one group. This principle appears in several places in SQL constructs, including the UNIQUE constraint.
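You can watch the grouping rule in action with a throwaway one-column table; this sketch uses SQLite from Python:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (grp CHAR(1))")   # illustrative throwaway table
conn.executemany("INSERT INTO T VALUES (?)",
                 [("a",), ("a",), (None,), (None,), (None,)])

# Although NULL = NULL is UNKNOWN, grouping treats all NULLs as one group.
groups = conn.execute(
    "SELECT grp, COUNT(*) FROM T GROUP BY grp ORDER BY grp").fetchall()
print(groups)          # [(None, 3), ('a', 2)] -- NULLs sort first in SQLite
```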
Multiple-Column Keys
There is also a multiple-column form of the <UNIQUE specification>, taking a column list; it means that the combination of those columns is UNIQUE. It says nothing about each column separately. For the math majors, we can write some rules. Notice which way the implications go.
UNIQUE(y, x) ⇒ UNIQUE(x, y) -- symmetry
UNIQUE(x) ⇒ UNIQUE(x, y) -- super key rule
UNIQUE(x) ∧ UNIQUE(y) ⇒ UNIQUE(x, y) -- all key rule
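Here is a quick sanity check of the super key rule, sketched with SQLite from Python (the tables R and S are throwaways): because x alone is UNIQUE, every (x, y) pair is automatically distinct, but the implication does not run backwards.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE R (x INTEGER NOT NULL UNIQUE, y INTEGER NOT NULL)")
conn.execute("INSERT INTO R VALUES (1, 10)")

# Super key rule: since x alone is UNIQUE, a duplicate x fails
# even when paired with a brand-new y.
try:
    conn.execute("INSERT INTO R VALUES (1, 99)")
    dup_allowed = True
except sqlite3.IntegrityError:
    dup_allowed = False
print(dup_allowed)     # False

# The other direction fails: UNIQUE(x, y) alone accepts repeated x values.
conn.execute(
    "CREATE TABLE S (x INTEGER NOT NULL, y INTEGER NOT NULL, UNIQUE (x, y))")
conn.execute("INSERT INTO S VALUES (1, 10)")
conn.execute("INSERT INTO S VALUES (1, 20)")   # same x, different y: accepted
s_count = conn.execute("SELECT COUNT(*) FROM S").fetchone()[0]
print(s_count)         # 2
```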
Multiple-column uniqueness comes in some basic flavors. Here is a classification of them:
Co-ordinates
The most familiar coordinate systems are (longitude, latitude) pairs for geographical data, or spatial data with (x, y) or (x, y, z) grids. The defining characteristic of a coordinate system is that you need all the columns in the subset to get any information. Each column is independent of the others.
Please do not think that the dimensions have to be spatial. In fact, any physical entity has to exist in time, as well as space.
Trees
The most familiar tree systems are organizational charts, parts explosions and other hierarchies. The defining characteristic of a tree system is that you do not need all the columns in the subset to get information. Consider the pair (branch_office, department_name) as the key of a table. So we can have (‘Chicago’, ‘Accounting’), (‘Atlanta’, ‘Accounting’), (‘Chicago’, ‘HR’), etc. The branch_office column is, by itself, information about where the company does business.
Quorums
This is a weird situation. Given a set of (n) columns, we can make a key from a subset of any (k < n) of them. This is not quite the same as a super key, which has key and non-key members in its subset. There is a famous programming problem called the eight queens problem. Given a chessboard and eight queens, you are supposed to place the queens on the board in such a way that no queen can capture any other. This means that each board_rank (‘a’ thru ‘h’) has one and only one queen, and each board_file (‘1’ thru ‘8’) also has one and only one queen. To locate a particular queen, I can use the usual (board_rank, board_file) pair, but because of the restrictions, we could use either (board_rank) or (board_file) alone as a key.
CREATE TABLE Chessboard
(queen_name CHAR(2) NOT NULL PRIMARY KEY
    CHECK (queen_name LIKE 'Q[1-8]'),
 board_rank CHAR(1)
    CHECK (board_rank BETWEEN 'a' AND 'h'),
 UNIQUE (board_rank),
 board_file CHAR(1)
    CHECK (board_file BETWEEN '1' AND '8'),
 UNIQUE (board_file),
 CHECK (board_file IS NOT NULL OR board_rank IS NOT NULL));
One solution for the Eight Queens:
INSERT INTO Chessboard
VALUES ('q1', 'a', '1'),
       ('q2', 'b', '3'),
       ('q3', 'c', '5'),
       ('q4', 'd', '7'),
       ('q5', 'e', '2'),
       ('q6', 'f', '8'),
       ('q7', 'g', '6'),
       ('q8', 'h', '4');
Any of the following statements will remove the same row; I need only one of the three columns to locate a unique row.
DELETE FROM Chessboard WHERE queen_name = 'q1';
DELETE FROM Chessboard WHERE board_rank = 'a';
DELETE FROM Chessboard WHERE board_file = '1';
Once you have the row out, you can try to replace it with these statements and results:
INSERT INTO Chessboard (queen_name)
VALUES ('q1'); -- fails with two NULLs
INSERT INTO Chessboard (queen_name, board_file)
VALUES ('q1', '2'); -- fails on board_file
INSERT INTO Chessboard (queen_name, board_file, board_rank)
VALUES ('q1', NULL, 'b'); -- fails on board_rank
INSERT INTO Chessboard (queen_name, board_file, board_rank)
VALUES ('q1', NULL, 'a'); -- works; NULL and uniqueness hold on the other keys
The constraints allow only one value for the board_file in this simple table. But the insertion cannot figure out the missing value of board_file automatically, nor can this be done in more general cases.
Reducing Redundant & Incomplete Keys
My favorite example is a teacher’s schedule kept in a skeleton table like this:
CREATE TABLE Schedule -- skeleton table. WRONG!
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr));
That choice of a primary key is the most obvious one — use all the columns. The uniqueness rules we want to enforce are simple to understand.
- A teacher is in only one room each period.
- A teacher teaches only one class each period.
- A room has only one class each period.
- A room has only one teacher in it each period.
A sample row might look like this. You get senior-citizen bonus points if you remember “Room 222” from the late 1960s.
('Mr. Celko', 'Database 101', 222, 6)
Stop reading and see what you come up with for an answer. Okay, now consider using one constraint for each rule in the list, thus.
CREATE TABLE Schedule_1 -- version one, still WRONG!
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1
 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2
 UNIQUE (class_title, room_nbr, period_nbr), -- rule #3
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #4
 PRIMARY KEY (teacher_name, class_title, room_nbr, period_nbr));
We know that there are four ways to pick three things from a set of four things; it is called a combination.
I could drop the PRIMARY KEY as redundant if I have all four of these constraints in place. But what happens if I drop the PRIMARY KEY and then one of the constraints?
CREATE TABLE Schedule_2 -- still wrong
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, room_nbr, period_nbr), -- rule #1
 UNIQUE (teacher_name, class_title, period_nbr), -- rule #2
 UNIQUE (class_title, room_nbr, period_nbr)); -- rule #3
I can now insert these rows in the second version of the table:
('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)
This gives me a very tough sixth period teaching load since I have to be in two different rooms at the same time. Things can get even worse when another teacher is added to the schedule:
('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)
('Ms. Shields', 'Database 101', 223, 6)
Ms. Shields and I are both in room 223, trying to teach different classes at the same time. The trick is to combine the rules:
CREATE TABLE Schedule_3 -- corrected version
(teacher_name VARCHAR(15) NOT NULL,
 class_title CHAR(15) NOT NULL,
 room_nbr INTEGER NOT NULL,
 period_nbr INTEGER NOT NULL,
 UNIQUE (teacher_name, period_nbr), -- rules #1 and #2
 UNIQUE (room_nbr, period_nbr)); -- rules #3 and #4
If a teacher is in only one room each period, then given a period and a teacher I should be able to determine only one room, i.e. room is functionally dependent upon the combination of teacher and period. Likewise, if a teacher teaches only one class each period, then class is functionally dependent upon the combination of teacher and period. The same thinking holds for the last two rules: class is functionally dependent upon the combination of room and period, and teacher is functionally dependent upon the combination of room and period.
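The corrected DDL really does enforce all four rules. Here is a quick check with SQLite from Python, replaying the problem rows from above against Schedule_3:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Schedule_3 -- corrected version, as in the article
    (teacher_name VARCHAR(15) NOT NULL,
     class_title CHAR(15) NOT NULL,
     room_nbr INTEGER NOT NULL,
     period_nbr INTEGER NOT NULL,
     UNIQUE (teacher_name, period_nbr),  -- rules #1 and #2
     UNIQUE (room_nbr, period_nbr))      -- rules #3 and #4
""")
conn.execute(
    "INSERT INTO Schedule_3 VALUES ('Mr. Celko', 'Database 101', 222, 6)")

# Rules #1/#2: the same teacher cannot appear twice in period 6.
try:
    conn.execute(
        "INSERT INTO Schedule_3 VALUES ('Mr. Celko', 'Database 102', 223, 6)")
    teacher_clash = True
except sqlite3.IntegrityError:
    teacher_clash = False

# Rules #3/#4: another teacher cannot take room 222 in period 6 either.
try:
    conn.execute(
        "INSERT INTO Schedule_3 VALUES ('Ms. Shields', 'Database 103', 222, 6)")
    room_clash = True
except sqlite3.IntegrityError:
    room_clash = False

print(teacher_clash, room_clash)   # False False
```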
With the constraints that were provided in the first version, you will find that the rules are not enforced. For example, I could enter the following rows:
('Mr. Celko', 'Database 101', 222, 6)
('Mr. Celko', 'Database 102', 223, 6)
These rows violate rule #1 and rule #2.
However, the UNIQUE constraints first provided in Schedule_2 do not capture this violation and will allow the rows to be entered.
The constraint …
UNIQUE (teacher_name, room_nbr, period_nbr)
… is checking the complete combination of teacher, room, and period, and since (‘Mr. Celko’, 222, 6) is different from (‘Mr. Celko’, 223, 6), the DDL does not find any problem with both rows being entered, even though that means that Mr. Celko is in more than one room during the same period.
UNIQUE (teacher_name, class_title, period_nbr)
does not catch its associated rule either since (‘Mr. Celko’, ‘Database 101’, 6) is different from (‘Mr. Celko’, ‘Database 102’, 6), and so, Mr. Celko is able to teach more than one class during the same period, thus violating rule two. It seems that we would also be able to add the following row:
('Ms. Shields', 'Database 103', 222, 6)
… which violates rules #3 and #4.
Try to imagine the task of enforcing this with procedural code. This is why I say that most of the work in SQL is done in the DDL.
Conclusions
UNIQUE constraints can provide a great deal of declarative logic to maintain business rules. These patterns are completely portable across relational databases, and the indexes they create can be used by the optimizer to improve performance. But the real advantage is that they eliminate the need for procedural code in the database and the application layers.
In the second part of this article, we will discuss using the REFERENCES clause to further eliminate the need for procedural code in the database and the application layers.