If you’ve ever had a traditional logic course, you’ll have run into “The Law of Identity” as the founding principle of all Western thought. It says that:
“To be is to be something in particular; to be nothing in particular or anything in general, is to be nothing at all.”
This usually gets expressed as “A = A” thanks to Ayn Rand’s Objectivism and credited to Aristotle. You want to dig a little deeper, however, it actually predates Aristotle’s laws of logic.
More to the point, this is why we have keys in the relational data model and why those keys give an identity to each row in a table which models an entity in a set.
Later Gottfried Leibniz, whom you might remember as the “other guy” who invented calculus, held that Entity x is the same as Entity y if and only if every predicate true of x is true of y as well. What it means for practical purposes for us RDBMS people is that you should not have two or more tables with exactly the same structure. One set of entities should have one and only one table to model it.
This can be a little more subtle than you might think. There is a design flaw called attribute splitting, in which has defined extra tables based on the values of an attribute. For example, you would never define your primary human resources table based on the type of phone they use: iPhone_Using_Persons
and another table for Android_Using_Persons
. The type of phone an employee uses is clearly an attribute and this design have splits what should be one table into two based on the value of an attribute, and not on attribute itself. And then what if another phone type emerges? New table?
There are caveats of course. You might need a table such as this to record specific attributes of that person, but that is a different topic. It would not be the main Persons table.
Concept of Keys in RDBMS
In RDBMS, we establish uniqueness of the rows in the table by using a key. A key is a subset of the columns of that table which is unique in that table. Please note that by definition, a key cannot be a table property, like IDENTITY
or a GUID
. A key must be made up of attributes, which means actual table columns.
IDENTITY
is a table property, much like the parking space number in a garage. It is not an attribute of the entity being modeled by the table (like the VIN number burned into the engine block), but a value derived from the physical storage of the data. Trust me, no automobile, employee, piece of merchandise, or anything else you’re trying to represent with the table came with some IDENTITY
tattooed on it or as an innate attribute. Just different parking garages can have spaces with the same numbers, so totally unrelated tables (even in different schemas!) can have the same value in an IDENTITY
column.
There’s only one IDENTITY
allowed per a table. If it were an attribute, it might be possible for an entity to have more than one of that same kind of attribute. For example, a table that models automobile trips can have ZIP codes for both the departure and arrival locations. Attributes also don’t have to be all the same data type, but the IDENTITY
column is always a whole number (typically using an integer type of some sort, but can also use a numeric type.) The Microsoft implementation is also issued sequentially. This sequence is based on the order in which the physical data is inserted into the physical table. This is why I like to use that parking garage analogy.
The bad news is that I can take the same raw data, declare the same table structure to model the same entity on two different machines, shuffle the input a little bit, and then do one or more INSERT INTO
statements. The IDENTITY
columns which are supposed to represent attributes of the same entity in each copy of the same data will be different. IDENTITY
is not an intrinsic property of the entity, but of the storage!
The IDENTITY
feature exists because the first SQL engines were built on top of existing hardware and software. UNIX file systems were based on magnetic tape files which have to have sequential access by their physical nature. Indexing came along a little bit later and was put on top of this file model. The world might have been different if file systems had started off with hashing, but that would be rather complicated.
Dr. Codd and The Primary Key
In the original papers of RDBMS, Dr. Codd introduced the concept of keys and in particular the concept of a “primary key” which was dropped later. Keys were a part of the relational model from the start, but the primary key was supposed to be special in some way. And there was even a convention of underlying the column names that make up the primary key in the data declarations in the early papers. This was a holdover from sequential file days. Obviously, a physical file can only be sorted one way. File searches, particularly those on mag tape, depended on having a sorted order.
By definition, keys had to be all non-NULL
values; remember that a NULL
is not technically a value. But someone looked around and realized that the relational model was based on sets, which have no ordering. All keys have equal “key-ness” by their nature, and nobody is special.
Another construct in SQL was the UNIQUE
constraint. It explains itself, but it also permits a single NULL
value in each column. Remember that SQL has two models for comparisons. The first method is regular equality which compares all the NULL
s as UNKNOWN
with any values or other NULL
s.
The second method puts all the NULL
s Into a single group for purposes of the GROUP BY and a few other minor operators. We currently have added the infixed comparison operator:
1 |
IS [NOT] DISTINCT FROM |
which uses the grouping method, not equality. This means that it always returns TRUE
or FALSE
, so we don’t have to worry about comparing a NULL
to a value or another NULL
, and getting an UNKNOWN
result.
Basically, PRIMARY KEY (c1,c2,..cn) ⇒ UNIQUE (c1,c2,..cn) NOT NULL
. In fact, the BNF for these two constructs are part of the same production in the SQL grammar.
A foreign key is not a key in the referencing table but is it a key in a second (possibly the original) referenced table. The correct terms in SQL and RDBMS are not “parent” and “child” those terms are how we used, to identify records in network databases. The correct terms in RDBMS are “referenced” and “referencing” tables. Typically, in network databases, such relationships were implemented with pointer chains. Find an old timer and ask him about IDMS, TOTAL, Image, IMS and other early database products
Key Subtypes
There are a couple of terms that that it can be helpful to understand when dealing with key as there are two typical subtypes that are commonly discussed: simple and compound keys.
Simple Keys
A simple key is simply one column. this makes life easier simply by making your coding shorter. In most commercial databases, that one column is probably going to be some kind of identifier, such as a Vehicle Identification Number (VIN), International Standard Book Number (ISBN), UPC or some other industry code. Within that identifier you’re going to get worried about syntax, check digits, and verification procedures appropriate in that industry. Simple keys are not always that simple internally.
Compound Keys
A compound key is made up of two or more columns. Generally speaking, the relationship among the columns in the key can be as coordinates or as a hierarchy. The columns in a coordinate relationship are independent of each other; the most obvious example of this would be (Longitude, Latitude) pairs. This can be generalized to {c1, c2, .. cn}.
Another type of compound key is hierarchical. Another obvious example would be cities within states {city_name, state_code}, so we might have things like (‘Montgomery’, ‘OH’), (‘Springfield’, ‘OH’) and (‘Springfield’, ‘IL’). Please note that the combination of columns in a hierarchical compound key is what must be unique, but the components can have duplicates. Did you ever wonder why television shows use “Springfield” as a town name so often? It’s the most common town name in the United States, so you can leave the audience guessing in what state your show is taking place.
It is also possible to have a compound key that has both hierarchical and coordinate relationships among its columns.
Overlapping Compound Keys
Overlapping keys can be either hierarchical, coordinates, or compound. Their defining characteristic is that the two or more overlapping keys share columns in their declarations. One example of this might be different political districts drawn on the same geography. For example, in Texas, school districts, water districts, political districts and other administrative districts all overlap in weird ways. People living across the street from each other and find that they have to vote for different school board members and other services.
FOREIGN KEY and REFERENCES Clauses
There is a data integrity feature that I wish more people would use. It is the REFERENCES
clause. It says, in effect, that the columns mentioned in this are not really in this table and column but are referenced here. They are foreign keys which live somewhere else. It has to be a key so that a reference doesn’t go to more than one value. Here is a skeleton:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 |
CREATE TABLE Persons (employee_id CHAR(10) NOT NULL PRIMARY KEY --,.. ); CREATE TABLE Bowling_Teams (bowling_team_name CHAR(20) NOT NULL, player_id CHAR(10) NOT NULL REFERENCES Personnel (employee_id) --,.. ); CREATE TABLE Softball_Teams (softball_team_name CHAR(20) NOT NULL, player_id CHAR(10) NOT NULL REFERENCES Persons (employee_id) --,.. ); |
Note, please, that it’s possible to also declare PRIMARY KEY (bowling_team_name, player_id)
or UNIQUE (bowling_team_name, player_id)
or completely skip having any “unique-ifier” on the Persons
table (and likewise on the Bowling_Teams
table too).
In this skeleton example, we are trying to enforce business rules which say only one of our personnel can play on our company teams. You can also define a FOREIGN KEY
constraint on multiple columns, with the following SQL syntax:
1 2 3 4 |
CONSTRAINT <constraint name> FOREIGN KEY (<referencing column list> REFERENCES <referenced table name> (<referenced column list>). |
This detaches the constraint from a single column in the referencing table. This syntax lets you name multi-column foreign keys so you can handle them as a single unit of work. All the references had to reference what was actually there, so you couldn’t just drop or alter the referenced columns and break the relationship. You have to change the referencing columns and then you could drop or alter the referenced columns.
As a note, the only way to do this kind of references before constraints was with TRIGGER
objects.
Relational people tend to hate TRIGGER
objects because they are procedural code and we want a declarative language. We looked at what was actually being done with the references, we found that 95% or more of the procedural code did only a few tasks (often implementing relationships. This led us to add declarative referential integrity (DRI) clauses to the language which go at the end of the foreign key reference clause. Here’s a quick summary. For more on triggers and my disdain for them, I wrote this piece for Simple Talk back in 2016.
Declarative Referential Integrity (DRI) Actions
In this section, I will introduce the different DRI actions that you have to choose from when building your foreign key references objects. When you create a constraint, as shown, we have syntax as shown in the previous section with the one addition:
1 2 3 4 |
CONSTRAINT <constraint name> FOREIGN KEY (<referencing column list> REFERENCES <referenced table name> (<referenced column list>) [DRI ACTION]. |
The DRI ACTION
clause which will tell you what will happen when references rows are removed. In the following sections I will cover the action types. It is optional and the default will be covered in the first subsection.
ON [DELETE | UPDATE] NO ACTION
This is the default behavior if ON DELETE
or ON UPDATE
is not specified. If an attempt is made to delete referenced rows that are currently referenced, the database will raise an error and the delete will be rolled back.
ON [DELETE | UPDATE] RESTRICT
For the majority of relational database systems, RESTRICT
is equivalent to NO ACTION
. Where RESTRICT
is specifically implemented by a database system, the difference is that RESTRICT
results in the constraint being checked at the start of processing instead of at the end, which is the case with NO ACTION
.
ON [DELETE | UPDATE] SET NULL
If referenced row is deleted, the referencing values are updated to NULL
. Obviously, the referencing column have to be NULL
-able. This option is not used often in practice because the referenced columns are most often primary keys or unique constraints being used to define keys. People tend to avoid NULLs
in such constructs.
ON [DELETE | UPDATE] SET DEFAULT
When referenced rows are deleted or updated, the foreign keys are changed to their DEFAULT
value. Obviously, the referenced columns must either have a DEFAULT
value specified, or be NULL
-able, in which case NULL
is considered the DEFAULT
value. Any non-NULL
DEFAULT
value must be legal in the referencing table. You can think of NULL
as the universal default default.
ON [DELETE | UPDATE] CASCADE
This is the workhorse. Let’s break it down into both options starting with ON DELETE CASCADE.
When a reference row is deleted, all the referencing rows are also deleted. This means that if we fire employee #123, then he or she is automatically removed from the bowling team and the softball team, without us having to invoke separate DELETE FROM
statements. But more than that, cascade can keep going!
I can build a network of cascades, so that when the employee is removed from the Personnel table, they are automatically removed from the Softball_Teams
table, removing them from the softball team table then can cascade to the “Beer and Pizzas
” table that was referenced by the Softball_Teams
, and so forth for several levels. You might want to look at how many levels your SQL engine can handle before you get too deep in the nesting. The engine must remember all of this, just in case there’s a ROLLBACK
.
The ON UPDATE CASCADE
behaves pretty much the same way. An update to the key in the referenced table is propagated to the referencing tables automatically. Again, if you build chains of references, then the system has to remember the previous values in case of a ROLLBACK
.
Self Referencing Constraints
If you want to create a table with a self-referencing FOREIGN KEY
a column, you can do this in one CREATE TABLE
statement in SQL Server.
1 2 3 4 5 6 7 |
CREATE TABLE Personnel (emp_id CHAR(10) NOT NULL PRIMARY KEY, emp_name VARCHAR(20), manager_emp_id CHAR(10) --this is NULL-able CONSTRAINT ref_mgr_emp_id FOREIGN KEY (manager_emp_id) REFERENCES Personnel(emp_id)) |
Please ignore the fact that this is a generally bad design. Being somebody’s manager is not an attribute of the employee; it’s a relationship among employees. But the important point is that before you can reference something it has to exist. In SQL, a CREATE TABLE
statement occurs all at once as a transaction.
That means its columns all appear the same time. If you’re interested. Standard SQL has a CREATE SCHEMA
statement that brings everything into existence all at once. This means that some circular references can be constructed and a lot of other things I will not bother with here.
All of this can get a little tricky. Let’s say I want to create a history table such that the start date of one event is guaranteed to be a termination date of a prior event. Basically, were building a temporal chain. The problem is getting this started in SQL. When I insert my initial event, the preceding event column has nothing to reference! Even the initial event doesn’t exist yet because were trying to insert it.
The solution is to turn off the constraints, insert the initial event, and then turn them back on again. So, the second event can reference back to the first event. You can do this in SQL Server with some techniques developed by Alex Kuznetsov or in ANSI/ISO Standard SQL some ugly looking syntax features which control when a constraint is applied.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 |
CREATE TABLE Events ( event_id CHAR(10) NOT NULL, previous_event_end_date DATE NOT NULL CONSTRAINT FK_Chained_Dates REFERENCES Events (event_end_date), event_start_date DATE NOT NULL, -- NULL means event ongoing event_end_date DATE UNIQUE, PRIMARY KEY (event_id, event_start_date), --Start must be less or equal to end CONSTRAINT Event_Order_Valid CHECK (event_start_date <= event_end_date), --previous end must be one day before start date CONSTRAINT CK_Chained_Dates CHECK (DATEADD(DAY, 1, previous_event_end_date) = event_start_date) -- alternative solution -- CHECK (previous_event_end_date + INTERVAL '01' DAYS) -- = event_start_date) ); -- disable the Chained_Dates constraint ALTER TABLE Events NOCHECK CONSTRAINT FK_Chained_Dates; -- insert a starter row INSERT INTO Events(event_id, previous_event_end_date, event_start_date, event_end_date) VALUES ('Foo Fest', '2010-01-01', '2010-01-02', '2010-01-05'); -- enable the constraint in the table ALTER TABLE Events CHECK CONSTRAINT FK_Chained_Dates; -- this works INSERT INTO Events(event_id, previous_event_end_date, event_start_date, event_end_date) VALUES ('Glob Week', '2010-01-05', '2010-01-06', '2010-01-10'); --Fails because the previous_event_end date is not in the table --AND fails because the previous_end_date is not 1 day before the --event_start_date INSERT INTO Events(event_id, previous_event_end_date, event_start_date, event_end_date) VALUES ('Snoob', '2010-01-09', '2010-01-11', '2010-01-15'); |
Starting to see why I’ve been saying that most of the hard work in SQL is done in the DDL and not in the DML?
Conclusion
Much like so much of my advice on employing DDL to protect data, using PRIMARY KEY
, UNIQUE
, FOREIGN KEY
/REFERENCES
type objects is still a great way to protect the integrity of your data rather than attempting to do this on your own using SQL code, TRIGGER objects, or though application code.
Load comments