Uniqueness, Keys, and Identity

Comments 0

Share to social media

If you’ve ever had a traditional logic course, you’ll have run into “The Law of Identity” as the founding principle of all Western thought. It says that:

“To be is to be something in particular; to be nothing in particular or anything in general, is to be nothing at all.”

This usually gets expressed as “A = A” thanks to Ayn Rand’s Objectivism and credited to Aristotle. You want to dig a little deeper, however, it actually predates Aristotle’s laws of logic.

More to the point, this is why we have keys in the relational data model and why those keys give an identity to each row in a table which models an entity in a set.

Later Gottfried Leibniz, whom you might remember as the “other guy” who invented calculus, held that Entity x is the same as Entity y if and only if every predicate true of x is true of y as well. What it means for practical purposes for us RDBMS people is that you should not have two or more tables with exactly the same structure. One set of entities should have one and only one table to model it.

This can be a little more subtle than you might think. There is a design flaw called attribute splitting, in which has defined extra tables based on the values of an attribute. For example, you would never define your primary human resources table based on the type of phone they use: iPhone_Using_Persons and another table for Android_Using_Persons. The type of phone an employee uses is clearly an attribute and this design have splits what should be one table into two based on the value of an attribute, and not on attribute itself. And then what if another phone type emerges? New table?

There are caveats of course. You might need a table such as this to record specific attributes of that person, but that is a different topic. It would not be the main Persons table.

Concept of Keys in RDBMS

In RDBMS, we establish uniqueness of the rows in the table by using a key. A key is a subset of the columns of that table which is unique in that table. Please note that by definition, a key cannot be a table property, like IDENTITY or a GUID. A key must be made up of attributes, which means actual table columns.

IDENTITY is a table property, much like the parking space number in a garage. It is not an attribute of the entity being modeled by the table (like the VIN number burned into the engine block), but a value derived from the physical storage of the data. Trust me, no automobile, employee, piece of merchandise, or anything else you’re trying to represent with the table came with some IDENTITY tattooed on it or as an innate attribute. Just different parking garages can have spaces with the same numbers, so totally unrelated tables (even in different schemas!) can have the same value in an IDENTITY column.

There’s only one IDENTITY allowed per a table. If it were an attribute, it might be possible for an entity to have more than one of that same kind of attribute. For example, a table that models automobile trips can have ZIP codes for both the departure and arrival locations. Attributes also don’t have to be all the same data type, but the IDENTITY column is always a whole number (typically using an integer type of some sort, but can also use a numeric type.) The Microsoft implementation is also issued sequentially. This sequence is based on the order in which the physical data is inserted into the physical table. This is why I like to use that parking garage analogy.

The bad news is that I can take the same raw data, declare the same table structure to model the same entity on two different machines, shuffle the input a little bit, and then do one or more INSERT INTO statements. The IDENTITY columns which are supposed to represent attributes of the same entity in each copy of the same data will be different. IDENTITY is not an intrinsic property of the entity, but of the storage!

The IDENTITY feature exists because the first SQL engines were built on top of existing hardware and software. UNIX file systems were based on magnetic tape files which have to have sequential access by their physical nature. Indexing came along a little bit later and was put on top of this file model. The world might have been different if file systems had started off with hashing, but that would be rather complicated.

Dr. Codd and The Primary Key

In the original papers of RDBMS, Dr. Codd introduced the concept of keys and in particular the concept of a “primary key” which was dropped later. Keys were a part of the relational model from the start, but the primary key was supposed to be special in some way. And there was even a convention of underlying the column names that make up the primary key in the data declarations in the early papers. This was a holdover from sequential file days. Obviously, a physical file can only be sorted one way. File searches, particularly those on mag tape, depended on having a sorted order.

By definition, keys had to be all non-NULL values; remember that a NULL is not technically a value. But someone looked around and realized that the relational model was based on sets, which have no ordering. All keys have equal “key-ness” by their nature, and nobody is special.

Another construct in SQL was the UNIQUE constraint. It explains itself, but it also permits a single NULL value in each column. Remember that SQL has two models for comparisons. The first method is regular equality which compares all the NULLs as UNKNOWN with any values or other NULLs.

The second method puts all the NULLs Into a single group for purposes of the GROUP BY and a few other minor operators. We currently have added the infixed comparison operator:

which uses the grouping method, not equality. This means that it always returns TRUE or FALSE, so we don’t have to worry about comparing a NULL to a value or another NULL, and getting an UNKNOWN result.

Basically, PRIMARY KEY (c1,c2,..cn) ⇒ UNIQUE (c1,c2,..cn) NOT NULL. In fact, the BNF for these two constructs are part of the same production in the SQL grammar.

A foreign key is not a key in the referencing table but is it a key in a second (possibly the original) referenced table. The correct terms in SQL and RDBMS are not “parent” and “child” those terms are how we used, to identify records in network databases. The correct terms in RDBMS are “referenced” and “referencing” tables. Typically, in network databases, such relationships were implemented with pointer chains. Find an old timer and ask him about IDMS, TOTAL, Image, IMS and other early database products

Key Subtypes

There are a couple of terms that that it can be helpful to understand when dealing with key as there are two typical subtypes that are commonly discussed: simple and compound keys.

Simple Keys

A simple key is simply one column. this makes life easier simply by making your coding shorter. In most commercial databases, that one column is probably going to be some kind of identifier, such as a Vehicle Identification Number (VIN), International Standard Book Number (ISBN), UPC or some other industry code. Within that identifier you’re going to get worried about syntax, check digits, and verification procedures appropriate in that industry. Simple keys are not always that simple internally.

Compound Keys

A compound key is made up of two or more columns. Generally speaking, the relationship among the columns in the key can be as coordinates or as a hierarchy. The columns in a coordinate relationship are independent of each other; the most obvious example of this would be (Longitude, Latitude) pairs. This can be generalized to {c1, c2, .. cn}.

Another type of compound key is hierarchical. Another obvious example would be cities within states {city_name, state_code}, so we might have things like (‘Montgomery’, ‘OH’), (‘Springfield’, ‘OH’) and (‘Springfield’, ‘IL’). Please note that the combination of columns in a hierarchical compound key is what must be unique, but the components can have duplicates. Did you ever wonder why television shows use “Springfield” as a town name so often? It’s the most common town name in the United States, so you can leave the audience guessing in what state your show is taking place.

It is also possible to have a compound key that has both hierarchical and coordinate relationships among its columns.

Overlapping Compound Keys

Overlapping keys can be either hierarchical, coordinates, or compound. Their defining characteristic is that the two or more overlapping keys share columns in their declarations. One example of this might be different political districts drawn on the same geography. For example, in Texas, school districts, water districts, political districts and other administrative districts all overlap in weird ways. People living across the street from each other and find that they have to vote for different school board members and other services.

FOREIGN KEY and REFERENCES Clauses

There is a data integrity feature that I wish more people would use. It is the REFERENCES clause. It says, in effect, that the columns mentioned in this are not really in this table and column but are referenced here. They are foreign keys which live somewhere else. It has to be a key so that a reference doesn’t go to more than one value. Here is a skeleton:

Note, please, that it’s possible to also declare PRIMARY KEY (bowling_team_name, player_id) or UNIQUE (bowling_team_name, player_id) or completely skip having any “unique-ifier” on the Persons table (and likewise on the Bowling_Teams table too).

In this skeleton example, we are trying to enforce business rules which say only one of our personnel can play on our company teams. You can also define a FOREIGN KEY constraint on multiple columns, with the following SQL syntax:

This detaches the constraint from a single column in the referencing table. This syntax lets you name multi-column foreign keys so you can handle them as a single unit of work. All the references had to reference what was actually there, so you couldn’t just drop or alter the referenced columns and break the relationship. You have to change the referencing columns and then you could drop or alter the referenced columns.

As a note, the only way to do this kind of references before constraints was with TRIGGER objects.

Relational people tend to hate TRIGGER objects because they are procedural code and we want a declarative language. We looked at what was actually being done with the references, we found that 95% or more of the procedural code did only a few tasks (often implementing relationships. This led us to add declarative referential integrity (DRI) clauses to the language which go at the end of the foreign key reference clause. Here’s a quick summary. For more on triggers and my disdain for them, I wrote this piece for Simple Talk back in 2016.

Declarative Referential Integrity (DRI) Actions

In this section, I will introduce the different DRI actions that you have to choose from when building your foreign key references objects. When you create a constraint, as shown, we have syntax as shown in the previous section with the one addition:

The DRI ACTION clause which will tell you what will happen when references rows are removed. In the following sections I will cover the action types. It is optional and the default will be covered in the first subsection.

ON [DELETE | UPDATE] NO ACTION

This is the default behavior if ON DELETE or ON UPDATE is not specified. If an attempt is made to delete referenced rows that are currently referenced, the database will raise an error and the delete will be rolled back.

ON [DELETE | UPDATE] RESTRICT

For the majority of relational database systems, RESTRICT is equivalent to NO ACTION. Where RESTRICT is specifically implemented by a database system, the difference is that RESTRICT results in the constraint being checked at the start of processing instead of at the end, which is the case with NO ACTION.

ON [DELETE | UPDATE] SET NULL

If referenced row is deleted, the referencing values are updated to NULL. Obviously, the referencing column have to be NULL-able. This option is not used often in practice because the referenced columns are most often primary keys or unique constraints being used to define keys. People tend to avoid NULLs in such constructs.

ON [DELETE | UPDATE] SET DEFAULT

When referenced rows are deleted or updated, the foreign keys are changed to their DEFAULT value. Obviously, the referenced columns must either have a DEFAULT value specified, or be NULL-able, in which case NULL is considered the DEFAULT value. Any non-NULL DEFAULT value must be legal in the referencing table. You can think of NULL as the universal default default.

ON [DELETE | UPDATE] CASCADE

This is the workhorse. Let’s break it down into both options starting with ON DELETE CASCADE. When a reference row is deleted, all the referencing rows are also deleted. This means that if we fire employee #123, then he or she is automatically removed from the bowling team and the softball team, without us having to invoke separate DELETE FROM statements. But more than that, cascade can keep going!

I can build a network of cascades, so that when the employee is removed from the Personnel table, they are automatically removed from the Softball_Teams table, removing them from the softball team table then can cascade to the “Beer and Pizzas” table that was referenced by the Softball_Teams, and so forth for several levels. You might want to look at how many levels your SQL engine can handle before you get too deep in the nesting. The engine must remember all of this, just in case there’s a ROLLBACK.

The ON UPDATE CASCADE behaves pretty much the same way. An update to the key in the referenced table is propagated to the referencing tables automatically. Again, if you build chains of references, then the system has to remember the previous values in case of a ROLLBACK.

Self Referencing Constraints

If you want to create a table with a self-referencing FOREIGN KEY a column, you can do this in one CREATE TABLE statement in SQL Server.

Please ignore the fact that this is a generally bad design. Being somebody’s manager is not an attribute of the employee; it’s a relationship among employees. But the important point is that before you can reference something it has to exist. In SQL, a CREATE TABLE statement occurs all at once as a transaction.

That means its columns all appear the same time. If you’re interested. Standard SQL has a CREATE SCHEMA statement that brings everything into existence all at once. This means that some circular references can be constructed and a lot of other things I will not bother with here.

All of this can get a little tricky. Let’s say I want to create a history table such that the start date of one event is guaranteed to be a termination date of a prior event. Basically, were building a temporal chain. The problem is getting this started in SQL. When I insert my initial event, the preceding event column has nothing to reference! Even the initial event doesn’t exist yet because were trying to insert it.

The solution is to turn off the constraints, insert the initial event, and then turn them back on again. So, the second event can reference back to the first event. You can do this in SQL Server with some techniques developed by Alex Kuznetsov or in ANSI/ISO Standard SQL some ugly looking syntax features which control when a constraint is applied.

Starting to see why I’ve been saying that most of the hard work in SQL is done in the DDL and not in the DML?

Conclusion

Much like so much of my advice on employing DDL to protect data, using PRIMARY KEY, UNIQUE, FOREIGN KEY/REFERENCES type objects is still a great way to protect the integrity of your data rather than attempting to do this on your own using SQL code, TRIGGER objects, or though application code.

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.