Declarative SQL: Using References

There are several ingenious ways of using SQL References to enforce integrity declaratively. Declarative Referential Integrity (DRI) is more effective than using procedural code in triggers, procedures or application layers because it uses the SQL paradigm, thereby making optimisation easier and providing clearer expression of the rules underlying the data.


A Bit of History

It’s time for another history lesson, so let’s return once more to the late 1960s, with hip-hugger bellbottoms and psychedelic music. This is also when commercial disk drives were just starting to appear. Disk drives made random access possible in a way that you simply could not do on magnetic tape.

Charles Bachman invented a system called IDS (Integrated Data Store) for General Electric. IBM had a hierarchical system called IMS (Information Management System), but it lacked the flexibility of Bachman’s more general network model. Bachman’s model was so successful for commercial products that it won him the ACM Turing Award and pretty much determined how databases were built for the next few decades. That view was navigational, not relational.

Bachman’s talk for his ACM award was entitled “The Programmer as Navigator”, and that title describes the sort of products that came out of his model. The records are strung together by access paths made up of pointer chains. The programmer imagined that he was on a scooter, traveling those paths and picking up the data as he went. We still processed data one record at a time, but the way we fetched that data was very different from tape files and sequential files.

In the 1970s, CODASYL (the Conference on Data Systems Languages, the creators of COBOL) attempted to standardize some of the terms and to get the ANSI X3H2 database standards committee to rubber-stamp what they were doing. This led to a standard called NDL (Network Database Language), which was passed, never implemented, and then left to expire. We never expected it to become a product; we wanted to get the terminology into a standard so that the various products could be compared objectively.

The method used in all of these early network database products involved linked lists based on pointers. Hashing, RAID, and other more exotic access methods were simply not implemented.

Unfortunately, a lot of the old network terminology and data modeling still exists in the cultural mindset of new SQL programmers. The term “link”, for example, refers to a pointer that can be traversed in either direction. A “slave-master” or “child-parent” link was the term for a pointer that could be traversed in only one direction. A “junction” table was a pointer structure that went from base file records to many different subordinate files. A “chain” was a linked list of subordinate records in sorted order, with an end-of-chain marker of some kind.

Pointers used to be implemented as integers in the hardware. This is probably why IDENTITY and the other “pseudo-pointers” in bad SQL are declared as INTEGER, SMALLINT, or BIGINT. The reason for this exact numeric data type was that, before relational databases, we did arithmetic on pointers, adding and subtracting constants to manipulate them inside the storage of the machine.

When we got to the C programming language, pointer arithmetic became possible, but it was a really, really bad idea. Ask anyone who cheerfully overwrote the operating system on a DEC PDP-11 with it.

These links, chains, slaves, and whatever other names the various vendors used were not part of the data. They were part of the access method used to get to the data.

References and Pointers

In RDBMS and SQL, we do not have pointers; we have references. A reference is not a pointer; it is a very different and more abstract concept. There are only two kinds of tables in this model: referenced and referencing. When you say this out loud, you have to be careful, because the two words sound so much alike, but they are descriptive. The referencing table has to match a subset of columns in the referenced table. The full ANSI Standard has a lot of extra options on the matching criteria, but they are seldom used and do not exist in SQL Server. SQL Server uses a simple grouping equivalence relation for matching.

The referencing table has a FOREIGN KEY constraint that lists the columns of a UNIQUE or PRIMARY KEY constraint in the referenced table, in the order they were declared in the referenced table. It is simply good practice that columns do not change names from table to table. Oh, and the referencing and referenced tables can be the same table, but this kind of self-reference in SQL Server will probably require turning off the constraint, doing an action, and then turning the constraint back on. In the full ANSI version, constraints can be declared DEFERRABLE or NOT DEFERRABLE, INITIALLY DEFERRED or INITIALLY IMMEDIATE, and so on; it can be pretty complicated, but it saves a lot of procedural code.

Pointers are passive; references, in contrast, are active. In the early SQL products, we used triggers because we had no choice. Remember that everything was still built on old file systems. Approximately 80% to 95% of those triggers enforced referential integrity with explicit procedural code.

Today, we have DRI actions in SQL. The SQL engine looks for a database event, defined as a deletion, an update, or an insertion on a table. The DRI actions do not care about insertions, only deletions and updates. New SQL programmers are occasionally surprised when they find rows disappearing or changing in tables that they have not touched. This can mean that another session fired a DRI action, and they had no way of knowing it.

The syntax is a clause added after the list of the referenced table’s columns. The default assumption is that you are referencing the PRIMARY KEY, but it is good practice to be explicit. The basic syntax is:
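Here is the skeleton of the clause; the table, column, and constraint names are placeholders rather than anything from a real schema:

    CREATE TABLE Referenced_Table
    (key_col INTEGER NOT NULL PRIMARY KEY);

    CREATE TABLE Referencing_Table
    (ref_col INTEGER NOT NULL,
     CONSTRAINT placeholder_fk
       FOREIGN KEY (ref_col)
       REFERENCES Referenced_Table (key_col)
       ON DELETE CASCADE
       ON UPDATE CASCADE);

Each of the ON DELETE and ON UPDATE rules is optional and defaults to NO ACTION.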

A FOREIGN KEY can have only one update rule and only one deletion rule. Here is a painfully detailed explanation of each.

  • ON UPDATE CASCADE: any change to a referenced column in the referenced table causes the same change to the corresponding referencing column in matching rows of the referencing table.
  • ON UPDATE SET NULL: any change to a referenced column in the referenced table causes the corresponding referencing column in matching rows of the referencing table to be set to NULL. All the columns that make up the foreign key are set to NULL.
  • ON UPDATE SET DEFAULT: any change to a referenced column in the referenced table causes the corresponding referencing column in matching rows of the referencing table to be set to its default value.
  • ON UPDATE NO ACTION (the default): there is no referential update action; the referential constraint only specifies a constraint check. The ON UPDATE NO ACTION does not perform its constraint check until the entire set of rows to be updated has been processed.
  • ON UPDATE RESTRICT: any change to a referenced column in the referenced table is prohibited if there is a matching row. SQL Server does not have this option.

The deletion rule has the same options.

  • ON DELETE CASCADE: if a row of the referenced table is deleted, then all matching rows in the referencing table are deleted.
  • ON DELETE SET NULL: if a row of the referenced table is deleted, then all referencing columns in all matching rows of the referencing table are set to NULL.
  • ON DELETE SET DEFAULT: if a row of the referenced table is deleted, then all referencing columns in all matching rows of the referencing table are set to the column’s default value.
  • ON DELETE NO ACTION (the default): there is no referential delete action; the referential constraint only specifies a constraint check.
  • ON DELETE RESTRICT: it is prohibited to delete a row of the referenced table if that row has any matching rows in the referencing table. SQL Server does not have this option.

To be honest, the cascades are the most used options in this repertoire. They enforce a data-modeling principle: an entity is classified as weak or strong. A strong entity exists on its own merits; a weak entity exists only because it is protected by a strong entity. For example, an Orders table is a strong entity that has to exist for the weak entity, Order_Details, to exist.

If we delete an order, then we expect that all of its details will also be deleted. In a badly designed system, you might be able to delete the order, but the order detail rows will continue to exist in their table; these unmatched rows are orphans.
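Here is a minimal sketch of that strong/weak pairing, with the column lists cut down to the bare bones:

    CREATE TABLE Orders
    (order_nbr INTEGER NOT NULL PRIMARY KEY,
     order_date DATE NOT NULL);

    CREATE TABLE Order_Details
    (order_nbr INTEGER NOT NULL
       REFERENCES Orders (order_nbr)
       ON DELETE CASCADE,
     sku INTEGER NOT NULL,
     order_qty INTEGER NOT NULL,
     PRIMARY KEY (order_nbr, sku));

    -- deleting the order quietly deletes its details; no orphans are left
    DELETE FROM Orders
     WHERE order_nbr = 42;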

The bad news is that you will not get information about the cascaded action. The statement is done on the referenced table, so the message deals only with that table. You do not get a message about the referencing table(s).

But what if the weak entity has even weaker entities subordinated to it? The action continues to cascade down the referential chain. This means that if you have got a spider’s web of DRI actions, they can go very deep, take some time to complete, and tie up your machine. This is actually a small trade-off for data integrity. Otherwise, you would have to put this logic in the application layers, handle all of the table locking yourself, and, if the action fails, do the ROLLBACK manually.

In the ANSI Standards, table A can reference table B and vice versa. This is probably not a good idea, because SQL Server is very sensitive to cycles in the DRI chains. Consider how you would resolve the situation where table A cascades to table B and also to table C, but table B also cascades to table C. Which action (A or B) has an effect on table C? Since SQL is supposed to be a set-oriented language, they both should happen at the same time. Obviously, this does not work when both actions hit the same row and column, so it is disallowed.

Another version of the same problem would be if table A cascades to table B twice; that is, we have two separate foreign key references in A to table B. SQL Server disallows this; some other SQL products will allow it, but only as long as there is no contradiction. Do not write code like this. It is just too hard to maintain.

SQL Server worries about cycles more than it should. Here is a skeleton that you should play with to get a feel for how the cascades work.
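Something like this (the column names are placeholders): weaker_items has two cascade paths back to strong_items, one direct and one through weak_items.

    CREATE TABLE strong_items
    (strong_id INTEGER NOT NULL PRIMARY KEY);

    CREATE TABLE weak_items
    (weak_id INTEGER NOT NULL PRIMARY KEY,
     strong_id INTEGER NOT NULL
       CONSTRAINT fk_weak_strong
       REFERENCES strong_items (strong_id)
       ON DELETE CASCADE);

    -- two cascade paths reach this table: strong_items -> weaker_items
    -- and strong_items -> weak_items -> weaker_items
    CREATE TABLE weaker_items
    (weaker_id INTEGER NOT NULL PRIMARY KEY,
     strong_id INTEGER NOT NULL
       CONSTRAINT fk_weaker_strong
       REFERENCES strong_items (strong_id)
       ON DELETE CASCADE,
     weak_id INTEGER NOT NULL
       CONSTRAINT fk_weaker_weak
       REFERENCES weak_items (weak_id)
       ON DELETE CASCADE);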

It gives an error message along these lines:
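    Msg 1785, Level 16, State 0
    Introducing FOREIGN KEY constraint 'fk_weaker_weak' on table 'weaker_items'
    may cause cycles or multiple cascade paths. Specify ON DELETE NO ACTION or
    ON UPDATE NO ACTION, or modify other FOREIGN KEY constraints.

(The constraint name in the message will be whatever the offending FOREIGN KEY happens to be called.)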

Here the weaker_items table references both strong_items and weak_items. But this next version will compile, since it has no cycles:
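This sketch keeps the same placeholder names but straightens the reference chain into a single line, from strong_items through weak_items down to weaker_items:

    CREATE TABLE strong_items
    (strong_id INTEGER NOT NULL PRIMARY KEY);

    CREATE TABLE weak_items
    (weak_id INTEGER NOT NULL PRIMARY KEY,
     strong_id INTEGER NOT NULL
       REFERENCES strong_items (strong_id)
       ON DELETE CASCADE);

    -- only one cascade path now:
    -- strong_items -> weak_items -> weaker_items
    CREATE TABLE weaker_items
    (weaker_id INTEGER NOT NULL PRIMARY KEY,
     weak_id INTEGER NOT NULL
       REFERENCES weak_items (weak_id)
       ON DELETE CASCADE);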

Take this skeleton and replace the CASCADE with SET NULL, SET DEFAULT, and NO ACTION when you play with it. Finally, put some actions in a transaction and try a ROLLBACK. It is a good idea to draw a referential diagram (circles for tables, arrows for references) when things get tricky. As soon as you see a cycle (directed or undirected), try again.

Finally, we can have a self-reference. The best example of how to do this is Alexander Kuznetsov’s contiguous dates constraint idiom.
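What follows is a sketch of the idea rather than Kuznetsov’s exact code, with table and column names of my own invention. Each row carries the end date of the previous interval; a self-referencing FOREIGN KEY proves that such an interval really exists, and a CHECK() makes it butt up against the current one:

    CREATE TABLE Events
    (event_id CHAR(10) NOT NULL,
     previous_end_date DATE,  -- NULL only in the first interval of a chain
     start_date DATE NOT NULL,
     end_date DATE NOT NULL,
     PRIMARY KEY (event_id, start_date),
     UNIQUE (event_id, end_date),
     CHECK (start_date <= end_date),
     -- half-open intervals: each one starts exactly where the last one ended
     CHECK (previous_end_date = start_date),
     -- and that previous interval has to actually exist in this table
     CONSTRAINT previous_interval_exists
       FOREIGN KEY (event_id, previous_end_date)
       REFERENCES Events (event_id, end_date));

The redundant previous_end_date column looks odd at first, but it is the whole trick: it can be NULL in the first row of a chain, where the NOT NULL start_date could not escape the FOREIGN KEY, and the constraints police the redundancy everywhere else.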

In full ANSI/ISO Standard SQL, we could have declared the constraints to be initially deferred, but getting the same effect in SQL Server takes a little work. We have to use a procedure with explicit code in it.
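Here is a minimal sketch, assuming the Events table above with its previous_interval_exists constraint. Disabling the constraint stands in for the Standard’s SET CONSTRAINTS ... DEFERRED, and re-enabling it WITH CHECK stands in for the deferred check firing at the end (error handling is left out):

    CREATE PROCEDURE Shuffle_Events
    AS
    BEGIN
    BEGIN TRANSACTION;
    -- a poor man's SET CONSTRAINTS previous_interval_exists DEFERRED
    ALTER TABLE Events NOCHECK CONSTRAINT previous_interval_exists;

    -- ... updates that are consistent only when taken as a whole ...

    -- WITH CHECK re-validates the existing rows; if they do not pass,
    -- this statement fails and the transaction should be rolled back
    ALTER TABLE Events WITH CHECK CHECK CONSTRAINT previous_interval_exists;
    COMMIT;
    END;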

Conclusion

While there are limitations, I hope I have convinced you that DRI actions are a better choice than trying to do this with procedural code in triggers, procedures, or application layers. This is one of the fundamental differences between a file system (passive, dumb storage) and a database (active, smart storage).