The DRI Subject of References

A database must be able to maintain and enforce the business rules and relationships in the data in order to maintain the data model. It does this through referential constraints. They aren't complex, but they are powerful, especially with the means to attach DRI actions to them. Joe Celko explains all, and pines for the ANSI CREATE ASSERTION statement.

In the early days of SQL, it was assumed that SQL would never be ready for prime time. Oh, it might be useful for ad hoc data analysis of some kind, but not for the heavy lifting required for production work. That would always belong to COBOL with ISAM, IDMS, IMS, Total, and other network-based models of databases and file systems.

Well, that turned out not to be true. But there is more to the story than just “SQL got bigger tables and faster searches”; SQL got data integrity features! This is a major difference between a file system and a database. A file system simply holds the data, without any regard to what it means, or to the business rules or relationships in the data. That kind of thing was the job of an application program. Lots and lots of application programs. You could not keep all those programs in synch, and so you could not guarantee data integrity.

A database should enforce and maintain the business rules and relationships in the data. This is not part of any application program; it is part of the data model. By having the rules separated from the applications, we know that everyone is playing with the same data, and that everything is done one way, in one place, one time. And we do not have to write thousands of lines of the same code over and over, and then maintain them. Yes, it is possible to subvert data integrity, but you have to go out of your way to do it.

In data modeling, we have the concept of strong and weak entities. A strong entity exists on its own merits, and a weak entity has to have a strong entity paired with it. The classic example is the SQL idiom for an Orders database we have all seen. Here is the simplest version and it is wrong:
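
(A minimal sketch; the table and column names here are illustrative, since they vary from shop to shop.)

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_date DATE NOT NULL);

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL,
 line_nbr INTEGER NOT NULL, -- physical line on the paper form
 sku CHAR(10) NOT NULL,
 PRIMARY KEY (order_nbr, line_nbr));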

There are lots of errors here. The physical line number in the order details table is not a logical construct. This model copies the paper order form’s lines into rows in a table. We do not do that in a valid logical data model! I can order the same carton of buzzard eggs several times on different lines on a paper form, but in the logical model, I want to remove that redundancy by having an “order_qty” column. And if you are a good programmer, you will have a CHECK (order_qty > 0) on the column.

The concept of a tiered architecture is strange to old monolithic procedural language programmers. You need an input layer that will aggregate those buzzard eggs under the same sku (Stock Keeping Unit), as a single order detail. This is also where we can validate check digits, standardize the street address and phone numbers, etc. The goal is to put only clean data into the database, not use the base tables as a staging area.
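
As a sketch of what such an input layer might do, assume a hypothetical staging table of raw order lines; the aggregation is a simple GROUP BY done before the data ever touches the base tables:

SELECT order_nbr, sku,
       SUM(line_qty) AS order_qty
  FROM Raw_Order_Lines -- hypothetical staging table in the input layer
 GROUP BY order_nbr, sku;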

This first skeleton schema had no Declarative Referential Integrity (DRI) between these two tables. Look at what happens when an Order is dropped; the Order Details will be orphaned. I have seen SQL systems where the orphans fill the disk and have to be cleaned out with homemade garbage collection procedures every few weeks. Performance is awful, of course. Before you ask, the old network databases had their versions of this problem. But they came with built-in garbage collection procedures that had to be used on a schedule as part of a defragmentation run.

SQL got smarter, started doing maintenance functions automatically and added a declarative REFERENCES clause on the weak entity table. Here is the improved version:
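
(Again a sketch with the same illustrative names; the REFERENCES clause and the order_qty column are the changes that matter.)

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 order_date DATE NOT NULL);

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
    REFERENCES Orders (order_nbr),
 sku CHAR(10) NOT NULL,
 order_qty INTEGER NOT NULL
    CHECK (order_qty > 0),
 PRIMARY KEY (order_nbr, sku));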

The Orders table is called the referenced table and the Order_Details table is called the referencing table. Please notice that we do not have “parent” and “child” tables in RDBMS. Those terms are from the old network databases. They were implemented by pointers and had to be between two distinct “tables” (actually, files in those days)!

Too many new SQL programmers will stop at this point. However, we have more power from the options in SQL. The referenced columns do not have to be the PRIMARY KEY; they just have to be a key, which means that they are declared NOT NULL UNIQUE in the referenced table. But keys can have more than one column! You use a FOREIGN KEY() clause when you have a multiple-column reference. The syntax is:
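
FOREIGN KEY (<referencing column list>)
   REFERENCES <referenced table> (<referenced column list>)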

The referenced column list must match the referencing column list by data type and position. This is called “union compatible” in the jargon. Many noobs are so afraid of multi-column keys that they will use all kinds of non-relational kludges to work around them and get a single-column “surrogate key” (actually, Dr. Codd said that surrogates are not exposed to or maintained by the user, but skip that for now). Let’s explain this with an example, say (someplace_longitude, someplace_latitude). If you assign a GUID, concatenate the values of the pair, or use whatever kludge you like, then how do you assure that your “surrogate” is properly matched to a real key pair? And why did you want to increase redundancy? Learn to use cut & paste if you think that the lengths of column lists are a problem.
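
To make that concrete, here is a hypothetical pair of tables keyed on the coordinate pair:

CREATE TABLE Someplaces
(someplace_longitude DECIMAL(9,6) NOT NULL,
 someplace_latitude DECIMAL(9,6) NOT NULL,
 someplace_name VARCHAR(35) NOT NULL,
 PRIMARY KEY (someplace_longitude, someplace_latitude));

CREATE TABLE Deliveries
(delivery_nbr INTEGER NOT NULL PRIMARY KEY,
 someplace_longitude DECIMAL(9,6) NOT NULL,
 someplace_latitude DECIMAL(9,6) NOT NULL,
 FOREIGN KEY (someplace_longitude, someplace_latitude)
    REFERENCES Someplaces (someplace_longitude, someplace_latitude));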

When you give names to the constraints with the optional CONSTRAINT <constraint name> clause, the error messages will use the given name and a human being can read them. These names are global to the schema, not just local to a table. But these constraint names also mean that you can turn a constraint off and then back on. In T-SQL, with a hypothetical constraint name, the syntax is:
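
ALTER TABLE Order_Details
  NOCHECK CONSTRAINT Order_Details_Orders_FK; -- disable the constraint

ALTER TABLE Order_Details
  WITH CHECK CHECK CONSTRAINT Order_Details_Orders_FK; -- re-enable it and re-validate the existing rows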

In the SQL model, we have no idea how REFERENCES is implemented and we do not care. We can have a reference back to the same table, make cycles, or build pretty much any kind of path you want. It is a more abstract and general concept than pointers. Hashing, for example, is another approach that other SQL products use.
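
For example, a self-reference is nothing special; here is a hypothetical organizational chart:

CREATE TABLE Org_Chart
(emp_id CHAR(9) NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL,
 boss_emp_id CHAR(9) -- NULL-able; the president has no boss
    REFERENCES Org_Chart (emp_id));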

You can also add a DRI action clause to the REFERENCES. These were meant to replace TRIGGERs with declarative code. A TRIGGER cannot be optimized because it is procedural code that could do anything in the schema. We noticed that 80-90% of the TRIGGERs in the real world were used to maintain the weak-strong entity relationships, so we made that into a declarative feature. Right now, T-SQL is not as good as other SQL products at optimizing DRI, but there is hope for future releases.

In the ANSI/ISO standard model for both TRIGGERs and DRI, an UPDATE or DELETE is called a database event and it causes a response. Please note that an INSERT is not an event in the ANSI/ISO world, so the T-SQL model is a little different. In fact, T-SQL’s DRI is minimal, but very useful. In the full ANSI/ISO syntax, there are more options about matching rules and deferring constraints. The full ANSI/ISO Standard is also more forgiving about cycles than T-SQL, which disallows any of them.

The DRI syntax, in the T-SQL subset (the full Standard adds MATCH options), is:
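
[CONSTRAINT <constraint name>]
REFERENCES <referenced table> [(<referenced column list>)]
  [ON DELETE {NO ACTION | CASCADE | SET DEFAULT | SET NULL}]
  [ON UPDATE {NO ACTION | CASCADE | SET DEFAULT | SET NULL}]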

The actions change the referencing table when the referenced table has an event.

  • NO ACTION is the default action. If a DELETE or UPDATE is executed on referenced rows, you get an error message about a constraint violation. This is what happens with just the REFERENCES clause by itself, but spelling it out makes the declaration complete.
  • CASCADE specifies that the referencing columns will be updated when the referenced columns are updated, and that referencing rows will be deleted when the referenced rows are deleted (see the sketch after this list). When a referenced row is changed, this can change another table with DRI actions. The SQL engine has to keep track of all of this because if anything goes wrong, the engine has to do a ROLLBACK. This can be a lot of overhead and can get tricky. Try to keep a simple tree structure in the DRI network.
  • SET DEFAULT: The referencing column(s) will be set to their DEFAULT values when an UPDATE or DELETE is performed on referenced rows. Obviously, the referencing column(s) must have DEFAULT clauses.
  • SET NULL: This action specifies that the column will be set to NULL when the referenced column is updated or deleted. Obviously, the referencing column(s) must be NULL-able.
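
For example, here is a cascading version of the Order_Details sketch from earlier:

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
    REFERENCES Orders (order_nbr)
    ON DELETE CASCADE -- dropping an Order removes its details; no orphans
    ON UPDATE CASCADE, -- renumbering an Order follows through
 sku CHAR(10) NOT NULL,
 order_qty INTEGER NOT NULL
    CHECK (order_qty > 0),
 PRIMARY KEY (order_nbr, sku));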

Required Weak Entities

Have you noticed that the referencing table has to find a match in the referenced table, but not vice versa? Thinking about the Orders schema, this says that we cannot take an order for a product we do not have in stock. You will get a constraint violation message and a ROLLBACK. But it is fine to have products in stock that nobody wants to order.

You might have a business rule that allows an order to be posted without any items on it; this would be an abandoned shopping cart on a website. Now consider a schema that models Personnel and their job skills. We can use the Standard Occupational Classification (SOC) codes for the skills.
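
Here is a sketch of the usual first attempt; the column sizes are illustrative, and the soc_code column in Personnel holds the required first skill:

CREATE TABLE Personnel
(emp_id CHAR(9) NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL,
 soc_code CHAR(7) NOT NULL); -- the one required skill, stored as an attribute

CREATE TABLE Skills
(emp_id CHAR(9) NOT NULL
    REFERENCES Personnel (emp_id),
 soc_code CHAR(7) NOT NULL,
 PRIMARY KEY (emp_id, soc_code));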

This guarantees that each employee will have at least one skill. But that skill is modeled as an attribute of an employee, not as a weak entity related to an employee. This means we need to union the two tables to get all the SOC codes together!
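
Something like this:

SELECT emp_id, soc_code
  FROM Personnel
UNION
SELECT emp_id, soc_code
  FROM Skills;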

But this is a mess. I can have the same SOC code in both tables; clearly a redundancy. This skeleton usually occurs when each employee started with only one required skill and it was put into the Personnel table as if it were an attribute and not a relationship. When the requirements changed to include multiple skills, the programmer added a new table rather than more columns. The extra columns would be a repeated group and violate First Normal Form (1NF). But this design already has a repeated group violation: the group simply has one member!

I can keep the skills unique between the two tables with a trigger and some procedural code. But this means procedural code, and we hate procedural code. It is also trickier than you think. If I delete the skill in the Personnel table, then I have to look at the Skills table.

  • If the Skills table has more than one skill left, then you need a rule for picking which skill moves to the Personnel table. You will have to turn off the constraints, lock the tables, move the data, and finally turn the constraints back on.
  • If the Skills table has just one skill left, then you need to move that skill to the Personnel table. But then you need to clean out the Skills table.

This code cannot be optimized and probably cannot be ported easily. You really need to go back to the original strong-weak table model, so the SOC code appears in one and only one place. What happens when we have a cycle between the tables? It is a formal version of the “you need experience to get work and you need work to get experience” vicious cycle in employment.
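
The cleaned-up model is just the strong-weak pair, sketched here with the same hypothetical names; the soc_code now lives in one and only one place:

CREATE TABLE Personnel
(emp_id CHAR(9) NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL);

CREATE TABLE Skills
(emp_id CHAR(9) NOT NULL
    REFERENCES Personnel (emp_id)
    ON DELETE CASCADE,
 soc_code CHAR(7) NOT NULL,
 PRIMARY KEY (emp_id, soc_code));

But now nothing in the DDL forces every employee to have at least one row in Skills; that mutual requirement is the cycle.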

In full ANSI/ISO Standard SQL, we have the CREATE ASSERTION statement. This is a schema-level CHECK() constraint. This is why a constraint name has to be global and not local to a table.
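
For our Personnel and Skills pair, a sketch of such an assertion might look like this:

CREATE ASSERTION Every_Employee_Has_A_Skill
CHECK (NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills));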

What this says is that when you remove all the employees in the Skills table from all the employees in the Personnel table, you should get an empty table. Since this is at the schema level, a violation will cause an error in your current session.

But SQL Server does not have CREATE ASSERTION or deferrable constraints. It does have the WITH CHECK OPTION on VIEWs, and you can “fake it” a little bit with this feature. This feature says to re-check the WHERE clause when the underlying table(s) change. The VIEW has to be on one table and updatable for this to work best.
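
Here is a sketch of the idea; the view shows the Skills rows only while every employee has at least one skill:

CREATE VIEW Valid_Skills (emp_id, soc_code)
AS
SELECT emp_id, soc_code
  FROM Skills
 WHERE NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills)
WITH CHECK OPTION;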

And likewise, you can copy the constraint into a second view that covers the Personnel table data.
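
Again, as a sketch:

CREATE VIEW Valid_Personnel (emp_id, emp_name)
AS
SELECT emp_id, emp_name
  FROM Personnel
 WHERE NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills)
WITH CHECK OPTION;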

Let’s see how this works. Add a new employee:
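
The values here are made up for the example:

INSERT INTO Personnel (emp_id, emp_name)
VALUES ('123456789', 'Wilma');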

But now the Valid_Skills VIEW is empty! The lack of skills for Wilma empties the whole VIEW. You can add the skills and Valid_Skills will return its rows again:
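
Continuing the made-up data, and assuming these are the only rows in the tables:

INSERT INTO Skills (emp_id, soc_code)
VALUES ('123456789', '15-1251');

SELECT emp_id, soc_code
  FROM Valid_Skills;

emp_id     soc_code
---------  --------
123456789  15-1251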

We do not really need the WITH CHECK OPTION in this case, but it is a good safety feature if people use the view for updating. The real trick is to write procedures that update, insert into, and delete from both base tables in a single transaction. I will leave that as an exercise for the reader.