In the early days of SQL, it was assumed that SQL would never be ready for prime time. Oh, it might be useful for ad hoc data analysis of some kind, but not the heavy lifting required for production work. That would always belong to COBOL with ISAM, IDMS, IMS, Total and other network-based models of databases and file systems.
Well, that turned out not to be true. But there is more to the story than just “SQL got bigger tables and faster searches”; SQL got data integrity features! This is a major difference between a file system and a database. A file system simply holds the data, without any regard to what it means, the business rules, or the relationships in the data. That kind of thing was the job of an application program. Lots and lots of application programs. You could not keep all those programs in synch and could not guarantee data integrity.
A database should enforce and maintain business rules and relationships in the data. This is not part of any application program; it is part of the data model. By having rules separated from the applications, we know that everyone is playing with the same data, and that the rules are done one way, in one place, one time. And we do not have to write thousands of lines of the same code over and over. And then maintain them. Yes, it is possible to subvert data integrity, but you have to go out of your way to do it.
In data modeling, we have the concept of strong and weak entities. A strong entity exists on its own merits, and a weak entity has to have a strong entity paired with it. The classic example is the SQL idiom for an Orders database we have all seen. Here is the simplest version and it is wrong:
CREATE TABLE Orders -- strong entity
(order_nbr CHAR(10) NOT NULL PRIMARY KEY,
 ..);

CREATE TABLE Order_Details -- weak entity
(order_nbr CHAR(10) NOT NULL,
 line_nbr INTEGER NOT NULL,
 sku CHAR(15) NOT NULL,
 PRIMARY KEY (order_nbr, line_nbr),
 ..);
There are lots of errors here. The physical line number in the order details table is not a logical construct. This model copies the paper order form’s lines into rows in a table. We do not do that in a valid logical data model! I can order the same carton of buzzard eggs several times on different lines on a paper form, but in the logical model, I want to remove that redundancy by having an “order_qty” column. And if you are a good programmer, you will have a CHECK (order_qty > 0) on the column.
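Here is a sketch of the corrected weak entity under that design: one row per SKU, with a quantity column instead of physical line numbers. (This is my illustration, not the article's original DDL; the DRI clause comes later.)

```sql
CREATE TABLE Order_Details -- weak entity, corrected sketch
(order_nbr CHAR(10) NOT NULL,
 sku CHAR(15) NOT NULL,
 order_qty INTEGER NOT NULL
   CHECK (order_qty > 0), -- the good programmer's check
 PRIMARY KEY (order_nbr, sku),
 ..);
```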
The concept of a tiered architecture is strange to old monolithic procedural language programmers. You need an input layer that will aggregate those buzzard eggs under the same sku (Stock Keeping Unit), as a single order detail. This is also where we can validate check digits, standardize the street address and phone numbers, etc. The goal is to put only clean data into the database, not use the base tables as a staging area.
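As a sketch of that input-layer aggregation, assuming a hypothetical Raw_Order_Lines staging structure that holds the paper-style lines:

```sql
-- hypothetical staging query: collapse duplicate SKUs from the
-- raw order lines into one clean detail row per SKU
SELECT order_nbr, sku, SUM(order_qty) AS order_qty
  FROM Raw_Order_Lines -- assumed input-layer staging table
 GROUP BY order_nbr, sku;
```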
This first skeleton schema had no Declarative Referential Integrity (DRI) between these two tables. Look at what happens when an Order is dropped; the Order Details will be orphaned. I have seen SQL systems where the orphans fill the disk and have to be cleaned out with homemade garbage collection procedures every few weeks. Performance is awful, of course. Before you ask, the old network databases had their versions of this problem. But they came with built-in garbage collection procedures that had to be used on a schedule as part of a defragmentation run.
SQL got smarter, started doing maintenance functions automatically, and added a declarative REFERENCES clause on the weak entity table. Here is the improved version:
CREATE TABLE Order_Details
(order_nbr CHAR(10) NOT NULL
   REFERENCES Orders (order_nbr),
 sku CHAR(15) NOT NULL,
 PRIMARY KEY (order_nbr, sku),
 ..);
The Orders table is called the referenced table and the Order_Details table is called the referencing table. Please notice that we do not have “parent” and “child” tables in RDBMS. Those terms are from the old network databases, which were implemented with pointers and had to link two distinct “tables” (actually, files in those days)!
Too many new SQL programmers will stop at this point. However, we have more power from the options in SQL. The referenced columns do not have to be the PRIMARY KEY; they just have to be a key, which means that they are declared NOT NULL UNIQUE in the referenced table. But keys can have more than one column! You can use a FOREIGN KEY () clause when you have a multiple-column reference. The syntax is
[CONSTRAINT <constraint name>]
FOREIGN KEY (referencing_col_1, .., referencing_col_n)
  REFERENCES <referenced table name> (referenced_col_1, .., referenced_col_n)
The referenced column list must match the referencing column list by data type and position. This is called “union compatible” in the jargon. Many noobs are so afraid of multi-column keys that they will use all kinds of non-relational kludges to work around them and get a single-column “surrogate key” (actually, Dr. Codd said that surrogates are not exposed to or maintained by the user, but skip that for now). Let’s explain this with an example, say (someplace_longitude, someplace_latitude). If you assign a GUID, concatenate the values of the pair, or whatever kludge you like, then how do you assure that your “surrogate” is properly matched to a real key pair? And why did you want to increase redundancy? Learn to use cut & paste if you think that the lengths of column lists are a problem.
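Here is a minimal sketch of using that two-column key directly; the table and non-key column names are invented for illustration:

```sql
CREATE TABLE Someplaces
(someplace_longitude DECIMAL(9,6) NOT NULL,
 someplace_latitude DECIMAL(8,6) NOT NULL,
 someplace_name VARCHAR(35) NOT NULL,
 PRIMARY KEY (someplace_longitude, someplace_latitude));

CREATE TABLE Sightings -- hypothetical referencing table
(sighting_date DATE NOT NULL,
 someplace_longitude DECIMAL(9,6) NOT NULL,
 someplace_latitude DECIMAL(8,6) NOT NULL,
 CONSTRAINT Sightings_Have_Places
   FOREIGN KEY (someplace_longitude, someplace_latitude)
   REFERENCES Someplaces (someplace_longitude, someplace_latitude));
```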
When you give names to the constraints with the optional CONSTRAINT <constraint name> clause, the error messages will use the given name and a human being can read them. These names are global to the schema, not just local to a table. But these constraint names also mean you can turn the constraint off and then back on. The syntax is …
ALTER TABLE <table name>
[NOCHECK | CHECK] CONSTRAINT <constraint name>

Consider this self-reference trick to prevent gaps in a timeline of events:

CREATE TABLE Events
(event_id CHAR(10) NOT NULL,
 previous_event_end_date DATE NOT NULL
   CONSTRAINT Chained_Dates
   REFERENCES Events (event_end_date),
 event_start_date DATE NOT NULL,
 event_end_date DATE UNIQUE, -- null means event in progress
 PRIMARY KEY (event_id, event_start_date),
 CONSTRAINT Event_Order_Valid
   CHECK (event_start_date <= event_end_date),
 CONSTRAINT Adjacent_Events
   CHECK (DATEADD(DAY, 1, previous_event_end_date) = event_start_date),
 << other stuff for this event >>
);

-- disable the Chained_Dates constraint
ALTER TABLE Events NOCHECK CONSTRAINT Chained_Dates;

-- insert a starter row
INSERT INTO Events (event_id, previous_event_end_date, event_start_date, event_end_date)
VALUES ('Foo Fest', '2010-01-01', '2010-01-02', '2010-01-05');

-- enable the constraint in the table
ALTER TABLE Events CHECK CONSTRAINT Chained_Dates;

-- this works
INSERT INTO Events (event_id, previous_event_end_date, event_start_date, event_end_date)
VALUES ('Glob Week', '2010-01-05', '2010-01-06', '2010-01-10');

-- this fails
INSERT INTO Events (event_id, previous_event_end_date, event_start_date, event_end_date)
VALUES ('Snoob', '2010-01-09', '2010-01-11', '2010-01-15');
In the SQL model, we have no idea how REFERENCES is implemented and we do not care. We can have a reference back to the same table, make cycles, or pretty much any kind of path you want. It is a more abstract and general concept than pointers. Hashing is another approach, for example, that other SQL products use.
You can also add a DRI action clause to the REFERENCES clause. These actions were meant to replace TRIGGERs with declarative code. A TRIGGER cannot be optimized because it is procedural code that could do anything in the schema. We noticed that 80-90% of the TRIGGERs in the real world were used to maintain the weak-strong entity relationships, so we made that into a declarative feature. Right now, T-SQL is not as good as other SQL products at optimizing DRI, but there is hope for future releases.
In the ANSI/ISO standard model for both TRIGGERs and DRI, an UPDATE or DELETE is called a database event and it causes a response. Please note that an INSERT is not an event in the ANSI/ISO world, so the T-SQL model is a little different. In fact, T-SQL’s DRI is minimal, but very useful. In the full ANSI/ISO syntax, there are more options about matching rules and deferring constraints. The full ANSI/ISO Standard is also more forgiving about cycles than T-SQL, which disallows any of them.
The DRI syntax is
ON [UPDATE | DELETE] <DRI action>
The actions change the referencing table when the referenced table has an event.
- NO ACTION is the default action. If a DELETE or UPDATE is executed against referenced rows, you get an error message about a constraint violation. This is what happens with just the REFERENCES clause by itself, but spelling it out makes things complete.
- CASCADE specifies that the referencing columns will be updated when the referenced columns are updated, and that referencing rows will be deleted when the referenced rows are deleted. When a referenced row is changed, this can change another table with DRI actions. The SQL engine has to keep track of all of this because if anything goes wrong, the engine has to do a ROLLBACK. This can be a lot of overhead and get tricky. Try to keep a simple tree structure in the DRI network.
- SET DEFAULT: The referencing column(s) will be set to their DEFAULT values when an UPDATE or DELETE hits the referenced rows. Obviously, the referencing column(s) must have DEFAULT clauses.
- SET NULL: This action specifies that the referencing column(s) will be set to NULL when the referenced column is updated or deleted. Obviously, the referencing column(s) must be NULL-able.
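For example, putting CASCADE actions on the Orders relationship would keep Order_Details rows from being orphaned when an order is renumbered or removed. This is a sketch built on the earlier skeleton, not the article's verbatim DDL:

```sql
CREATE TABLE Order_Details
(order_nbr CHAR(10) NOT NULL
   REFERENCES Orders (order_nbr)
   ON UPDATE CASCADE  -- renumbering an order renumbers its details
   ON DELETE CASCADE, -- dropping an order drops its details
 sku CHAR(15) NOT NULL,
 PRIMARY KEY (order_nbr, sku));
```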
Required Weak Entities
Have you noticed that the referencing table has to find a match in the referenced table, but not vice versa? Thinking about the Orders schema, this says that we cannot take an order for a product we do not have in stock. You will get a constraint violation message and a ROLLBACK. But it is fine to have products in stock that nobody wants to order.
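That rule comes from another REFERENCES clause. Here is a sketch, with a hypothetical Inventory table standing in for the product catalog:

```sql
CREATE TABLE Inventory -- hypothetical referenced table
(sku CHAR(15) NOT NULL PRIMARY KEY,
 onhand_qty INTEGER NOT NULL
   CHECK (onhand_qty >= 0));

CREATE TABLE Order_Details
(order_nbr CHAR(10) NOT NULL
   REFERENCES Orders (order_nbr),
 sku CHAR(15) NOT NULL
   REFERENCES Inventory (sku), -- no orders for unknown products
 PRIMARY KEY (order_nbr, sku));
```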
You might have a business rule that allows an order to be posted, but not have any items on the order. This would be an abandoned shopping cart on a website. Now consider a schema that models Personnel and their job skills. We can use the Standard Occupational Classification (SOC) for the skills.
CREATE TABLE Personnel
(emp_id CHAR(12) NOT NULL PRIMARY KEY,
 soc CHAR(7) NOT NULL
   CHECK (soc LIKE '[0-9][0-9]-[0-9][0-9][0-9][0-9]'));

CREATE TABLE Skills
(soc CHAR(7) NOT NULL
   CHECK (soc LIKE '[0-9][0-9]-[0-9][0-9][0-9][0-9]'),
 emp_id CHAR(12) NOT NULL
   REFERENCES Personnel (emp_id),
 PRIMARY KEY (soc, emp_id));
This guarantees that each employee will have at least one skill. But this skill is modeled as an attribute of an employee, not a weak entity related to an employee. This means we need to union the two tables to get all the SOC codes together!
CREATE VIEW SOC_List (emp_id, soc)
AS
(SELECT emp_id, soc FROM Skills
 UNION
 SELECT emp_id, soc FROM Personnel);
But this is a mess. I can have the same SOC code in both tables; clearly a redundancy. This skeleton usually occurs when each employee started with only one required skill and it was put into the Personnel table as if it was an attribute and not a relationship. When the requirements changed to include multiple skills, the programmer added a new table rather than more columns. The extra columns would be a repeated group and violate First Normal Form (1NF). But this design already has a repeated group violation – the group simply has one member!
I can keep the skills unique between the two tables with a trigger and some procedural code. But this means procedural code, and we hate procedural code. It is also trickier than you think. If I delete the skill in the Personnel table, then I have to look at the Skills table.
- If the Skills table has more than one skill left, then you need a rule for moving a skill to the Personnel table. You will have to turn off the constraints, lock the tables, move the data and finally turn on the constraints again.
- If the Skills table has just one skill left, then you need to move that skill to the Personnel table. But then you need to clean out the Skills table.
This code cannot be optimized and probably cannot be ported easily. You really need to go back to the original strong-weak table model, so the SOC code appears in one and only one place. What happens when we have a cycle between the tables? It is a formal version of the “you need experience to get work, and you need work to get experience” vicious cycle in employment.
In full ANSI/ISO Standard SQL, we have the CREATE ASSERTION statement. This is a schema-level CHECK() constraint. This is why a constraint name has to be global and not local to a table.
CREATE ASSERTION One_Skill_per_Employee
CHECK (NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills));
What this says is that when you remove all the employees in the Skills table from all the employees in the Personnel table, you should get an empty table. Since this is at the schema level, a violation will cause an error in your current session.
But SQL Server does not have CREATE ASSERTION or deferrable constraints. It has the WITH CHECK OPTION on VIEWs, and you can “fake it” a little bit with this feature. It says to re-check the WHERE clause when the underlying table(s) change. The VIEW has to be on one table and updatable for this to work best.
CREATE VIEW Valid_Skills
AS
SELECT *
  FROM Skills
 WHERE NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills)
WITH CHECK OPTION;
And likewise, you can copy the constraint into a second view that covers the Personnel table data.
CREATE VIEW Valid_Personnel
AS
SELECT *
  FROM Personnel
 WHERE NOT EXISTS
        (SELECT emp_id FROM Personnel
         EXCEPT
         SELECT emp_id FROM Skills)
WITH CHECK OPTION;
Let’s see how this works. Add a new employee:
INSERT INTO Personnel VALUES ('Wilma', '03-0000'); |
But now the Valid_Skills VIEW is empty! The lack of skills for Wilma empties the whole VIEW. You can add the skills and Valid_Skills will return:
INSERT INTO Skills (emp_id, soc)
VALUES ('Wilma', '03-0001'), ('Wilma', '03-0002');
We do not really need the WITH CHECK OPTION in this case, but it is a good safety feature if people use the view for updating. The real trick is to write procedures that update, insert into, and delete from both base tables in a single transaction. I will leave that as an exercise for the reader.