Department of Redundancy Department

Ever wonder all the reasons that we use databases instead of file systems? While we don’t think of it too much anymore, the first reason that databases came into existence was to remove redundancies. The first source of redundancy back in the dark ages, when I was just beginning the program, was a product called “carbon paper.” You younger people probably have not seen it. Paper forms came with sheets of this thin black carbon paper between duplicates of the form. As you pressed on the paper form, pressure transferred the image to the duplicates.

If you don’t believe me, Google it. Carbon paper was so ubiquitous that there is a 1950s science fiction short story about a terrible weapon, which was a bacteria or fungus (I cannot remember which) that eats carbon paper. it destroyed the bureaucracies which depended on multiple copies.

Today, even paper forms are becoming rare. For a time, you had to have a paper copy to have a legal signature. In some states the color of the ink which you had to use was also prescribed by law. Today electronic signatures of various kinds are acceptable. I am waiting for the day soon when either facial recognition or DNA will be the acceptable signature.

Very often, each of these copies that were made from the carbon paper was sent to different departments. Each department would keypunch the same data into their files, using a different file layout. There’s a classic horror story about a federal employee who got a reward when he discovered that one of the copies was never used and just thrown away. After giving him his “employee of the month” reward, the bureaucracy bought filing cabinets to save this redundant copy. After all, you never know when you might need it again.

Just as a man with two wristwatches is never truly sure exactly what time it is, so these early file systems could not produce a consistent, verifiable version of the data. As an actual example, I worked for a state highway department which used a multipart form reporting time on projects. We used one copy for employee’s time, one copy for equipment usage and one copy for project expenses. I could look at the time clock file and figure out how many hours a particular employee had charged each week. Likewise, I could go to the equipment usage file and figure out how many hours on which job a particular piece of equipment had been employed.

But the data was never coordinated, so it was almost impossible to answer a query like “how many hours did Bill Smith charge to digging that drainage ditch for the Owl Creek Bridge, and did he use a shovel or a backhoe?” from these files. You had to go back to the original paper forms and calculate the stuff by hand!

Unlike tables in a schema in RDBMS, paper files do not have to have any relationships among them. What we wanted was one fact, in one place, one time and modeled in only one way. This is what databases were supposed to do. But they could still have problems.

Row Level Redundancies

Consider a common business situation, the basic order form, with a skeleton table declaration like this:

If you have ever worked in retail, you no doubt would know that (order_qty * item_unit_price) = item_extension. The input screen on order entry programs will often do this calculation for you. But my point is storing something that can be calculated one and only one way is redundant. You can either hide calculations in a VIEW or use a computed column. It is also safer, since the computer probably makes few calculation errors than you do.

The basic syntax for a computed column is <column_name> AS <expression> [PERSISTED]. A persisted option says that after the value is computed, it is materialized and physically stored in the table. Since the value is now real instead of virtual, it can be used for indexing. There are limitations on the expression; but basically, it has to be something that can be calculated from the other columns in each row.

Columns In Different Tables

Looking at my Orders table, I see I have a reference to the global trade item number (GTIN) of each line item. If you’re not familiar with this, it’s like the UPC code used in North America for retail trade. It’s a little longer, and it lets you determine the country of origin.

What this clause says that the Inventory table uses the GTIN as a key; so it is the referenced table. The Orders table has a relationship such that it’s GTIN values have to be in the inventory table. Orders is called the referencing table. Don’t be fooled just because a reference looks like a redundancy.

The FOREIGN KEY or REFERENCES clause defines all columns explicitly that belong to the foreign key. The REFERENCES clause specifies the table name with all columns that build the referenced key. The number and the data types of the columns in the FOREIGN KEY clause must match the number and the corresponding data types of columns in the REFERENCES clause (and, of course, both of these must match the number and data types of the columns in the key of the referenced table). Usually, but it is not required, we use the PRIMARY KEY of the referenced table. In fact, you can get really tricky and have a self referencing foreign key constraint! I don’t want to go into details on that, but just keep it in the back of your mind.

Please don’t use the terms “parent” and “child” tables. These terms refer to pointer chain structures that existed in early non-relational databases. In these older databases, such as IMS, Total, IDMS, et al, your relationships (pointer chains) had to form strict hierarchies. The only restriction on references is that eventually the references must resolve back to a base table.

The real power of the foreign key comes with referential actions that can be added to the clause. The DRI (declarative referential integrity) syntax is

The actions change the referencing table when the referenced table has an event. A database event is an insert, update or delete. We ignore insertions for references. The DRI actions that are possible are:

  • NO ACTION – The default action. If a DELETE or UPDATE is executed on referenced rows, you get an error message about a constraint violation. This is what happens with just the REFERENCES by itself, but it makes things complete. If you use the optional constraint clause, your database will be a little better documented.
  • CASCADE – specifies that the referencing column will be updated when the referenced column is updated, and referencing rows will be deleted when the referenced rows are deleted. When a referenced row is changed, this can change another table with DRI actions. The SQL engine has to keep track of all of this stuff because if anything goes wrong, the engine has to do a ROLLBACK. This can be a lot of overhead and get tricky. Try to keep a simple tree structure in the DRI network.
  • SET DEFAULT – The referencing column(s) will be set to their DEFAULT values when an UPDATE or DELETE is performed on referenced rows. Obviously, the referencing column(s) must have DEFAULT clauses.
  • SET NULL – This action specifies that the column will be set to NULL when the referenced column is updated or deleted. Obviously, the referencing column(s) must be NULL-able.

Let’s try a little different example. You’re given three tables. One is for salesmen. The Sales_Teams table models a group of salespersons, who get a T-shirt and a mascot. The Customers table explains itself. We have a rule that each customer is assigned to both a unique sales team and to an individual salesman who is a member on that team. Our third table gives us that relationship. Each Sales team is responsible for every customer that any member of the team is assigned to and not responsible for the other team’s customers.

With this business rule, we can find all the customers for which a sales team is responsible for. In one of two ways. The direct way would be to use the relationship between sales teams to customers. A more indirect way would be to find sales team to the salespersons, which would give us all the sales persons the team then look for all of the customers belong to each team member.

With these table structures, one way to find out which customers have which sales team would be a query like this:

But going through the sales staff is redundant. We’ve already taken care of the relationships of the teams to the employees with various referential constraints. The query can be reduced to simply:

Entire Tables Can Be Redundant

If you’ve ever worked the book business, then you’ll be familiar with the ISBN (International Standard Book Number). This is an industry standard that has been around for a very long time that identifies a book, with an alpha–numeric code. It was developed in the UK and is based on a mod 11 check digit. Before you ask, the check digit could be zero through nine or X, the Roman numeral for 10.

It is a very strong check digit which prevents a lot of errors. However, it doesn’t fit into the industry standards used for retail barcodes, UPC, and GTIN. We need 13 digits, and the first three digits are always 978, which identifies the trade item as a book. These two systems are referred to as ISBN-10 and ISBN-13 respectively.

Fortunately, it’s easy to convert from ISBN-10 to ISBN-13; add the prefix and replace the original check digit with a simple computation for the new check digit. The problem is that not all books have any kind of ISBN, and a lot of older books that were never reprinted were never given an ISBN-13. Books that did have multiple printings could have both types of ISBNs. Which do you store in your inventory table? Or do you store both? Storing both is redundant and not necessary because they can be computed from each other.

However, imagine that you just bought out your competitor and you’re trying to merger two inventories that use radically different stock numbers. This is the situation in the automobile industry. The same part can be sold under different stock numbers, and generally the cheaper line of cars sells the same parts for less than the more expensive line of cars from the same manufacturer. I think this is called “the law of supply and crooks” in economics. There’s a good business in providing a database with equivalent part numbers.

Dr. Codd and COBOL

When Dr. Codd defined the normal forms, COBOL was really the only language that people did any business programming in. Even today, most business applications are still in COBOL underneath the covers for you can’t see it. Even if you’ve never written COBOL, it’s very much affected the way we do commercial programming. Even today, most of the world’s business data processing is in COBOL under the covers; and even when it isn’t, the COBOL mindset affects the design.

First Normal Form (1NF) is supposed to remove repeated groups. In COBOL, there is a OCCURS (n) operation that says make multiple copies of the hierarchical structure to which it is attached. For example, I might create part of a record declaration (this is like the DDL in SQL) and have it occur seven times, once each for the seven days of the week. The number of occurrences does not have to be a fixed, integer; it can be another field the same record which determines how many repetitions there are when the data is read.

The relational model depends on knowing what the structure of the data is, so this approach to data is not desirable. Think about how you have to define a joint operation in this environment.

Technically, this is what Dr. Codd meant by “repeated group”, but it was later extended to include arrays and anything fancier than a simple scalar atomic value.

Conclusion

Please notice that what were concerned about is logical redundancy. Physical redundancy can be very handy in a database. But that’s another topic.