The DIS-Information Principle, Part II

Database design simply involves populating a schema with tables that model sets of entities and relationships. A table contains columns that model an entity's attributes and hold scalar values. What could go wrong? Plenty, unfortunately, when these simple principles are misunderstood or flouted, and Joe continues to itemise bad design practices that can cause subsequent grief for application developers.

We’ve covered a series of bad practices in database design that come from the habit of ‘splitting’ tables and columns. This is only one family of RDBMS design errors. There are plenty of other disasters awaiting the data modeler. You can always fix bad queries by replacing them with good ones, but DDL design errors affect everything, and the best query in the world cannot save you.

If I can touch it, it is probably an entity.
If I have to describe it with a verb of being, it is probably a relationship.
If I measure it on a scale, it is probably an attribute.

The principles behind the design of relational databases sound pretty simple, neat and clean; but apparently the task is harder than it looks. Beginners in any field tend to make the same kinds of errors in their data models, and eventually the errors get a name. There is “attribute splitting”, “column overloading” and “EAV” (Entity-Attribute-Value), to name a few Data Definition Language (DDL) errors. The one thing that they all have in common is confusing entities, attributes and values, and modeling them wrongly.


Misplaced Columns

Putting a column in the wrong table sounds almost absurd, but it is common. The most extreme example I have heard of came from an interview question that asked for a skeleton of the classic Sybase/SQL Server bookstore schema. The failed candidate put the price of the book as an attribute of the buyer. Hey, he pays it, doesn’t he?

Nope, the price of a book is clearly an attribute of the book. Duh. It might be modified by the buyer because he has some special status (a buyer attribute), but that is another issue.
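In skeleton form, the price belongs with the book and any price-modifying status belongs with the buyer. The names and sizes here are illustrative, not from the original interview question:

CREATE TABLE Books
(isbn CHAR(13) NOT NULL PRIMARY KEY,
 book_title VARCHAR(50) NOT NULL,
 book_price DECIMAL(8,2) NOT NULL); -- an attribute of the book

CREATE TABLE Buyers
(buyer_id INTEGER NOT NULL PRIMARY KEY,
 buyer_name VARCHAR(35) NOT NULL,
 discount_status CHAR(10) DEFAULT 'none' NOT NULL); -- a buyer attribute that may modify the price he pays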

I will guess that this most often comes from having a 1:1 relationship that gets embedded into one or both tables. If you have a table of husbands (husband_id as its key), then each guy must have a wife. That is what makes him a husband, right? So we add a wife column to the table. Likewise, if you have a table of wives (wife_id as its key), then each gal must have a husband. That is what makes her a wife, right? So we add a husband column to the table. Now there is a cycle where the two tables reference each other's keys.

The problem is a failure to see that a marriage is a relationship and needs its own table. The (husband_id, wife_id) pair is a natural key and they reference the two tables involved. Because a marriage has other attributes (wedding dates, license number, presiding official, etc.), this is easy to see. But if the relationship has no other attributes, it simply gets put in the wrong place.
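A sketch of that relationship table, with illustrative names for the other attributes:

CREATE TABLE Marriages
(husband_id INTEGER NOT NULL
   REFERENCES Husbands (husband_id),
 wife_id INTEGER NOT NULL
   REFERENCES Wives (wife_id),
 wedding_date DATE NOT NULL, -- the extra attributes that
 license_nbr CHAR(10) NOT NULL, -- belong to the relationship
 PRIMARY KEY (husband_id, wife_id)); -- the natural key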

Inconsistent Data Element Modeling

This nightmare is having something mutate from a value, an entity or an attribute into something else within the same schema.

A recent posting on a newsgroup had two tables that looked something like the ones below. I say “something like” because the original poster never bothered to give DDL and thought we could understand his narrative and personal programming language. He also did not know that movies are identified by the ISAN (International Standard Audiovisual Number), the way that a VIN identifies a vehicle. You can Google it (http://www.isan.org/). One example is ISAN 0000-3BAB-9352-0000-G-0000-0000-Q. Yes, it is weird, long and ugly, but it is now required by law in many countries and therefore in common usage in that industry.

His first table was the release dates for five particular movies, thus:
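Here is a guess, reconstructed from his narrative; ‘movies1’ through ‘movies5’ were his placeholder titles:

CREATE TABLE Release_Dates
(release_date DATE NOT NULL PRIMARY KEY,
 movie_title VARCHAR(15) NOT NULL); -- a movie title as an attribute of a date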

Does the idea that a movie title is an attribute of a date seem weird to you? Me too. The second table is also strange. Here is a guess at the DDL he never posted, based on vague sample data.
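The table and column names below are mine; only the domain comes from his sample data:

CREATE TABLE Movie_Locations
(city_name VARCHAR(25) NOT NULL,
 movie VARCHAR(15) NOT NULL
   CHECK (movie IN ('movies1', 'movies2', 'movies3', 'movies4', 'movies5')),
 PRIMARY KEY (city_name, movie)); -- a movie title is now a value in a column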

Did you notice that a movie title has changed from an attribute to a value taken from the domain {‘movies1’, ‘movies2’, ‘movies3’, ‘movies4’, ‘movies5’}? Does that make sense? Nope! The title has to be an entity, an attribute or a value everywhere in the data model.

His challenge was to find out which movies are released on which dates and in which cities. He insisted that he needed both tables because the release dates in the first table had to match up with the locations in the second table in some way. The schema should have been something like this:
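The sizes are illustrative, and I am using the formatted, 33-character form of the ISAN:

CREATE TABLE Movies
(isan CHAR(33) NOT NULL PRIMARY KEY,
 movie_title VARCHAR(50) NOT NULL);

CREATE TABLE Release_Schedule
(isan CHAR(33) NOT NULL
   REFERENCES Movies (isan),
 city_name VARCHAR(25) NOT NULL,
 release_date DATE, -- NULL means the premiere date is not yet known
 PRIMARY KEY (isan, city_name));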

You do not just throw a movie out the door to no place in particular on Tuesday. You might have cities bidding for it or unknown premiere dates, so you make the future time/space attributes NULL-able in the worst case. And that would change my guess at the key for that table. Without better specs, I can only guess.

What he showed was something like a bad calendar or spreadsheet printout. In the actual posting, he was concatenating movie titles with a date in a cell in a spreadsheet!

Schema Splitting

The extreme case is putting data into multiple databases. Before we had federated databases, this was done out of necessity most of the time. It was not always physically or practically possible to put all of the enterprise data into one centralized database. A federated database is similar to a partitioned database in that it has the power to present a unified logical view of the schema from a disjoint physical system.

In a federated database, the system maintains the relationships among the databases and (we hope) keeps the data consistent (i.e., every DB is on UTC, uses the same set of measures, the same data element names, etc.) so that the data belongs to one central data model and not a bunch of local data models that do not match.

EAV (Entity-Attribute-Value) Tables

Entity-Attribute-Value models have the user doing meta-data work in the database. He produces things that belong in the schema information tables, but without any data integrity. The excuse for it is that you can make a database “on the fly” when you have not done any research and do not care about data integrity.

This is an actual example from an old newsgroup posting, with a little clean-up. The table names are very generic and describe meta-data. The first table is simply EAV, and it contains the name of a column (attribute) and the possible values that are allowed in that column.

Since we wanted to allow NULLs, the attribute value column has to be NULL-able. But the key_col column has to be repeated and cannot be a key. We also don’t know anything about data types, because we have no idea what any particular attribute might be until insertion time. Hey, we can cast VARCHAR(n) to any data type we need, so everything (dates, numeric values, etc.) becomes a string. I am violating an EAV design rule by making that column only 50 characters wide; since you might want to store a huge string, you should use the longest possible string for everything. Let’s load this table.
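Here is a guess at the DDL and the load; the particular attribute names and values are my stand-ins for what he posted:

CREATE TABLE EAV
(key_col VARCHAR(50) NOT NULL, -- the attribute name, repeated for each allowed value
 attribute_value VARCHAR(50)); -- NULL-able, and everything is a string

INSERT INTO EAV (key_col, attribute_value)
VALUES ('location', 'home'),
       ('location', 'workplace'),
       ('location', 'public place'),
       ('event', 'physical assault'),
       ('event', 'verbal abuse'),
       ('event', 'stalking'),
       ('event', NULL); -- we wanted to allow NULLs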

Now we discover that this database has to do with domestic violence. Finding out what your database does after you create it is not a good design principle. But do not worry; this table will also hold a snake farm management system, if you want to put one in it!

Now we need some data for this thing. Not particular, specific data, but vague generic data, which goes into a second table called EAV_Data. There is no such thing as a generic key that identifies all things in creation. That is a belief from Kabbalah numerological mysticism, in which God puts a number on everything in Creation. We have the equally mystical IDENTITY table property instead, so EAV tends to use it whenever it needs a key.
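Again a guess; the entity_id column is my assumption for whatever grouped the attribute rows of one incident together:

CREATE TABLE EAV_Data
(row_id INTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY, -- the mystical generic key
 entity_id INTEGER NOT NULL, -- groups the attribute rows of one incident
 key_col VARCHAR(50) NOT NULL,
 attribute_value VARCHAR(50));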

The query we want is a simple count of the events by location. Here is the sample data in a grid:
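The numbers below are made up for illustration, but the shape matches his description: every location crossed with every event, zeros and all.

location       | physical assault | verbal abuse | stalking
home           |                3 |            1 |        0
workplace      |                0 |            0 |        2
public place   |                0 |            0 |        0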

Since this is a lot of zeros, we want another query that drops the zeros out of the results.

Here is an answer from Thomas Coleman.
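(His exact query is not reproduced here; this is a sketch of the pattern any EAV answer is forced into, using my guessed columns above.)

SELECT Locations.attribute_value AS location_name,
       Events.attribute_value AS event_name,
       COUNT(*) AS event_cnt
  FROM (SELECT entity_id, attribute_value
          FROM EAV_Data
         WHERE key_col = 'location') AS Locations -- fake the Locations table
       INNER JOIN
       (SELECT entity_id, attribute_value
          FROM EAV_Data
         WHERE key_col = 'event') AS Events -- fake the Events table
       ON Locations.entity_id = Events.entity_id
 GROUP BY Locations.attribute_value, Events.attribute_value;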

This has subquery expressions nested inside each other to do the simplest task, because each base table in a proper design has to be built on the fly, over and over. Here is the same query in a proper schema with a table for Locations and one for Events.
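Assuming, as my addition, an Incidents table that records each occurrence:

CREATE TABLE Locations
(location_id INTEGER NOT NULL PRIMARY KEY,
 location_name VARCHAR(25) NOT NULL);

CREATE TABLE Events
(event_id INTEGER NOT NULL PRIMARY KEY,
 event_name VARCHAR(25) NOT NULL);

CREATE TABLE Incidents
(incident_id INTEGER NOT NULL PRIMARY KEY,
 location_id INTEGER NOT NULL
   REFERENCES Locations (location_id),
 event_id INTEGER NOT NULL
   REFERENCES Events (event_id),
 incident_date DATE NOT NULL);

SELECT L.location_name, E.event_name, COUNT(*) AS event_cnt
  FROM Incidents AS I
       INNER JOIN Locations AS L
       ON L.location_id = I.location_id
       INNER JOIN Events AS E
       ON E.event_id = I.event_id
 GROUP BY L.location_name, E.event_name;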

The column names are seriously painful. Don’t use reserved words like “key” and “value” for column names. It means that the developer *has* to surround the column name with double quotes everywhere. And they are too vague to be data element names anyway!

     “To be is to be something in particular;
      to be nothing in particular is to be nothing.”
      –Law of Identity

All data integrity is destroyed. Any typo becomes a new attribute or entity. Entities turn up with missing attributes, so all the reports are wrong.

Try to write a single CHECK() constraint that works for all the attributes of those 30+ entities your users created because you were too dumb or too lazy to do your job. It can be done! When I tried it as an exercise, a simple invoice and order system needed a CASE expression with almost 70 WHEN clauses.
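A fragment of what such a constraint looks like; the attribute names and patterns here are hypothetical:

ALTER TABLE EAV_Data
  ADD CONSTRAINT validate_all_attributes
  CHECK (CASE WHEN key_col = 'invoice_nbr'
                   AND attribute_value LIKE '[0-9][0-9][0-9][0-9][0-9]'
              THEN 1
              WHEN key_col = 'order_date'
                   AND attribute_value LIKE '[12][0-9][0-9][0-9]-[0-1][0-9]-[0-3][0-9]'
              THEN 1
              -- ... and so on, with one or more WHEN clauses
              -- for every attribute of every entity ...
              ELSE 0 END = 1);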

Try to write a single DEFAULT clause for 30+ entities crammed into one column. Impossible!

Try to set up DRI actions among the entities. If you thought the WHEN clauses in the single CASE expression were unmaintainable, wait until you see the “TRIGGERs from Hell”. Too bad that they might not fit into older versions of SQL Server, which had some size limits. Now maintain it.

For those who are interested, here are a few links to articles that I found on the net:

A really good horror story about this kind of disaster is here.

Conclusions

It is not that hard to design a good schema. It is mostly common sense and simple questions about the nature of the things in your universe of discourse. If I can touch it, it is probably an entity. If I have to describe it with a verb of being, it is probably a relationship. If I measure it on a scale, it is probably an attribute. Those simple questions will cover 80-85% of your real-world design situations. After that, you have to think about things a little more.