The DIS-Information Principle: A Splitting Headache


In data modeling, we start with a schema that represents our “Universe of Discourse”, which might be a business, for example. Within the schema, we have tables that model sets of entities and relationships. Inside the entity tables, we have columns that model the attributes of the entity. The columns contain scalar values. Tables that model relationships can have attributes, but they must have references to entities in the schema. All the information in the data model has to follow these rules, or an RDBMS does not work. This is known as Dr. Codd’s Information Principle and it is the basis for the relational model.

It sounds pretty simple, neat and clean; but apparently it is harder than it looks. Beginners in any field tend to make the same kinds of errors in their data models, and eventually the errors get names. There are “attribute splitting”, “column overloading” and “EAV” (Entity-Attribute-Value), to name a few Data Definition Language (DDL) errors. The one thing they have in common is confusing entities, attributes and values, and modeling them wrongly.

People do not make random errors; they have a flawed pattern or mindset that leads to the errors. When you learn a new language, you tend to use the syntax and semantics of your old language. Let’s start with an easy one:

Table Splitting

The classic form of table splitting is to take the values of an attribute and make a table for each value. The value is implied in the name of each table. The information principle dictates that each table should be the entire set of one kind of entity. The example that I pull up all the time in order to demonstrate the problem is taking “Personnel” and splitting it into “MalePersonnel” and “FemalePersonnel” tables. The split attribute is the ‘sex code’. It is so extreme that anyone hearing the example can see the problems immediately.

What are the symptoms of this problem?

If you constantly find that you are forced to do UNIONs in queries, or IF-THEN-ELSE logic for INSERT, UPDATE and DELETE operations, you could have a split table, especially if the tables involved have the same structure.
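As a minimal sketch of the symptom, assuming the Personnel split described below and invented column names, every query has to be written twice and glued together:

```sql
-- Hypothetical sketch of the symptom; the column names are invented.
-- The same question has to be asked of both halves of the split table.
SELECT emp_id, emp_name, hire_date
  FROM MalePersonnel
 WHERE hire_date >= '2010-01-01'
UNION ALL
SELECT emp_id, emp_name, hire_date
  FROM FemalePersonnel
 WHERE hire_date >= '2010-01-01';
```

With a single Personnel table, the same query is one SELECT with no UNION at all.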

Split tables evolve to be like amoebas and they keep splitting. Attributes tend to gain, rather than lose, values in their domain. The “MalePersonnel” and “FemalePersonnel” tables seem like the only two choices for a sex code split. But they are not. Corporations and other organizations are “lawful persons” and can show up here. If you hire a temp agency, like Manpower or Kelly Girls, to provide personnel, you will need a third table. If you cannot tell the gender of your new hire “Alex Mack”, you need a gender assignment rule or a fourth table. This is so obviously silly that programmers will not do it. Well, most programmers; never underestimate stupidity.

Look at splits on temporal columns. You will often find a table for each year or month within a year. Why does this look okay? Because that is what we did with magnetic tapes! The old IBM magnetic tape labels used a “YYDDD” format that led to some Y2K problems. If you get on SQL newsgroups, you will find tables that have this or some variant as their names. You will also find tables split on geographical values, organizational units, etc., which can be traced back to a file system of some kind that organized the data this way in the physical storage media (vertical file cabinets, mag tapes, etc.).

This is not the same as partitioning a table. Partitioning is a physical storage management method; the table remains one logical unit of work. The SQL engine maintains the partitioned data and does not allow duplicated or dropped data. You do not have to write extra code to assemble the complete table by hand.
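As a sketch in SQL Server syntax, with invented names, the engine does the yearly split for you while queries still see one logical table:

```sql
-- Hypothetical sketch only; all names are invented.
CREATE PARTITION FUNCTION SalesYearPF (DATE)
AS RANGE RIGHT FOR VALUES ('2009-01-01', '2010-01-01');

CREATE PARTITION SCHEME SalesYearPS
AS PARTITION SalesYearPF ALL TO ([PRIMARY]);

CREATE TABLE Sales
(sale_id INTEGER NOT NULL,
 sale_date DATE NOT NULL,
 sale_amt DECIMAL(12,2) NOT NULL,
 PRIMARY KEY (sale_id, sale_date))
ON SalesYearPS (sale_date);

-- One logical table; the engine routes rows to the right partition.
SELECT SUM(sale_amt) FROM Sales WHERE sale_date >= '2010-01-01';
```

Contrast this with hand-split Sales2009 and Sales2010 tables, where that total would need a UNION and new code every January.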

Column-Splitting

Tables are not the only parts of a schema that can be split; so too can columns. The most obvious example of column-splitting is the use of BIT flags in SQL. Old COBOL and assembly language programmers love to do this because it looks like what they had in punch card file systems before RDBMS. A database is an integrated whole and we can determine a fact inside it with a predicate. Files do not work that way. Each file and each step in a sequential process was independent. One step in a process could only communicate with the next steps with a simple flag passed in the data. For example, when you deleted a record in a mag tape file, the record was not immediately destroyed. The front of each record would have an “is_deleted” flag that was set to tell the next program to skip this record.

Unfortunately, this model of data processing was carried forward into newer technology. Consider the BIT flag column “is_male” in a table. In SQL Server, BIT is a numeric data type and it is therefore NULL-able, but ignore that problem for now. This looks nice, but then you run into your first corporate entity. Is the XYZ Corporation male or female? Well, neither. If you want to get ridiculous, you could try to assign gender to corporate bodies – Playboy Enterprises is male and Mary Kay Cosmetics is female.

The ISO sex codes are {0 = unknown, 1 = male, 2 = female, 9 = lawful persons}; they are standard and they make sense in a business environment. The BIT flag guy has to add another flag, “is_corporate”, and now he needs a CASE expression to handle the (is_male, is_corporate) pair. There are (2^2) = 4 binary combinations, so what do we do? How about (0, 0) = unknown, because all zeros look good; (1, 0) = male, not corporate; (0, 0) = not male (female?), not corporate, a pair that is already taken; (1, 1) = male lawful person and (0, 1) = female lawful person? None of the flag pairs makes sense! Now we need CASE logic to test for a lawful person first, then look at gender only if it is not a corporate unit.
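A minimal sketch of the single-column approach, with an invented table, is one attribute on one scale with a constraint, instead of a pile of flags:

```sql
-- Hypothetical sketch only; the table and column names are invented.
-- One attribute, one scale: the ISO-style sex code.
CREATE TABLE Personnel
(person_id INTEGER NOT NULL PRIMARY KEY,
 person_name VARCHAR(35) NOT NULL,
 sex_code SMALLINT DEFAULT 0 NOT NULL
   CHECK (sex_code IN (0, 1, 2, 9)));
   -- 0 = unknown, 1 = male, 2 = female, 9 = lawful person
```

Every combination the flag pair could not express is now a single, unambiguous code, and the constraint keeps nonsense values out.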

This gets worse as you add more and more flags. As of mid-2010, I am scaling up a very successful auction system that currently has to find the status of an auction by looking at almost a dozen BIT flags. For example, a bid cannot be both rejected and rescinded. The business rules are not in one place. What should have been a state-change model using (prior_status, current_status) pairs to track the status is instead convoluted logic hidden in dozens of programs. [see: http://www.simple-talk.com/content/article.aspx?article=589].
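A minimal sketch of such a state-change model, with invented bid status values, puts the rules in the table instead of in the programs:

```sql
-- Hypothetical sketch only; names and status values are invented.
CREATE TABLE BidStatusChanges
(bid_id INTEGER NOT NULL,
 prior_status VARCHAR(10) NOT NULL
   CHECK (prior_status IN ('placed', 'accepted', 'rejected', 'rescinded')),
 current_status VARCHAR(10) NOT NULL
   CHECK (current_status IN ('placed', 'accepted', 'rejected', 'rescinded')),
 change_date DATE NOT NULL,
 PRIMARY KEY (bid_id, change_date),
 -- a change must actually change something
 CHECK (prior_status <> current_status),
 -- a rejected bid cannot then be rescinded
 CHECK (NOT (prior_status = 'rejected' AND current_status = 'rescinded')));
```

The current status of a bid is simply its most recent row, and every illegal transition is refused once, in one place, by the DDL.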

Function Splitting

The strangest splitting disaster I have seen was done with UDFs (user-defined functions), as part of a reporting system in SQL Server built on data taken off an SAP system. The raw skeleton code for the report procedures looks like this (pardon the “fn_” prefixes, but they were in the original):
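A hypothetical sketch of that pattern, with invented procedure, function and parameter names, might be:

```sql
-- Hypothetical sketch only; every name here is invented.
-- One scalar UDF call per income source, all in a row.
CREATE PROCEDURE IncomeReport @in_report_year INTEGER
AS
SELECT dbo.fn_RentalIncome (@in_report_year) AS rental_income,
       dbo.fn_InterestIncome (@in_report_year) AS interest_income,
       dbo.fn_RoyaltyIncome (@in_report_year) AS royalty_income,
       dbo.fn_ConsultingIncome (@in_report_year) AS consulting_income;
```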

The list of UDF calls returns one source of income per function, as implied by their names. But these functions are all the same except for one line. That skeleton is:
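A hypothetical sketch of one such function, again with invented names, where only the account range differs from its siblings:

```sql
-- Hypothetical sketch only; invented names. Each sibling function is
-- identical except for the BETWEEN range on the account number.
CREATE FUNCTION fn_RentalIncome (@in_report_year INTEGER)
RETURNS DECIMAL(12,2)
AS
BEGIN
RETURN (SELECT SUM(L.amount)
          FROM Ledger AS L
         WHERE L.acct_nbr BETWEEN 4000 AND 4099 -- the one line that changes
           AND L.report_year = @in_report_year);
END;
```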

The account numbers are all the same data type and fall in a range. The whole report could be written as one SELECT statement, something like this skeleton:
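Again with invented names, a consolidated version collapses the account ranges into CASE expressions inside the aggregates:

```sql
-- Hypothetical sketch only; invented names. One pass over the table
-- replaces the whole family of look-alike functions.
SELECT L.report_year,
       SUM(CASE WHEN L.acct_nbr BETWEEN 4000 AND 4099
                THEN L.amount ELSE 0.00 END) AS rental_income,
       SUM(CASE WHEN L.acct_nbr BETWEEN 4100 AND 4199
                THEN L.amount ELSE 0.00 END) AS interest_income,
       SUM(CASE WHEN L.acct_nbr BETWEEN 4200 AND 4299
                THEN L.amount ELSE 0.00 END) AS royalty_income
  FROM Ledger AS L
 GROUP BY L.report_year;
```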

It should be no surprise that the original procedure took over 5 minutes to return 48 rows, while the revised version ran almost immediately. The reason the offender gave for this style of coding was that it made the reports easier to maintain. I am not kidding. I would say that he was trying to use OO programming methods on a table and did not understand declarative programming, so he was faking the methods with UDFs.

Column Overloading

When I have a column in a table, it should be a single attribute measured with a single scale, which explains itself. In English, this means that you do not put your hat size and shoe size in one column and have a second column or rule that tells you which attribute or scale that first number means. You have a hat size column and a separate shoe size column, each with its own scale. This design error is a split in which I have pulled the (value, scale) pair out of a column and put its parts in at least two places.
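A minimal sketch of the error and its correction, with invented names, looks like this:

```sql
-- Hypothetical sketch only; all names are invented.
-- Wrong: one overloaded column, plus a second column that says
-- which attribute and scale the number means.
CREATE TABLE BadMeasurements
(person_id INTEGER NOT NULL,
 measure_value DECIMAL(5,2) NOT NULL,
 measure_type CHAR(4) NOT NULL
   CHECK (measure_type IN ('hat', 'shoe')));

-- Right: one attribute per column, each with its own scale.
CREATE TABLE Measurements
(person_id INTEGER NOT NULL PRIMARY KEY,
 hat_size DECIMAL(5,2),
 shoe_size DECIMAL(5,2));
```

In the first table, a simple constraint like “hat sizes run from 6 to 8” is nearly impossible to state; in the second, it is one CHECK on one column.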

Please note that “scalar” and “atomic” are not the same things. Scalar values are one-dimensional values on a scale of some kind. I am not going to go into scales and measurements here, but you can read about it in my books Thinking in Sets and Standards in SQL where I have a few chapters on these topics.

The term “atomic” means that the data element cannot be broken up without destroying its meaning. Scalar values are automatically atomic, which is very handy. But the reverse does not hold. The pair (longitude, latitude) is two columns, but it is only one data element (location) by this definition. I could also model a location with a single Hierarchical Triangular Mesh (HTM) code or other vendor extensions for geographical data that will fit into a single column in the table. The idea of a data element is more abstract than just a physical unit of storage or column count, like records in files. I can have a virtual data element that I compute as needed and it has no existence at all.

The most common way to overload a column is to make it NULL-able and then allow the NULLs to have multiple meanings. Every data type in SQL has to be NULL-able (this was Dr. Codd's rule 3: systematic treatment of NULL values).
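A minimal sketch of the problem, with an invented example:

```sql
-- Hypothetical sketch only; names are invented.
-- Overloaded: NULL here is made to mean "not applicable" OR
-- "not yet measured" OR "declined to answer" -- three facts in one marker.
CREATE TABLE Employees
(emp_id INTEGER NOT NULL PRIMARY KEY,
 spouse_name VARCHAR(35)); -- NULL: unmarried? unknown? withheld?

-- Better: record why the value is absent, so NULL keeps one meaning.
CREATE TABLE Employees2
(emp_id INTEGER NOT NULL PRIMARY KEY,
 spouse_name VARCHAR(35),  -- NULL now means only "value missing"
 marital_status CHAR(9) DEFAULT 'unknown' NOT NULL
   CHECK (marital_status IN ('unknown', 'single', 'married', 'withheld')));
```

The second design lets a query distinguish the cases with a predicate instead of folklore about what NULL means in this shop.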

note: The Hierarchical Triangular Mesh is a method to subdivide the spherical surface of the globe into triangles of nearly equal shape and size. The HTM gives us a very efficient indexing method for objects localized on the sphere.

Conclusion

Splitting tables and columns is only one family of RDBMS design errors. There are plenty of other disasters awaiting the beginning data modeler. Please remember that this little series of articles is not about bad DML. There are a lot of bad queries in the world, but they can be replaced with good ones; DDL design errors affect everything, and the best query in the world cannot save you.

The topic of EAV (Entity-Attribute-Value) design is both so common and so awful that it needs its own article, with painfully detailed examples.

The use of 1:1 relationships is not well understood. People want to cram all the entities in the relationship into one table. Each part only appears once, doesn't it?

Let me surprise you with the rest of this series.


About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his dry wit. He is also interested in Science Fiction.