Simple SQL: Attribute Splitting

If the design of a relational database is wrong, no amount of clever DML SQL will make it work well. Dr. Codd’s Information Principle is that you have, inside the entity tables, the columns that model the attributes of that entity. The columns contain scalar values. Tables that model relationships can have attributes, but they must have references to entities in the schema. You split those attributes at your peril. Joe Celko explains the basics.

If you make the mistake of splitting an atomic data value in some way in the design of a database, you can make life very difficult for anyone doing the subsequent DML. This is one of the reasons why I tell people that 85% to 95% of the work in RDBMS is done in the DDL, not the DML. This is also why we worry about normal forms, and have to work with things that are in at least First Normal Form (1NF). We prefer to get it up to at least Third Normal Form (3NF). Higher normal forms avoid certain data anomalies by their very structure, but these higher normal forms are founded on 1NF. The problem is that too few of the people designing databases are really that clear on what exactly 1NF means.

The usual definition that people learn is that 1NF means there are no “repeating groups” in the columns of the table. This means that SQL does not use arrays or lists of any kind, and has no pointer structures or anything fancy. In particular, we want to have data that can accept theta operators. ‘Theta’ is a fancy term for comparison operators, or other things that either succeed or fail with simple rules.

A slightly better definition is that all the columns in your table are “scalar values”, which is almost the same thing. So what’s a scalar, you may be wondering? These are values that are measured on a scale of some kind. I’ve done articles and presentations on scales and measurements, and the design of encoding schemes. You can catch a presentation from a Postgres conference on these topics.

But today I’d like to go to a more elementary topic. What is the difference between a scalar value and an atomic value? A scalar value is measured on a scale, obviously. But “atomic” comes from the Greek “atomos”, which means without parts (i.e., indivisible). This means that if you pull the encoding of the column apart, you lose some or all of the meaning of the attribute. Let’s make this a little more exact. The ISO tire size is based on the width, the diameter, and the material of a tire. If you extract any one of those three fields from the encoding, you don’t have a tire anymore. This makes sense; you don’t see a diameter rolling around by itself; it has to be the diameter of something in particular, a tire. I can use some string manipulations on the ISO tire codes to pull out everything that has “SR” in its code, if I want to group steel radial tires. But I have destroyed the individual entity in the aggregation.
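As a sketch, assuming a hypothetical Tires table whose tire_code column carries the full ISO code:

```sql
-- Hypothetical table; tire_code holds the whole ISO encoding
SELECT tire_code
  FROM Tires
 WHERE tire_code LIKE '%SR%'; -- pull out the steel radials
```

The query groups the steel radials, but each row in the result is no longer being treated as a whole tire; we have reached inside the atomic value.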

This is also another little piece of SQL lore; the term “field” in ANSI/ISO standard terms refers to these parts of atomic values. The term “field” no longer means part of the physical record which holds our data, like it did in file systems or navigational databases. The reason we went to the trouble of defining columns is that these are much more than just the old physical fields pulled out of physical records. A column can have constraints; it can have a default; it can be referenced by another table in the schema and has to have a strong data type. A field has none of these properties; it gets its meaning from the host program that reads the data.

This means the same field can have completely different meanings in different programs. You can find an old person and ask them about hanging magnetic tapes and reading them into the wrong program. My favorite was some medical records that began with the patient’s identifier in the first few columns of each tape record. Somebody mounted a tape of patient records and read it into an imaging program, which also began with the patient ID in the first few columns. The imaging program was quite happy to take the patient data and turn it into totally meaningless output, at great expense and confusion.

Now don’t think that something that is atomic has to be put in a single column. Consider (longitude, latitude) pairs. This pair locates a single point on the surface of the earth, and you cannot get more atomic than a single geographic point. The test for atomicity is whether removing part of it destroys the semantics. Clearly, having only longitude or only latitude destroys the purpose of giving you a location.

Split by Values within a Column

Most people would not even consider wanting to pair a single column scaled in degrees together with another column that identifies whether those degrees are a longitude or a latitude value. This would require you to keep two rows for every location. However, it’s very common to see a table for a sign-in sheet that looks like this:
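The original table isn’t reproduced here, but a sketch of the usual (bad) design, with names of my own invention, has one timestamp column and a flag column to tell you which half of the visit each row is:

```sql
-- The split design: one row per event, and an extra column
-- to say whether the timestamp is a check-in or a check-out
CREATE TABLE Signin_Sheet
(emp_id CHAR(10) NOT NULL,
 event_time DATETIME2 NOT NULL,
 event_type CHAR(3) NOT NULL
   CHECK (event_type IN ('in', 'out')),
 PRIMARY KEY (emp_id, event_time));
```

Every visit now takes two rows, and pairing up the ‘in’ and ‘out’ rows is the query writer’s problem.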

The check-in and check-out timestamps are a temporal equivalent of the (longitude, latitude) pair, but it’s hard to see that. Perhaps the problem is that people aren’t aware that ANSI/ISO standard SQL has interval temporal data types; they are not yet part of Transact-SQL. The ISO standard temporal model consists of half-open intervals. The starting point is known, but the endpoint is open (for non-math majors, that means the endpoint is never reached, but only approached as a limit). This is why there is no 24:00:00 Hrs on the clock; it actually becomes 00:00:00 Hrs of the next day. The advantage of half-open intervals is that they can be concatenated, subtracted and generally are easier to manipulate than closed intervals.

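A corrected sketch (again with my own names) puts the pair of timestamps in one row, modeling the half-open interval [checkin_time, checkout_time):

```sql
-- The whole visit is one row; checkout_time stays NULL
-- while the person is still signed in
CREATE TABLE Signin_Sheet
(emp_id CHAR(10) NOT NULL,
 checkin_time DATETIME2 NOT NULL,
 checkout_time DATETIME2, -- NULL means still signed in
 PRIMARY KEY (emp_id, checkin_time),
 CHECK (checkin_time < checkout_time));
```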
Now the attribute is brought together, rather than being spread over several rows in the table. As an aside, notice that we can add a CHECK constraint to prevent the check-in time from occurring after the check-out time. If you didn’t know this already, when the check-out time is NULL, the constraint is still satisfied: a CHECK constraint passes when its search condition evaluates to UNKNOWN, which is not how the same search condition behaves in DML, where UNKNOWN rejects the row.

I covered an extreme form of this in an article, “A Bit of a Problem”, a few years ago. In the extreme form, all the values are given their own bit flag column, usually named “is_<value>”, as if we were still writing assembly language. Nobody took the time to construct a correct encoding scheme. For example, you might have “is_holiday”, “is_workday”, “is_company_holiday” and so forth. A better approach would have been to have a column in the calendar table for the date type, which might be hierarchical so that all of the holiday types are easy to find, and easier to extend without having to do ALTER TABLE statements.
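A sketch of the two approaches, with column names and codes of my own invention:

```sql
-- Bit-flag version: one assembly-language-style flag per value
CREATE TABLE Calendar_Flags
(cal_date DATE NOT NULL PRIMARY KEY,
 is_holiday BIT NOT NULL,
 is_workday BIT NOT NULL,
 is_company_holiday BIT NOT NULL);

-- Encoded version: one date_type column with a hierarchical
-- code, say all holiday codes begin with 'H'
CREATE TABLE Calendar
(cal_date DATE NOT NULL PRIMARY KEY,
 date_type CHAR(3) NOT NULL
   CHECK (date_type IN ('WRK',    -- workday
                        'HNA',    -- national holiday
                        'HCO'))); -- company holiday

-- All the holiday types, without touching the DDL
SELECT cal_date FROM Calendar WHERE date_type LIKE 'H%';
```

Adding a new date type to the encoded version means editing one CHECK constraint, not adding another flag column.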

While it won’t affect your queries, it’s always a good idea, when you have a multicolumn atomic value, to put the columns in the same order wherever you use them. If you’re lucky, this is fairly natural; we’re used to saying longitude and latitude in that order, and we feel that the check-in time should be physically positioned before the check-out time.

Split Rows into Tables

In 1994, David McGoveran published an article in Database Programming and Design (volume 7 issue 7) entitled “A New Database Design Principle” which was later reprinted in “Relational Database Writings 1991 – 1994” by Chris Date, ISBN 0-201-82459-0. His first example was two tables, one for pairs of couples that love each other and one for pairs that hate each other.
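The original tables aren’t reproduced here, but a minimal sketch of such a pair (names are my own) might be:

```sql
-- McGoveran's split: one table per relationship value
CREATE TABLE Lovers
(person_1 CHAR(10) NOT NULL,
 person_2 CHAR(10) NOT NULL,
 PRIMARY KEY (person_1, person_2));

CREATE TABLE Haters
(person_1 CHAR(10) NOT NULL,
 person_2 CHAR(10) NOT NULL,
 PRIMARY KEY (person_1, person_2));
```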

Presumably, the two tables are disjoint and have no rows in common. Unfortunately, we do not have a way to easily enforce this constraint. We wind up writing triggers or, if your SQL has it, a CREATE ASSERTION statement.

One of the giveaways with this kind of split is when the tables are isomorphic and we’ve split a relationship status or category attribute out as tables. In McGoveran’s example, I’m going to call it a relationship status and put it in a corrected table:
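A sketch of the corrected table, assuming the same made-up names as before:

```sql
-- One table; the former table names become column values
CREATE TABLE Couples
(person_1 CHAR(10) NOT NULL,
 person_2 CHAR(10) NOT NULL,
 relationship_status CHAR(4) NOT NULL
   CHECK (relationship_status IN ('love', 'hate')),
 PRIMARY KEY (person_1, person_2));
```

The primary key guarantees a couple appears exactly once, so they cannot be both lovers and haters.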

Now there’s no danger that the same couple can be both lovers and haters. Furthermore, if we want to add another relationship status, I simply go down to the CHECK constraint and put in ‘like’ or whatever other value I’m going to add to my model. In the original schema, each of those new values would become a separate table, with more complicated triggers or assertions in the schema.

A more common form of this design error comes from mimicking in SQL what we used to do with magnetic tapes in the old days. We would take an input tape, run it through various programs, and split it out into other tapes based on dates (sales moved to one tape per month), locations (sales moved to one tape per sales district), or categories of some sort. Each of these individual tapes could then be passed on to other programs for reports, and we could get the benefit of parallelism.

I would like to point out that tape files are not very good for sharing, random-order searches, or doing summaries if they’re not sorted correctly. Given the physical media we had, we had to split things out. That was about 50 years ago. In SQL, with today’s disk storage (magnetic and solid-state), there’s no need to do this sort of thing. Yet you constantly see people moving things from table to table, as if they were tapes. The only improvement is that they can do a UNION instead of a slow tape merge.

Split Tables into Tables

This error can be a little more subtle. Remember, the goal of a database (any database, not just SQL) is to remove redundancy, ensure data integrity and present a consistent view of the data. Now imagine you want to figure out the distances between the stores in your company. One way to do it would be to set up a table like this:
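A sketch of the first method, with names of my own invention, carries each store’s coordinates so distances can be computed as needed:

```sql
-- Each store carries its own coordinates; distances are
-- computed with the spherical trigonometry formula as needed
CREATE TABLE Stores
(store_id CHAR(10) NOT NULL PRIMARY KEY,
 store_longitude DECIMAL(9,6) NOT NULL
   CHECK (store_longitude BETWEEN -180.0 AND 180.0),
 store_latitude DECIMAL(8,6) NOT NULL
   CHECK (store_latitude BETWEEN -90.0 AND 90.0));
```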

The spherical trigonometry distance formula would give us a straight-line distance between the stores. It’s easy to compute and it’s very often good enough for government work. But another way to do this would be to look at all pairs of stores and build a lookup table of all possible pairs and the travel distance between them.
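A sketch of the second method, assuming store identifiers like those in the first, stores every ordered pair and its measured travel distance:

```sql
-- Every ordered pair of stores and the actual travel distance;
-- (A, B) and (B, A) can legitimately differ
CREATE TABLE Store_Distances
(origin_store CHAR(10) NOT NULL,
 destination_store CHAR(10) NOT NULL,
 travel_distance_km DECIMAL(8,1) NOT NULL,
 PRIMARY KEY (origin_store, destination_store));
```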

The first method gives us some extra information, namely, longitude and latitude. This method is also relatively small, which might be a consideration if you have an awful lot of stores. The second method gives us a bigger table, but the actual travel distances. It’s also possible that going from store A to store B is not the same travel time as going from store B to store A.

The decision as to which method you pick depends on the business you are modelling, but my point is that if I have two methods of doing the same task, then one of them is probably redundant. You have split what should have been one table (your choice of which) into multiple separate tables. This is also dangerous, because you’ve just become like a man who wears two wristwatches. If they agree then fine; if they don’t, then you’re never quite sure which is right and you spend an awful lot of time trying to keep them in sync.


Attributes are much like atoms. You can get into a lot of trouble when you split them. Take the time to look at your schemas, and ask yourself if you have a split somewhere. Is each column actually an attribute of the entities being modeled by the table? Is it contained within the table, either in a single column or in an inseparable subset of columns (which should be next to each other, in a logical order, to make the schema easier to maintain)?

Further reading

  1. The DIS-Information Principle: A Splitting Headache
  2. The DIS-Information Principle, Part II