Database Normalization Basics

The task of Database Normalization doesn't have to be painful, especially if you follow Old Mother Celko's Normalization Heuristics.

Have you reached the point of having seen the term “normalization” used in database literature, but you are still unsure as to just what you have to do to get a normalized database? If so, then you’re not alone.

In this article, we will show how Normal Forms preserve data integrity. Some people will argue that it is all right to “denormalize” a schema for efficiency reasons. But if we denormalize the schema, then we are either not enforcing some of the user-required constraints or we have to write additional code to enforce them. If we choose to write additional code, we have to duplicate that functionality in every application, present and future. Normalization reduces programming effort; rules are enforced in one place, one way, one time.

Let’s start in a simple way with some general heuristics about normalization.

Heuristics are inexact and often sound a little funny when you say them. One of my favorites is “if you cannot think of it as a percentage, think of it as a ratio!” (From a mathematical viewpoint, a ratio and a percentage are the same thing, but a percentage has a base while the two things in a ratio are seen as equal in some sense.)

Mother Celko’s Thirteen Normalization Heuristics

  • Does the table model a set of one and only one kind of entity, or one and only one relationship? This is what I call disallowing an “Automobiles, Squids and Lady GaGa” table. Following this rule will prevent multi-valued dependency (MVD) problems, and it is the basis for the other heuristics.
  • Does the entity table make sense? Can you express the idea of the table in a simple collective or plural noun? To be is to be something in particular; to be everything in general or nothing in particular is to be nothing at all (this is known as the Law of Identity in Greek logic). This is why EAV (Entity-Attribute-Value) design does not work – it is everything and anything.
  • Do you have all the attributes that describe the thing in the table? In each row? The most important leg on a three-legged stool is the leg that is missing.
  • Are all the columns scalar? Or is a column serving more than one purpose? Did you actually put hat size and shoe size in one column? Or store a CSV list in it?
  • Do not store computed values, such as (unit_price * order_qty). You can compute these things in VIEWs or computed columns; see the sketch after this list.
  • Does the relationship table make sense? Can you express the idea of the table in a simple sentence, or even better, a name for the relationship? The relationship is “marriage” and not “person_person_legal_thing”.
  • Did you check to see if the relationship is 1:1, 1:m or n:m? Does the relationship have attributes of its own? A marriage has a date and a license number that does not belong to either of the people involved. This is why we don’t mind tables that model 1:1 relationships.
  • Does the entity or relationship have a natural key? If it does, then you absolutely have to model it as the PRIMARY KEY or a UNIQUE constraint. Is there a standard industry identifier for it? Let someone else do all that work for you.
  • If you have a lot of NULL-able columns, the table is probably not normalized.
  • The NULLs could be non-related entities or relationships.
  • Do the NULLs have one and only one meaning in each column?
  • If you have to change more than one row to update, insert or delete a simple fact, then the table is not normalized.
  • Did you confuse attributes, entities and values? Would you split the Personnel table into “Male_Personnel” and “Female_Personnel” by splitting out the sex code? No, sex is an attribute and not an entity. Would you have a column for each shoe size? No, a shoe size is a value and not an attribute.
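To illustrate the heuristic about computed values, here is a minimal sketch, assuming a hypothetical Order_Details table with unit_price and order_qty columns; the extension is computed on demand in a VIEW instead of being stored in a column of its own:

CREATE VIEW Order_Detail_Extensions (order_nbr, sku, line_amt)
AS SELECT order_nbr, sku, unit_price * order_qty
     FROM Order_Details;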

The Writing on the Wall

Many years ago, there was a magazine named Database Programming & Design which published a poster on Normalization by Marc Rettig as a subscription renewal premium. It was so popular that it is still around after all this time. It follows one example, the Daisy Hill Puppy Farm database (this is the birthplace of Snoopy in the Peanuts comic strip).

You really ought to have a copy hanging in your cube. It is that good. It is called ‘5 Rules of Database Normalization’. You can now find it here:
 http://www.databaseanswers.org/downloads/Marc_Rettig_5_Rules_of_Normalization_Poster.pdf

The Relational Model

The Relational Model and the Normal Forms of the Relational Model were first defined by Dr. E. F. Codd and then extended by other writers after him. He borrowed the term “normalized relations” from US and Soviet political jargon back when the USSR still existed.

The theory of relations is a branch of mathematics that deals with mappings among sets, defined by the predicate calculus from formal logic. Just as an algebraic equation can be written in many equivalent forms, there are many forms of the same relational statement, but the “normal forms” of relations are certain formally defined, desirable constructions. The goal of Normal Forms is to avoid certain data anomalies that can occur in un-normalized tables. Data anomalies are easier to explain with examples. When Dr. Codd defined the relational model, he added some additional requirements to the visualization of the relation as a table:

  1. If a row in a table can exist without a certain attribute being known, then this is considered a NULL state for that attribute and the DBMS must use a special symbol for that state. We have that in SQL, but NULLs have a data type to signal the SQL engine how to handle the physical storage.
  2. No two rows are identical. Furthermore, it must be possible to define a subset of the attributes of a relation in which no two rows have identical values. This subset of attributes is known as a “key”. SQL was built on old file systems, and the language allows redundant duplicates, but that is a sign of really bad programming.
  3. One key in a relation is the “primary key”. This key cannot have any NULL attributes, of course. This initial definition was later modified to regard all keys as “equally key-like”, but the notion was set in the early literature and it got into SQL. The actual motive was from sequential files; a sequential tape file has to be sorted on a sort key to work. ISAM files on disk also had to have a sort key. This is why T-SQL has clustered indexes and clustering as the default for the PRIMARY KEY. The new technology mimics the old technology, until it learns better and finds its own voice.
  4. The order of rows and columns is irrelevant and does not contain any information. This leads us to the requirement that any atomic fact in the database (a collection of such tables) can be found by a table name, column names, and a primary key value.

    Again, SQL assumes that it can go to the ANSI/ISO Standard INFORMATION_SCHEMA meta-data tables and get a default list of columns to fill in the column lists in various statements. It is a shorthand.

    Anyone tackling relational databases who is familiar only with file-based data will assume that there is a physical ordering, because that is how all of the files he has ever worked with are stored: physically contiguous tracks on a disk. Nope. Wrong. Columnar databases are not contiguous storage. Neither is a hashed database. The first thing you have to do to learn SQL is throw out the physical model mindset and think in abstractions.

Dr. Codd added quite a few other requirements (initially a total of 12 rules and eventually 333) but those listed here will suffice for our discussion.

Anomalies

If you do not normalize a table, you get data anomalies. These are situations where data integrity cannot be maintained without extra code, NULLs and convoluted programming. Let’s create a simple example. The table holds information about college students, their majors and their departmental advisers.
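Something like this, where the table name, data types and the sample rows are my own inventions for illustration:

CREATE TABLE Students
(student_name CHAR(20) NOT NULL PRIMARY KEY,
 student_major CHAR(20) NOT NULL,
 faculty_adviser CHAR(20) NOT NULL);

INSERT INTO Students VALUES ('Higgins', 'Computer Science', 'Celko');
INSERT INTO Students VALUES ('Jones', 'Computer Science', 'Celko');
INSERT INTO Students VALUES ('Smith', 'English', 'Wilson');

Notice that the fact “Celko advises Computer Science” is already stored twice.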

UPDATE Anomaly

Let’s try to update Higgins to change from Computer Science to Sanskrit and suddenly Celko is advising Sanskrit!
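A sketch of the careless UPDATE against the sample table above:

UPDATE Students
   SET student_major = 'Sanskrit'
 WHERE student_name = 'Higgins';

The statement changes only the major and leaves the adviser column alone.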

The actual erroneous row is (‘Higgins’, ‘Sanskrit’, ‘Celko’).

DELETE Anomaly

Mr. Smith drops out of school and Professor Wilson loses his job!!
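Again sketched against the sample table:

DELETE FROM Students
 WHERE student_name = 'Smith';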

The row ('Smith', 'English', 'Wilson') happens to be the only row that tells us that Professor Wilson is the English department’s faculty adviser. If this sample table and the English department had been bigger, this problem could have been hidden for a long time. But we would have been carrying redundant data.

INSERT Anomaly

Mr. F. L. Wright starts an Architecture department, but we cannot put him into the database without using NULLs or a dummy student name.
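The attempted insert might look like this, sketched against the sample table; the NULL stands in for the student we do not have:

INSERT INTO Students (student_name, student_major, faculty_adviser)
VALUES (NULL, 'Architecture', 'Wright');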

SQL will prevent this row from getting into the table the way I declared this example. It is bad practice to declare an IDENTITY column as the key and make all the other columns NULL-able. Since the table’s IDENTITY property is not a column nor an attribute of any possible data model, it cannot ever be used as a key. It is the count of the physical insertion attempts. You can use it to mimic the record numbers that are so familiar from sequential files, but record numbers have no place in the relational model.

First, Second, Third, and Boyce-Codd Normal Forms

Let’s talk about going from a file system to an RDBMS. Let’s start with a file to maintain data about class schedules at an imaginary school. We keep the course_nbr, section_nbr, time_period, room_nbr, professor_name, student_name, student_major, and course_grade. Assume we started with a Pascal file of records and wanted to move it to a relational database. I picked Pascal because it is easy to read and understand, unlike currently popular languages.

To convert this sequential file of records into a table, we first have to “flatten” it out, making sure that each row has full information for all the columns. Since Pascal, like other procedural languages, does not have NULLs, the columns will be NOT NULL. The first thought is to simply rewrite every field over as a column, a line-for-line translation, as I said earlier. It does not really work, because a column is nothing like a field; in the Pascal file, the student data was itself an array of records whose size was determined by an external variable. This would have been done with an OCCURS clause in COBOL, and other programming languages have similar features.

But let’s do it anyway.
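Here is a sketch of that line-for-line translation; the data types and sizes are my guesses, and I have carried along a room_size column because it shows up later in the discussion:

CREATE TABLE Class_Schedule
(course_nbr CHAR(7) NOT NULL,
 section_nbr INTEGER NOT NULL,
 time_period INTEGER NOT NULL,
 room_nbr INTEGER NOT NULL,
 room_size INTEGER NOT NULL,
 professor_name CHAR(25) NOT NULL,
 student_name CHAR(25) NOT NULL,
 student_major CHAR(10) NOT NULL,
 course_grade CHAR(1) NOT NULL,
 PRIMARY KEY (course_nbr, section_nbr, student_name));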

We declare PRIMARY KEY (course_nbr, section_nbr, student_name) so we have a proper table. But what we are doing is hiding the Students record array, which has not changed its nature by being flattened.

To get this table in 1NF we “decompose” it into two or more tables so that no table has repeating groups, hidden or otherwise.
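A sketch of one such decomposition (again, the data types are my guesses):

CREATE TABLE Classes
(course_nbr CHAR(7) NOT NULL,
 section_nbr INTEGER NOT NULL,
 time_period INTEGER NOT NULL,
 room_nbr INTEGER NOT NULL,
 room_size INTEGER NOT NULL,
 professor_name CHAR(25) NOT NULL,
 PRIMARY KEY (course_nbr, section_nbr));

CREATE TABLE Enrollment
(course_nbr CHAR(7) NOT NULL,
 section_nbr INTEGER NOT NULL,
 student_name CHAR(25) NOT NULL,
 student_major CHAR(10) NOT NULL,
 course_grade CHAR(1) NOT NULL,
 PRIMARY KEY (course_nbr, section_nbr, student_name),
 FOREIGN KEY (course_nbr, section_nbr)
   REFERENCES Classes (course_nbr, section_nbr));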

Since we were told that many students could sign up for a single course we know that the student information was a repeating group. Note that the foreign key declaration allows us to enforce the constraint that a student may only sign up for a valid course and section. Files do not do that; they are raw data without any constraints that get meaning from the host program that reads them.

With the second table structure there are no repeating groups but there are other problems. For example, what if a student changes his/her major? We need to change student_major for every (course_nbr, section_nbr) that student_name is enrolled in.

We now need to introduce the concept of a functional dependency (FD), or determiner, to borrow a British term for the same idea. The notation is “X → Y”, and the arrow is read “X determines Y” in English, where X and Y are sets of attributes that will be used to define columns in tables. A functional dependency means that if I know a value of X, then I can determine a value of Y, just as you would in a mathematical function.

A table is in Second Normal Form (2NF) if it has no “partial key” dependencies. That is, if X and Y are columns or subsets of columns and X is a key, then for any Z which is a proper subset of X, it cannot be the case that Z → Y. In the example, our users tell us that knowing (student_name, course_nbr) is sufficient to determine the section_nbr (since students cannot sign up for more than one section of the same course) and the course_grade. This is the same as saying that (student_name, course_nbr) → (section_nbr, course_grade). After more analysis we also discover from our users that student_name → student_major; in English, a student has only one major. This leads us to the following decomposition:
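A sketch of that decomposition:

CREATE TABLE Students
(student_name CHAR(25) NOT NULL PRIMARY KEY,
 student_major CHAR(10) NOT NULL);

CREATE TABLE Enrollment
(student_name CHAR(25) NOT NULL
   REFERENCES Students (student_name),
 course_nbr CHAR(7) NOT NULL,
 section_nbr INTEGER NOT NULL,
 course_grade CHAR(1) NOT NULL,
 PRIMARY KEY (student_name, course_nbr),
 FOREIGN KEY (course_nbr, section_nbr)
   REFERENCES Classes (course_nbr, section_nbr));

The Classes table stays as it was in the previous step.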

At this point, we are in 2NF. Every attribute depends on the entire key. Now if a student changes majors, it can be done in one place. Furthermore, a student cannot sign up for different sections of the same class, because we changed the key of Enrollment. Unfortunately, we still have problems.

Notice that while room_size depends on the entire key of Classes, it also depends on room_nbr. If the room_nbr is changed for a (course_nbr, section_nbr) we may also have to change the room_size, and if the room itself is modified (we knock down a wall) we may have to change room_size in several rows in Classes for that room_nbr. Obviously, the room size is a property of the room.

Another normal form can address these problems. A table is in Third Normal Form (3NF) if for all X → Y, where X and Y are columns of a table, X is a key or Y is part of a “candidate” key (a key which is not necessarily the primary key). This implies that the table is in 2NF, since a partial key dependency is a type of transitive dependency. To get our example into 3NF we make the following decomposition:
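A sketch of the 3NF decomposition; the room information moves into its own table:

CREATE TABLE Rooms
(room_nbr INTEGER NOT NULL PRIMARY KEY,
 room_size INTEGER NOT NULL);

CREATE TABLE Classes
(course_nbr CHAR(7) NOT NULL,
 section_nbr INTEGER NOT NULL,
 time_period INTEGER NOT NULL,
 room_nbr INTEGER NOT NULL
   REFERENCES Rooms (room_nbr),
 professor_name CHAR(25) NOT NULL,
 PRIMARY KEY (course_nbr, section_nbr));

Students and Enrollment are unchanged from the 2NF version.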

A common misunderstanding about relational theory is that 3NF has no transitive dependencies. As indicated above, if X → Y, then X does not have to be a key if Y is part of a candidate key. We still have a transitive dependency in the example – (room_nbr, time_period) → (course_nbr, section_nbr) – but since the right side of the dependency is a key, it is technically in 3NF. The unreasonable behavior this table structure still has is that several courses can be assigned to the same room_nbr at the same time_period.

Boyce-Codd Normal Form (BCNF)

The normal form that actually removes all transitive dependencies is Boyce-Codd Normal Form, or BCNF. A table is in BCNF if for all X → Y, X is a key; period. We can get to this normal form just by adding another key with a UNIQUE (room_nbr, time_period) constraint clause to the table Classes. There are some other interesting and useful “higher” normal forms, but they are outside the scope of this discussion. In our example, we have removed all of the important anomalies with BCNF.
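A sketch of that extra constraint (the constraint name is mine):

ALTER TABLE Classes
  ADD CONSTRAINT no_double_booking
      UNIQUE (room_nbr, time_period);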

A table can be in 3NF, but still have problems. Let’s declare an abbreviated table for mailing labels.
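A minimal sketch, with column sizes and the choice of key being my own guesses:

CREATE TABLE Addresses
(street_address CHAR(35) NOT NULL,
 city_name CHAR(20) NOT NULL,
 state_code CHAR(2) NOT NULL,
 zip_code CHAR(5) NOT NULL,
 PRIMARY KEY (street_address, city_name, state_code));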

The nature of US mailing addresses and the ZIP code gives us these two functional dependencies, ignoring the fact that many cities have the same name within different states.
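Written in the arrow notation, they are roughly:

(street_address, city_name, state_code) → zip_code
zip_code → (city_name, state_code)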

The first FD can give us a key, but we are not enforcing the second one. Consider this:
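These two rows go into the table above without complaint, even though they assign the same ZIP code to two different cities (the data is invented for illustration):

INSERT INTO Addresses VALUES ('12 Main Street', 'Atlanta', 'GA', '30310');
INSERT INTO Addresses VALUES ('34 Oak Avenue', 'Decatur', 'GA', '30310');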

In this example, we have to add another table to implement the second FD. Notice the overlapping unique constraints:
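A sketch of that table (the name ZipCodes is mine); the PRIMARY KEY enforces the second FD, and the redundant-looking UNIQUE constraint gives a foreign key something to reference:

CREATE TABLE ZipCodes
(city_name CHAR(20) NOT NULL,
 state_code CHAR(2) NOT NULL,
 zip_code CHAR(5) NOT NULL,
 PRIMARY KEY (zip_code),
 UNIQUE (city_name, state_code, zip_code));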

and modify the original table:
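Something like this:

ALTER TABLE Addresses
  ADD FOREIGN KEY (city_name, state_code, zip_code)
      REFERENCES ZipCodes (city_name, state_code, zip_code);

Now an address row can only use a (city_name, state_code, zip_code) combination that actually exists in ZipCodes, and ZipCodes allows each zip_code to appear exactly once.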

This is a bit elaborate, but it is all declarative code. Using overlapping constraints is a good technique that does not get used as often as it should. Remember, it is BCNF, not 3NF, that removes all the transitive dependencies. Let’s go to the next normal form.

Fourth Normal Form

A table in BCNF can still have problems with multi-valued dependencies. Here is another overly simple table.
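A sketch, with names invented to match the discussion that follows:

CREATE TABLE Inventory
(item_name CHAR(20) NOT NULL,
 dept_name CHAR(20) NOT NULL,
 supplier_name CHAR(20) NOT NULL,
 PRIMARY KEY (item_name, dept_name, supplier_name));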

It is all key, so we have a BCNF table. Now let’s enforce the following two multi-valued dependencies.
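In the usual double-arrow notation for MVDs, and using the column names from my sketch:

item_name →→ dept_name
item_name →→ supplier_name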

The first MVD says that an item can be in one or more departments; every department has trash cans. The second multi-valued dependency says that an item can have different suppliers; trash cans come from many suppliers. But there is a problem with this table. Imagine that I get a new shiny red trash can. I cannot do anything with it until I find some department that wants a shiny red trash can. I need to recognize that the MVD constraints on the departments and suppliers are independent of each other.
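Fourth Normal Form (4NF) fixes this by splitting the independent MVDs into separate tables; a sketch:

CREATE TABLE Dept_Items
(item_name CHAR(20) NOT NULL,
 dept_name CHAR(20) NOT NULL,
 PRIMARY KEY (item_name, dept_name));

CREATE TABLE Supplier_Items
(item_name CHAR(20) NOT NULL,
 supplier_name CHAR(20) NOT NULL,
 PRIMARY KEY (item_name, supplier_name));

Now the shiny red trash can and its supplier can be recorded before any department asks for one.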

Fifth Normal Form

You can have an n-ary relationship that cannot be decomposed into lower degree relations. If you have ever had a mortgage or made any major purchase that involved a lender, then this skeleton will look familiar.
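A sketch; the column names are my guesses for the three parties in the deal:

CREATE TABLE House_Sales
(buyer_name CHAR(20) NOT NULL,
 seller_name CHAR(20) NOT NULL,
 lender_name CHAR(20) NOT NULL,
 PRIMARY KEY (buyer_name, seller_name, lender_name));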

Let’s try to split this into three binary relations.
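Sketched with the same column names:

CREATE TABLE Buyer_Seller
(buyer_name CHAR(20) NOT NULL,
 seller_name CHAR(20) NOT NULL,
 PRIMARY KEY (buyer_name, seller_name));

CREATE TABLE Buyer_Lender
(buyer_name CHAR(20) NOT NULL,
 lender_name CHAR(20) NOT NULL,
 PRIMARY KEY (buyer_name, lender_name));

CREATE TABLE Seller_Lender
(seller_name CHAR(20) NOT NULL,
 lender_name CHAR(20) NOT NULL,
 PRIMARY KEY (seller_name, lender_name));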

The trouble is that when you try to assemble the original information by joining pairs of these three tables together, thus:
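Sketched against my table names above:

SELECT BS.buyer_name, BS.seller_name, BL.lender_name
  FROM Buyer_Seller AS BS,
       Buyer_Lender AS BL,
       Seller_Lender AS SL
 WHERE BS.buyer_name = BL.buyer_name
   AND BS.seller_name = SL.seller_name
   AND BL.lender_name = SL.lender_name;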

This will add false rows, such as (‘Smith’, ‘Jones’, ‘Home Bank’). These three tables should have been implemented as VIEWs and not as base tables.

Other Normal Forms and non-NF Redundancies

There are more normal forms, but they involve temporal considerations and other things. There are also redundancies that involve more than one table. The most common non-normal-form redundancy is a summary column in a referenced table. In English, imagine the usual skeleton of an Orders table and its referencing order details table:
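A sketch of the two skeletons (the names and data types are my own):

CREATE TABLE Orders
(order_nbr INTEGER NOT NULL PRIMARY KEY,
 customer_name CHAR(25) NOT NULL,
 order_amt_tot DECIMAL(12,2) NOT NULL);

CREATE TABLE Order_Details
(order_nbr INTEGER NOT NULL
   REFERENCES Orders (order_nbr),
 sku INTEGER NOT NULL,
 unit_price DECIMAL(8,2) NOT NULL,
 order_qty INTEGER NOT NULL,
 PRIMARY KEY (order_nbr, sku));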

where order_amt_tot is the total amount of the order as computed by the query:
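A sketch of that query, against my skeleton tables:

SELECT order_nbr, SUM(unit_price * order_qty) AS order_amt_tot
  FROM Order_Details
 GROUP BY order_nbr;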

People who write an Orders table like this will be updating it constantly, often with procedural code in a TRIGGER or stored procedure.

So, do we have to learn a lot of math just to get a correct schema? Well, yes. But you can use some basic heuristics and feel that you have gotten it right 95% of the time.