State Transition Constraints

Data Validation in a database is a lot more complex than seeing if a string parameter really is an integer. A commercial world is full of complex rules for sequences of procedures, of fixed or variable lifespans, Warranties, commercial offers and bids. All this requires considerable subtlety to prevent bad data getting in, and if it does, locating and fixing the problem. Joe Celko shows how useful a State transition graph can be, and how essential it can become with the time aspect added.

I did an article called Constraint Yourself! back in October 2008 on how to use DDL constraints to assure data integrity. One of the topics in that piece was a look at state transition constraints via an auxiliary table. I did not go into much detail since this was one of several topics I was covering. I got the feedback that I needed in order to to cover this technique in more depth, so here I am.

1157-BornMarriedDivorcedDead.jpg

I introduced the topic of transition constraints to show how such constraints could be modeled as a state-transition diagram in order to enforce the rules when an entity can be updated only in certain ways.  There is an initial state, flow lines that show what are the next legal states, and one or more termination states. The original example was a simple state change diagram of possible marital states which looked like this:

This state transition diagram was deliberately simplified, but it is good enough to explain principles. To keep the discussion as simple as possible, my table is for only one person’s marital status over his life. Here is a skeleton DDL with the needed FOREIGN KEY reference to valid state changes and the date that the current state started.

What is not shown on it are which nodes are initial states (in this case “Born”) and which are terminal or final states (in this case “Dead”, a very terminal state of being). A terminal node can be the current state of a middle node, but not a prior state. Likewise, an initial node can be the prior state of a middle node, but not the current state. I did not write any CHECK() constraints for those conditions. It is easy enough to write a quick query with an EXISTS() predicate to do this and I will leave that as an exercise for the reader. Let’s load the diagram into an auxiliary table with some more constraints.

An aspect of this problem that I did not consider in the first article is the time dimension. We want to see a temporal path from an initial state to a terminal state. State changes do not happen all at once, but are spread over time. An acorn becomes an oak tree before it becomes lumber and finally my chest of drawers. The acorn does not jump immediately to being a chest of drawers. Some of the changes are controlled by time. I cannot get married immediately after being born, but have to wait to be of legal age. A business offer can expire in a set number of days. You can fill in any number of examples of your own.

For a production system, you would need a more complete set of temporal columns to guarantee that we have no gaps in the history, but this will do for now. We now need a stored procedure to add data to the MyLife table. Here is one solution which is deliberately broken into clear steps for clarity.

The first block of code locates the most recent state of my life, based on the date. The second block of code will insert an initial state if the table is empty. This is a safety feature but there probably ought to be a separate procedure to create the set of initial states. The new state has to be an actual change, so there is a block of code to be sure. The changes have to move forward in time. Finally, we build a row using the most recent state as the new previous state, the input change state and the date. If the state change is illegal, the FOREIGN KEY is violated and we get an error.

If you had other business rules, you could also add them to the code in the same way. You should have noticed that if someone makes changes directly to the MyLife Table, they are pretty much free to screw the data. It is a good ideas to have a procedure that checks to see that MyLife is in order. Let’s load the table with bad data:

This poor guy popped into existence without being properly born , committed bigamy and died twice. And you think your life is tough! Here is a simple validation procedure to catch those errors.

The CTE numbers the steps of the temporal path from an initial node to a middle or terminal node. This chain has to be unbroken which means going from step (n) to step(n+1) has to be a legal change in the StateChanges table. This chain can have only one initial node, so let’s check for that next. Finally, the chain is either still in progress or it has reached a single terminal node.

A little note on the programming technique used. The union of separate queries to do one validation at a time can often be made faster by combining some of the queries. But there are trade-offs; this code is easy to read and to maintain and (hopefully) it will not be run often. It is also hard to get error messages from a single statement. Look back at the ChangeState() procedure; the two IF and RAISERROR() blocks of code could have been converted into CASE expressions that will generate NULLs, folded into the INSERT INTO statement and cause the insertion to fail.

This is not easy to read or to get error messages that tell you if the @in_change_date is invalid in that it violates the time sequence.

The Temporal Side of Changes

What is still missing is the temporal aspect of state changes. In this example, the (‘Born’, ‘Married’) change would have to deal with the minimum age of consent. The (Married’, ‘Divorced’) change often has a legal waiting period. While technically a business rule, you know that no human being has lived over 150 years, so a gap that size is a data error. The terminal and initial states are instantaneous, however. Let’s add more flesh to the skeleton table:

To make up some data, let’s assume that the age of consent is 18 (12 months * 18 years = 216), that you have to wait 3 months into your marriage before getting a divorce, and that you have to be divorced 2 months before you can re-marry. Of course, you can die instantly.

The first question is where to check for temporal violations; during insertion or with validation procedures? My answer is both. Whenever possible, do not knowingly put bad data into a schema so this should be done in the ChangeState() procedure. But someone or something will subvert the schema and you have to be able to find and repair the damage.

Here is a procedure which will tell you what state change in the chain has an improper duration and what the disagreement is.

Inserting a new life change is not a simple matter of putting a (previous_state, current_state, start_date) row into the table. To do it right, you can put conditions into the INSERT INTO statement to cause errors when there is bad data.

A slightly different model will keep a (start_date, expiry_date) pair in the history table. In the case of the MyLife example, the durations were minimums for certain changes. You can get married when you are older than 18 years of age and probably should. But a lot of commercial situations have a fixed lifespan. Warranties, commercial offers and bids expire in a known number of days. This means adding another column to the StateChanges table that tells the insertion program if the expiration date is optional (shown with a NULL) or mandatory (computed from the duration).

Here is some skeleton DDL for a bid application to explain this better.

The DDL has a bid number as the primary key and a new column for the expiration date. Obviously the bid has to exist for a while, so add a constraint to keep the date order right.

The DDL for the state changes gets a new column to tell us if the duration is optional or mandatory. The insertion procedure is a bit trickier. The VALUES clause has more power than most programmers use. The list can be more than just constants or simple scalar variables. But using CASE expressions lets you avoid if-then-else procedural logic in the procedure body.

All it needs is the bid number and what state you want to use. If you don’t give me a previous state, I assume that this is an initial row and repeat the current state you just gave me. If you don’t give me a start date, I assume you want the current date. If you don’t give me an expiration date, I construct one from the State Changes table with a scalar subquery. Here is the skeleton DDL for an insertion procedure.

There are other tricks to assure that there are no gaps between the rows using DDL, but that it another article.