The Reference Data Pattern: Extensible and Flexible

Having reference tables in your database is no big deal, right? You just need to tie a code or ID with a description for each reference type. But what if you literally have dozens and dozens of reference tables? Is there an alternative to the one-table-per-type approach? Read on to discover a generic and extensible database design for handling all your reference data.

Having reference tables in your database is no big deal, right? You just need to tie a code or ID with a description for each reference type. But what if you literally have dozens and dozens of reference tables? Is there an alternative to the one-table-per-type approach? Read on to discover a generic and extensible database design for handling all your reference data.

A bird’s-eye view of a logical data model (LDM) for an enterprise system

This unusual-looking diagram is a bird’s-eye view of a logical data model (LDM) containing all the reference types for an enterprise system. It’s from an educational institution, but it could apply to the data model of any kind of organization. The bigger the model, the more reference types you’re likely to uncover.

By reference types I mean reference data, or lookup values, or – if you want to be flash – taxonomies. Typically, the values defined here are used in drop-down lists in your application’s user interface. They may also appear as headings on a report.

This particular data model had about 100 reference types. Let’s zoom in and look at just two of them.

class reference-ldm-detail

From this class diagram, we see that all reference types extend the Root_Reference_Type. In practice, this just means that all our reference types have the same attributes from Alt_Sequence thru to Type_Key inclusive, as shown below.

Attribute Description
Alt_Sequence Used to define an alternative sequence when a non-alphabetic order is required.
Description The description of the type.
Effective_Period Effectively defines whether or not the reference entry is enabled. Once a reference has been used it cannot be deleted due to referential constraints; it can only be disabled.
Pretty_Name The pretty name for the type. This is what the user sees on screen.
Type_Key The unique internal KEY for the type. This is hidden from the user but application developers can make extensive use of this in their SQL.

The type of party here is either an organization or a person. The types of gender are male and female. So these are really simple cases.

The Traditional Reference Table Solution

So how are we going to implement the logical model in the physical world of an actual database?

We could take the view that each reference type will map to its own table. You might refer to this as the more traditional one-table-per-class solution. It’s simple enough, and would look something like this:

The down-side of this is that there could be dozens and dozens of these tables, all having the same columns, all doing very much the same thing.

Furthermore, we may be creating a lot more development work. If a UI for each type is required for administrators to maintain the values, then the amount of work quickly multiplies. There are no hard and fast rules for this – it really depends on your development environment – so you’ll need to talk to your developers to understand what impact this has.

But given that all our reference types have the same attributes, or columns, is there a more generic way of implementing our logical data model? Yes, there is! And it only requires two tables.

The Two-Table Solution

The first discussion I ever had about this subject was back in the mid-90’s, when I was working for a London Market insurance company. Back then, we went straight to physical design and mostly used natural/business keys, not IDs. Where reference data existed, we decided to keep one table per type that was composed of a unique code (the VARCHAR PK) and a description. In point of fact, there were far fewer reference tables then. More often than not, a restricted set of business codes would be used in a column, possibly with a database check constraint defined; there would be no reference table at all.

But the game has moved on since then. This is what a two-table solution might look like:

As you can see this physical data model is very simple. But it’s quite different from the logical model, and not because something has gone all pear-shaped. It’s because a number of things were done as part of physical design.

The reference_type table represents each individual reference class from the LDM. So if you have 20 reference types in your LDM, you’ll have 20 rows of meta-data in the table. The reference_value table contains the permissible values for all the reference types.

At the time of this project, there were some quite lively discussions between developers. Some favored the two-table solution and others preferred the one-table-per-type method.

There are pros and cons for each solution. As you might guess, the developers were mostly concerned with the amount of work the UI would take. Some thought that knocking together an admin UI for each table would be pretty quick. Others thought that building a single admin UI would be more complex but ultimately pay off.

On this particular project, the two-table solution was favored. Let’s look at it in more detail.

The Extensible and Flexible Reference Data Pattern

As your data model evolves over time and new reference types are required, you don’t need to keep making changes to your database for each new reference type. You just need to define new configuration data. To do this, you add a new row to the reference_type table and add its controlled list of permissible values to the reference_value table.

An important concept contained in this solution is that of defining effective periods of time for certain values. For example, your organization may need to capture a new reference_value of ‘Proof of ID’ that will be acceptable at some future date. It is a simple matter of adding that new reference_value with the effective_period_from date correctly set. This can be done in advance. Until that date arrives, the new entry will not appear in the drop-down list of values that users of your application see. This is because your application only displays values that are current, or enabled.

On the other hand, you may need to stop users from using a particular reference_value. In that case, just update it with the effective_period_to date correctly set. When that day passes, the value will no longer appear in the drop-down list. It becomes disabled from that point on. But because it still physically exists as a row in the table, referential integrity is maintained for those tables where it has already been referenced.

Now that we were working on the two-table solution, it became apparent that some additional columns would be useful on the reference_type table. These mostly centered on UI concerns.

For example, pretty_name on the reference_type table was added for use in the UI. It is helpful for large taxonomies to use a window with a search function. Then pretty_name could be used for the title of the window.

On the other hand, if a drop-down list of values is sufficient, pretty_name could be used for the LOV prompt. In a similar way, description could be used in the UI to populate roll-over help.

Taking a look at the type of config or meta-data that goes into these tables will help clarify things a bit.

How to Manage All That

While the example used here is very simple, the reference values for a large project can quickly become quite complex. So it may be advisable to maintain all of this in a spreadsheet. If so, you can use the spreadsheet itself to generate the SQL using string concatenation. This is pasted into scripts, which are executed against the target databases that support the development life-cycle and the production (live) database. This seeds the database with all the necessary reference data.

Here is the config data for the two LDM types, Gender_Type and Party_Type:

There is a row in reference_type for each LDM subtype of Root_Reference_Type. The description in reference_type is taken from the LDM class description. For Gender_Type, this would read “Identifies the gender of a person”. The DML snippets show the differences in descriptions between type and value, which may be used in the UI or in reports.

You will see that reference_type called Gender_Type has been allocated a range of 13000000 to 13999999 for its associated reference_value.ids. In this model, each reference_type is allocated a unique, non-overlapping range of IDs. This is not strictly necessary, but it allows us to group related value IDs together. It kind of mimics what you’d get if you had separate tables. It’s nice to have, but if you don’t think there’s any benefit in this then you can dispense with it.

Another column that was added to the PDM is admin_role. Here’s why.

Who Are The Administrators

Some taxonomies can have values added or removed with little or no impact. This will occur when no programs make use of the values in their logic, or when the type is not interfaced to other systems. In such cases, it is safe for user administrators to keep these up to date.

But in other cases, much more care needs to be exercised. A new reference value may cause unintended consequences to program logic or to downstream systems.

For example, suppose we add the following to the Gender Type taxonomy:

This quickly becomes a problem if we have the following logic built in somewhere:

Clearly, the “if you’re not male you must be female” logic no longer applies in the extended taxonomy.

This is where the admin_role column comes into play. It was born from discussions with the developers on the physical design, and it worked in conjunction with their UI solution. But if the one-table-per-class solution had been chosen, then reference_type would not have existed. The meta-data it contained would have been hard-coded into the application Gender_Type table – , which is neither flexible nor extensible.

Only users with the correct privilege can administer the taxonomy. This is likely to be based on subject matter expertise (SME). On the other hand, some taxonomies may need to be administered by IT to allow for impact analysis, thorough testing, and for any code changes to be harmoniously released in time for the new config. (Whether this is done by change requests or in some other way is up to your organization.)

You may have noted that the audit columns created_by, created_date, updated_by, and updated_date are not referenced at all in the above script. Again, if you’re not interested in these you don’t have to use them. This particular organization had a standard that mandated having audit columns on every table.

Triggers: Keeping Things Consistent

Triggers ensure that these audit columns are consistently updated, no matter what the source of the SQL (scripts, your application, scheduled batch updates, ad-hoc updates, etc.).

Tools in this post

Redgate Data Modeler

Design, update, and manage database schemas

Find out more