A look at algorithms used in RDBMS implementations of DWH systems

When you’re using a data warehouse, some actions get repeated over and over. We will take a look at four common algorithms used to deal with these situations.

Most DWH (data warehouse) systems today are built on an RDBMS (relational database management system) platform. These databases (Oracle, DB2, or Microsoft SQL Server) are widely used, easy to work with, and mature – a very important thing to bear in mind when choosing a platform. Although they include special DWH-oriented features (like bitmap indexes, partitioning, and materialized views), we still have to take care of data transformation ourselves.

In this article, we will examine four algorithms for working with data in RDBMS-based DWH systems. Since much of DWH development revolves around time-sensitive data, most of the algorithms deal with data that changes over time.

All of the techniques we will consider are written in pure SQL, not procedural SQL (like PL/SQL). When writing batch-oriented algorithms, it is always more prudent to write in pure SQL. This is because SQL is processed by an execution optimizer (e.g. Oracle's cost-based optimizer, the CBO), which chooses the execution plan and the order of sub-operations for the query. How the CBO does this is a fascinating subject; perhaps we will explore it a bit more in an upcoming article. For now, let's stick with the algorithms.

We will start with the simple stuff and slowly increase the complexity.

1. Adding Information to a Periodic Snapshot Fact Table

First of all, let's see how to fill a periodic snapshot fact table (one with fixed time intervals) from a table that stores the same information as Type 2 data that is updated over time.

To refresh your knowledge of fact table types and organizations, see my previous article. We will use the same fact tables here.

So our first table is fact_account_snapshot, and our second fact table (with Type 2 technical columns) is fact_account_snapshot_period. They hold the same information, but there's a difference in how they organize the data. The fact_account_snapshot table contains data for every snapshot date, while fact_account_snapshot_period stores data only when it has changed. Both hold account balance information, so in our case we will have a couple of accounts with balances that have changed through time.

To transform the rows and fill the fact_account_snapshot table from the period table, all we need to do is expand the monthly period of the fact_account_snapshot_period table with every snapshot date contained in that period.

Monthly period to snapshot

Picture 1: Monthly period to snapshot

To do this, we need to join fact_account_snapshot_period to a time dimension table which contains the snapshot dates. Let's say that this table is called dim_time. (We should mention that fact_seq represents the ID sequence of the fact_account_snapshot table.) The SQL statement for the transformation is simple:
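
The following sketch assumes that dim_time keeps one row per snapshot date, exposing the surrogate key as time_id and the calendar date as snapshot_date (adjust the names to your model), and that date_from/date_to form a lower-bound-inclusive, upper-bound-exclusive period:

INSERT INTO fact_account_snapshot (id, account_id, balance_amount, time_id)
SELECT fact_seq.NEXTVAL,
       p.account_id,
       p.balance_amount,
       t.time_id
FROM   fact_account_snapshot_period p
JOIN   dim_time t                        -- assumed: one row per snapshot date
  ON   t.snapshot_date >= p.date_from
 AND   t.snapshot_date <  p.date_to;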

Note: If the date_to column is left NULL or set to a high date (such as 31.12.9999) for open periods, you will need to include an NVL or CASE expression to cap date_to at the maximum snapshot date you require.

With this type of organization, we are sacrificing space for speed. The new organization takes up more storage, but it requires less work from the CPU at query time.

2. From Snapshot to Type 2 Organization

Now, if you do mind your tables taking up a lot of space and you have some snapshot fact tables, you can transform them into a Type 2 organization. This is the inverse of the use case we covered in the first algorithm.

To start, we need to transform the data from the snapshot version to the monthly Type 2 version.

Monthly snapshot to period

Picture 2: Monthly snapshot to period

As this algorithm is more complex, we will break it into three sections. Each section corresponds to one of the three nested SELECT blocks which we will tie together later.

So let’s begin.

First, we need a simple data extract of the first ten rows from the fact_account_snapshot table. Here’s the SQL:
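
A query along these lines will do; the FETCH FIRST clause is Oracle 12c (and standard SQL) syntax, and the filter on account_id = 1000 simply keeps the example focused on one account:

SELECT account_id,
       balance_amount,
       time_id
FROM   fact_account_snapshot
WHERE  account_id = 1000
ORDER BY time_id
FETCH FIRST 10 ROWS ONLY;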

The data we get is:

ACCOUNT_ID BALANCE_AMOUNT TIME_ID
1000 0 20010201
1000 0 20010301
1000 0 20010401
1000 9000 20010501
1000 0 20010601
1000 0 20010701
1000 9000 20010801
1000 0 20010901
1000 0 20011001
1000 9000 20011101

On this data set, we will set a flag whenever the balance_amount of the current row differs from that of the previous (lagging) row. The code looks like this:
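
A sketch of this step, using the LAG() analytic function (the third argument defaults the first row in each account to its own balance, so the comparison never runs into a NULL):

SELECT account_id,
       time_id,
       balance_amount,
       LAG(balance_amount, 1, balance_amount)
         OVER (PARTITION BY account_id
               ORDER BY time_id) AS lag_balance_amount,
       CASE
         WHEN balance_amount <> LAG(balance_amount, 1, balance_amount)
                                  OVER (PARTITION BY account_id
                                        ORDER BY time_id)
           THEN 1
         ELSE 0
       END AS flag
FROM   fact_account_snapshot;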

The data we get is:

ACCOUNT_ID TIME_ID BALANCE_AMOUNT LAG_BALANCE_AMOUNT FLAG
1000 20010201 0 0 0
1000 20010301 0 0 0
1000 20010401 0 9000 1
1000 20010501 9000 0 1
1000 20010601 0 0 0
1000 20010701 0 9000 1
1000 20010801 9000 0 1
1000 20010901 0 0 0
1000 20011001 0 9000 1
1000 20011101 9000 10000 1

As we see here, the flag is set to return a “1” when the previous value of balance_amount for the specific account_id is different from the current value.

Next, we construct groups of rows with the same balance_amount values.
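
We can do this with a running sum of the flag over the same window; every time the flag fires, the group number increases by one. A sketch:

SELECT account_id,
       time_id,
       balance_amount,
       SUM(flag) OVER (PARTITION BY account_id
                       ORDER BY time_id) AS flag_group
FROM   (
        -- the flagging query from the previous step
        SELECT account_id,
               time_id,
               balance_amount,
               CASE
                 WHEN balance_amount <> LAG(balance_amount, 1, balance_amount)
                                          OVER (PARTITION BY account_id
                                                ORDER BY time_id)
                   THEN 1
                 ELSE 0
               END AS flag
        FROM   fact_account_snapshot
       );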

The rows we saw earlier now carry a group number, flag_group:

ACCOUNT_ID TIME_ID BALANCE_AMOUNT FLAG_GROUP
1000 20010201 0 0
1000 20010301 0 0
1000 20010401 0 0
1000 20010501 9000 1
1000 20010601 0 2
1000 20010701 0 2
1000 20010801 9000 3
1000 20010901 0 4
1000 20011001 0 4
1000 20011101 9000 5

Hopefully, you can see the solution on the horizon. All that is left for us to do is to reduce the rows by the group number. We can do this reduction with a simple GROUP BY clause:
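
Tying the three SELECT blocks together, the whole transformation might look like the following sketch (the TO_DATE conversion assumes time_id is stored as a YYYYMMDD number):

SELECT account_id,
       balance_amount,
       MIN(TO_DATE(TO_CHAR(time_id), 'YYYYMMDD'))                AS date_from,
       ADD_MONTHS(MAX(TO_DATE(TO_CHAR(time_id), 'YYYYMMDD')), 1) AS date_to
FROM   (
        -- block 2: turn the flags into group numbers with a running sum
        SELECT account_id,
               time_id,
               balance_amount,
               SUM(flag) OVER (PARTITION BY account_id
                               ORDER BY time_id) AS flag_group
        FROM   (
                -- block 1: flag every change of balance_amount
                SELECT account_id,
                       time_id,
                       balance_amount,
                       CASE
                         WHEN balance_amount <>
                              LAG(balance_amount, 1, balance_amount)
                                OVER (PARTITION BY account_id
                                      ORDER BY time_id)
                           THEN 1
                         ELSE 0
                       END AS flag
                FROM   fact_account_snapshot
               )
       )
GROUP BY account_id,
         balance_amount,
         flag_group
ORDER BY account_id,
         date_from;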

Note: We added one month to the calculation of the date_to column so that we get a lower-bound-inclusive, upper-bound-exclusive time period.

ACCOUNT_ID BALANCE_AMOUNT DATE_FROM DATE_TO
1000 0 02/01/01 05/01/01
1000 9000 05/01/01 06/01/01
1000 0 06/01/01 08/01/01
1000 9000 08/01/01 09/01/01
1000 0 09/01/01 11/01/01
1000 9000 11/01/01 12/01/01
1000 10000 12/01/01 01/01/02

The data is now represented with a time period; in our particular case the ten rows are now condensed to seven rows.

3. Refreshing Batch Data in a Type 2 Snapshot Fact Table

The most common algorithm in the DWH world inserts a new delta of data into a Type 2 table. This can be a dimension or a snapshot fact table, as long as it’s of Type 2 organization.

A delta of data is any data that has changed or been added since a certain point in time. The difference is calculated in the ETL process. Specifically, we're searching for new or changed data that needs to be traced in a dimensional table or measured in a fact table.

There are lots of ways we can implement this change. Let’s look at one fun example using analytical functions. For this example, we will return to our old friend, the customer dimension table dim_customer.

Table dim_customer

We will take a ten-row extract from the table:

ID | CUST_NATURAL_KEY | FIRST_NAME | LAST_NAME | ADDRESS_RESIDENCE | CITY_RESIDENCE | COUNTRY_RESIDENCE | DATE_OF_BIRTH | MARITAL_STATUS | GENDER | VALID_FROM | VALID_TO
1 | 30983290067 | Mirko | Vladovic | 5 Brda | Split | Croatia | 09/03/83 | N | M | 01/01/00 | 12/31/99
2 | 123123435455 | John | Smith | 57 Street | New York | USA | 07/03/73 | N | M | 01/01/00 | 12/31/99
3 | 123123435232 | Ivan | Vasiljevic | 7 Plosad | Moscow | Russia | 01/02/93 | N | M | 01/01/00 | 12/31/99
4 | 122999323237 | Helen | Grant | 12 Waterloo road | London | UK | 03/03/53 | N | F | 01/01/00 | 12/31/99
5 | 53566232322 | Helen | Felder | 2 Gasse | Berlin | Germany | 01/03/81 | Y | F | 01/01/00 | 12/31/99
6 | 831232131120 | Sara | Schubert | 3 Gasse | Frankfurt | Germany | 09/03/03 | N | F | 01/01/00 | 12/31/99
7 | 934230842030 | Samuel | Little | 27 Street | San Francisco | USA | 09/30/68 | Y | M | 01/01/00 | 12/31/99
8 | 198763111222 | Jannet | Nolte | 29 Street | New York | USA | 10/21/62 | N | F | 01/01/00 | 12/31/99
9 | 832378872030 | Margareth | Teacher | 2 Street | Boston | USA | 11/01/98 | N | F | 01/01/00 | 12/31/99
10 | 921318422030 | Ivan | Ivanovic | 9 Plosad | Vladivostok | Russia | 12/03/57 | N | M | 01/01/00 | 12/31/99

Now suppose that we get the delta of data in the current data load:

CUST_NATURAL_KEY | FIRST_NAME | LAST_NAME | ADDRESS_RESIDENCE | CITY_RESIDENCE | COUNTRY_RESIDENCE | DATE_OF_BIRTH | MARITAL_STATUS | GENDER
198763111222 | Jannet | Nolte | 27 Street | New York | USA | 10/21/62 | N | F
832378872030 | Margareth | Smith | 2 Street | Boston | USA | 11/01/98 | Y | F
921318422030 | Ivan | Ivanovic | 21 Cirk | Kazan | Russia | 12/03/57 | N | M

In the delta data, we see that Jannet moved to a new address in the same city, Margareth got married and changed her last name, and Ivan moved to a new city.

So we need the table to preserve the historical values up to a set point in time (captured with the valid_from and valid_to fields). In our case, this is the time of the load.

We will break the action into two parts: inserting the new and updating the old. First we insert the new data (if any table constraints hinder this, they must be disabled in the ETL process):
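
A sketch of the insert, assuming the delta arrives in a staging table (called stg_dim_customer_delta here) and that dim_customer_seq generates the surrogate keys; in a real ETL process the load date would be a parameter rather than a literal:

INSERT INTO dim_customer
       (id, cust_natural_key, first_name, last_name, address_residence,
        city_residence, country_residence, date_of_birth, marital_status,
        gender, valid_from, valid_to)
SELECT dim_customer_seq.NEXTVAL,
       d.cust_natural_key,
       d.first_name,
       d.last_name,
       d.address_residence,
       d.city_residence,
       d.country_residence,
       d.date_of_birth,
       d.marital_status,
       d.gender,
       DATE '2016-10-01',   -- the ETL load date (a literal here only for illustration)
       DATE '9999-12-31'    -- high date marking the currently valid version
FROM   stg_dim_customer_delta d;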

The new rows are now contained in the table with the old data:

ID | CUST_NATURAL_KEY | FIRST_NAME | LAST_NAME | ADDRESS_RESIDENCE | CITY_RESIDENCE | MARITAL_STATUS | VALID_FROM | VALID_TO
8 | 198763111222 | Jannet | Nolte | 29 Street | New York | N | 01/01/00 | 12/31/99
11 | 198763111222 | Jannet | Nolte | 27 Street | New York | N | 10/01/16 | 12/31/99
9 | 832378872030 | Margareth | Teacher | 2 Street | Boston | N | 01/01/00 | 12/31/99
12 | 832378872030 | Margareth | Smith | 2 Street | Boston | Y | 10/01/16 | 12/31/99
10 | 921318422030 | Ivan | Ivanovic | 9 Plosad | Vladivostok | N | 01/01/00 | 12/31/99
13 | 921318422030 | Ivan | Ivanovic | 21 Cirk | Kazan | N | 10/01/16 | 12/31/99

Note that each new row (defined by its natural key) has a valid_from value of 10/01/16. This is the ETL load date from which we consider the row valid.

Now we update the old data by closing the relevant records. As you can see, the old rows with the ID values of 8, 9, and 10 need to be closed. We do this by setting a valid_to date that matches the corresponding new row's valid_from date, with help from SQL analytic functions:
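
One way to write this is a single UPDATE with a correlated subquery that uses LEAD() to look up the valid_from of the next version of the same natural key; rows without a newer version simply keep their current valid_to. A sketch:

UPDATE dim_customer dc
SET    dc.valid_to = (
         SELECT NVL(nv.next_valid_from, dc.valid_to)
         FROM  (SELECT id,
                       LEAD(valid_from) OVER (PARTITION BY cust_natural_key
                                              ORDER BY valid_from) AS next_valid_from
                FROM   dim_customer) nv
         WHERE  nv.id = dc.id
       );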

Note: The above algorithm updates the entire dim_customer table. It is sometimes wise to filter the changed rows in a separate table or use a MERGE statement.

After the update statement, this data is returned:

ID | CUST_NATURAL_KEY | FIRST_NAME | LAST_NAME | ADDRESS_RESIDENCE | CITY_RESIDENCE | MARITAL_STATUS | VALID_FROM | VALID_TO
8 | 198763111222 | Jannet | Nolte | 29 Street | New York | N | 01/01/00 | 10/01/16
11 | 198763111222 | Jannet | Nolte | 27 Street | New York | N | 10/01/16 | 12/31/99
9 | 832378872030 | Margareth | Teacher | 2 Street | Boston | N | 01/01/00 | 10/01/16
12 | 832378872030 | Margareth | Smith | 2 Street | Boston | Y | 10/01/16 | 12/31/99
10 | 921318422030 | Ivan | Ivanovic | 9 Plosad | Vladivostok | N | 01/01/00 | 10/01/16
13 | 921318422030 | Ivan | Ivanovic | 21 Cirk | Kazan | N | 10/01/16 | 12/31/99

There are other ways to implement the closing of Type 2 dimensions in data warehousing. Why not do a little experimentation and see what you can come up with on your own?

4. Fact Table Period Merges

One of the interesting problems we encounter in data warehousing is the period merge. This is needed when an event has multiple time periods and we need to construct a single covering time period.

Let’s explain this using a simple example. Say a client has a couple of loan accounts and he is late with some of the payments. If we want to find out the time period in which the client is late, we need a period merge.

Simple merge

Picture 3: Simple merge

So if we look at the picture above, we see that the period in which the client was late with his payments was between January 1 and October 1.

Of course, a client can have many accounts that have a complex interrelationship. The following picture shows this:

Picture 4: Complex merge

Note: The period of December 1, 2016 to February 1, 2017 is not merged because it does not intersect with any other periods.

Next, we will look at a brief model of the fact table we will use to demonstrate the SQL needed for a period merge:
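
A minimal sketch of that table (the real model would also carry foreign keys to the customer and account dimensions):

CREATE TABLE f_default_periods (
  id              NUMBER   PRIMARY KEY,
  dim_customer_id NUMBER   NOT NULL,
  account_num     NUMBER   NOT NULL,
  asset_level     CHAR(1)  NOT NULL,   -- 'D' marks a period in which the account is in default
  valid_from      DATE     NOT NULL,
  valid_to        DATE     NOT NULL
);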

The data from f_default_periods is shown here:

ID DIM_CUSTOMER_ID ACCOUNT_NUM ASSET_LEVEL VALID_FROM VALID_TO
1 9000 1000 D 01/01/16 08/01/16
2 9000 1001 D 03/01/16 10/01/16
3 9000 1002 D 02/01/16 06/01/16
4 9000 1003 D 07/01/16 11/01/16
5 9000 1004 D 12/01/16 02/01/17

To clarify, we will partition the SQL code into five steps. Step one, shown below, applies a simple LAG function:
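
A sketch of step one; each customer's periods are ordered by valid_from, and the third LAG() argument defaults the first row to its own valid_to so we never compare against a NULL:

SELECT dim_customer_id,
       valid_from,
       valid_to,
       LAG(valid_to, 1, valid_to) OVER (PARTITION BY dim_customer_id
                                        ORDER BY valid_from, valid_to) AS l_valid_to
FROM   f_default_periods
WHERE  asset_level = 'D';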

We are only looking at periods in which the account is in default. We identify these periods using the clause WHERE asset_level = 'D'.

And we get this data set:

DIM_CUSTOMER_ID VALID_FROM VALID_TO L_VALID_TO
9000 02/01/16 06/01/16 06/01/16
9000 01/01/16 08/01/16 06/01/16
9000 03/01/16 10/01/16 08/01/16
9000 07/01/16 11/01/16 10/01/16
9000 12/01/16 02/01/17 11/01/16

In part two, we use a running max of the l_valid_to data:
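
Building on step one (repeated here inside a WITH clause so the statement stays runnable on its own), a sketch:

WITH step1_lag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         LAG(valid_to, 1, valid_to) OVER (PARTITION BY dim_customer_id
                                          ORDER BY valid_from, valid_to) AS l_valid_to
  FROM   f_default_periods
  WHERE  asset_level = 'D'
)
SELECT dim_customer_id,
       valid_from,
       valid_to,
       MAX(l_valid_to) OVER (PARTITION BY dim_customer_id
                             ORDER BY valid_from, valid_to
                             ROWS UNBOUNDED PRECEDING) AS running_max
FROM   step1_lag;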

We are laying the groundwork to identify any gaps in our time period.

DIM_CUSTOMER_ID VALID_FROM VALID_TO RUNNING_MAX
9000 02/01/16 06/01/16 06/01/16
9000 01/01/16 08/01/16 06/01/16
9000 03/01/16 10/01/16 08/01/16
9000 07/01/16 11/01/16 10/01/16
9000 12/01/16 02/01/17 11/01/16

If we see that the running_max value is lower than the current valid_from value, we know that a gap exists and one merged group has been completed. From there, we can apply the same flag-and-group technique we used in the "snapshot to Type 2" algorithm.

In part three, we flag the groups:
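
The flag is a simple CASE comparison of running_max against valid_from, stacked on top of the first two steps:

WITH step1_lag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         LAG(valid_to, 1, valid_to) OVER (PARTITION BY dim_customer_id
                                          ORDER BY valid_from, valid_to) AS l_valid_to
  FROM   f_default_periods
  WHERE  asset_level = 'D'
), step2_run_max AS (
  SELECT dim_customer_id, valid_from, valid_to,
         MAX(l_valid_to) OVER (PARTITION BY dim_customer_id
                               ORDER BY valid_from, valid_to
                               ROWS UNBOUNDED PRECEDING) AS running_max
  FROM   step1_lag
)
SELECT dim_customer_id,
       valid_from,
       valid_to,
       running_max,
       CASE WHEN running_max < valid_from THEN 1 ELSE 0 END AS flag_gap
FROM   step2_run_max;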

The data in this step is:

DIM_CUSTOMER_ID VALID_FROM VALID_TO RUNNING_MAX FLAG_GAP
9000 02/01/16 06/01/16 06/01/16 0
9000 01/01/16 08/01/16 06/01/16 0
9000 03/01/16 10/01/16 08/01/16 0
9000 07/01/16 11/01/16 10/01/16 0
9000 12/01/16 02/01/17 11/01/16 1

In part four, we apply a running sum to the flag_gap column, just as in the previous algorithm:
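
Again as a sketch, with one more layer added to the chain:

WITH step1_lag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         LAG(valid_to, 1, valid_to) OVER (PARTITION BY dim_customer_id
                                          ORDER BY valid_from, valid_to) AS l_valid_to
  FROM   f_default_periods
  WHERE  asset_level = 'D'
), step2_run_max AS (
  SELECT dim_customer_id, valid_from, valid_to,
         MAX(l_valid_to) OVER (PARTITION BY dim_customer_id
                               ORDER BY valid_from, valid_to
                               ROWS UNBOUNDED PRECEDING) AS running_max
  FROM   step1_lag
), step3_flag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         CASE WHEN running_max < valid_from THEN 1 ELSE 0 END AS flag_gap
  FROM   step2_run_max
)
SELECT dim_customer_id,
       valid_from,
       valid_to,
       SUM(flag_gap) OVER (PARTITION BY dim_customer_id
                           ORDER BY valid_from, valid_to) AS run_flag_gap
FROM   step3_flag;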

In this step, the run_flag_gap values are identical to the flag_gap values from the previous step because we only have two merge groups:

DIM_CUSTOMER_ID VALID_FROM VALID_TO RUN_FLAG_GAP
9000 02/01/16 06/01/16 0
9000 01/01/16 08/01/16 0
9000 03/01/16 10/01/16 0
9000 07/01/16 11/01/16 0
9000 12/01/16 02/01/17 1

In the fifth and final step, we simply aggregate the data and get the time periods of the merged groups:
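
The complete period-merge statement, with all five steps chained together, might look like this:

WITH step1_lag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         LAG(valid_to, 1, valid_to) OVER (PARTITION BY dim_customer_id
                                          ORDER BY valid_from, valid_to) AS l_valid_to
  FROM   f_default_periods
  WHERE  asset_level = 'D'
), step2_run_max AS (
  SELECT dim_customer_id, valid_from, valid_to,
         MAX(l_valid_to) OVER (PARTITION BY dim_customer_id
                               ORDER BY valid_from, valid_to
                               ROWS UNBOUNDED PRECEDING) AS running_max
  FROM   step1_lag
), step3_flag AS (
  SELECT dim_customer_id, valid_from, valid_to,
         CASE WHEN running_max < valid_from THEN 1 ELSE 0 END AS flag_gap
  FROM   step2_run_max
), step4_groups AS (
  SELECT dim_customer_id, valid_from, valid_to,
         SUM(flag_gap) OVER (PARTITION BY dim_customer_id
                             ORDER BY valid_from, valid_to) AS run_flag_gap
  FROM   step3_flag
)
SELECT dim_customer_id,
       MIN(valid_from) AS valid_from,
       MAX(valid_to)   AS valid_to
FROM   step4_groups
GROUP BY dim_customer_id,
         run_flag_gap
ORDER BY dim_customer_id,
         MIN(valid_from);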

The final result data shows our two merged periods:

DIM_CUSTOMER_ID VALID_FROM VALID_TO
9000 01/01/16 11/01/16
9000 12/01/16 02/01/17

This algorithm was a little bit tricky! But I am sure that once you master it, it will give you the foundation to come up with something exciting on your own.

Build Your Own SQL Algorithms

In this post, we saw some common SQL algorithms used to work with batch data. We started with something simple and built up to more complex statements. As you observed with the snapshot to Type 2 and period merge use cases, SQL algorithms can and often do build on each other. I encourage you to create something on your own, and in pure SQL. The language has all the strength – not to mention beauty – to meet any data manipulation need. And when you do come up with something, share it in the comments section below!
