A Tale of Two Tables: Header Rows in ETL

Extract, Transform, and Load (ETL) scenarios are sometimes perceived as mundane and, dare I say it, even boring. Yet I find them fascinating opportunities to apply out-of-the-box thinking. Data rarely presents in just the format expected by SQL*Loader or external table syntax. On a good day, one builds a straightforward external table and job done. Or is that on a bad day? On a good day, on a truly good day, one has the opportunity to approach a load as if a child again playing with LEGO bricks. Taking the various features of Oracle Database, how do you snap them together to obtain a desired result?

The Problem

Recently I received a request for help from a reader seeking a way to load from a data file having one header record, one footer record, and many detail records sandwiched in between. Date and time values from the header record were to be applied to each of the detail rows. Following is an example data file in essentially the same form as the one the reader was dealing with:
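(The records below are an illustration; the bike IDs, descriptions, and quantities are invented. The layout, though, is the scenario in question: a START header record carrying a date and time, comma-delimited detail records, and an END footer record.)

  START,2011-04-02,09:35
  101,Trek 5500,5
  102,Cannondale F29,3
  103,Gary Fisher Rumblefish,1
  END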

The desired result is to insert the data as if the date and time in the header row were included in each of the detail rows. For example:
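(The column names QUANTITY and INVENTORIED here are placeholders of my own choosing for the third detail field and the header timestamp; your names will differ.)

  BIKE_ID  BIKE_DESCRIPTION        QUANTITY  INVENTORIED
  -------  ----------------------  --------  ---------------------
      101  Trek 5500                      5  02-APR-11 09.35.00 AM
      102  Cannondale F29                 3  02-APR-11 09.35.00 AM
      103  Gary Fisher Rumblefish         1  02-APR-11 09.35.00 AM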

I enjoy this type of problem. There’s no built-in feature specifically for this scenario. The challenge of rummaging through the available feature set and snapping disparate pieces together into a well-crafted solution is stimulating. My first attempt at an “obvious” solution was a failure, but that served only to make the problem all the more interesting. The game was afoot!

Solution Approach

Figure 1 shows the solution I eventually settled upon. I’ll walk you through it in this article. It involves two external tables: one matching the header record, and the other matching the detail records. A preprocessing script on the first table ensures only the one header record is returned by any query against that table. With the external tables in place, the load is a perfect example of when PL/SQL can be used to good effect.

Figure 1. Solution flow diagram

Is a SQL*Loader solution possible? The answer is yes. I was able to conjure up not one but two solutions built around SQL*Loader. Those solutions were fun hacks to invent, but I favor the more straightforward approach shown in Figure 1 for a production system.

Step 1: Create the Operating System Directories

First, create the following three operating system directories. The goal here is to keep the data files separate from the log files, and especially to keep control over the scripts directory.
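For example, on Linux (the paths here are arbitrary choices for illustration; substitute whatever fits your environment):

  mkdir -p /home/oracle/etl/data
  mkdir -p /home/oracle/etl/log
  mkdir -p /home/oracle/etl/scripts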

It’s vital that you as the database administrator (DBA) keep tight control over the scripts directory. I can’t emphasize that enough. Preprocessing scripts are executed by the Oracle server software as the Oracle software owner. You can well imagine the risk from allowing just anyone to create and execute a script as user oracle. Keep a tight lid on preprocessing scripts. Review each one.

Step 2: Create the Oracle Directory Objects

Next, create Oracle Database directory objects that map to the three operating system directories. These provide a level of operating-system independence in your ETL solutions. I logged in as SYSDBA to create these objects. For example:
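(The directory object names below, such as etl_data_dir, are of my own choosing, as are the paths; match them to whatever you created in Step 1.)

  CREATE DIRECTORY etl_data_dir AS '/home/oracle/etl/data';
  CREATE DIRECTORY etl_log_dir AS '/home/oracle/etl/log';
  CREATE DIRECTORY etl_scripts_dir AS '/home/oracle/etl/scripts';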

While you’re still logged in as SYSDBA, take the time to grant yourself execute access to the scripts directory. Here’s the command to run:
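(Assuming the hypothetical directory object name from Step 2 and a load user named GENNICK:)

  GRANT EXECUTE ON DIRECTORY etl_scripts_dir TO gennick;

The examples that follow also assume GENNICK holds READ on the data directory object and READ and WRITE on the log directory object.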

The need to grant access on the scripts directory is due to the security impact of preprocessing scripts running as the Oracle software owner. Oracle Database does not by default allow execution of such scripts. You as the DBA must consciously choose to grant the needed access. Keep a tight lid on whom you grant it to.

From here on out, everything can be done while logged in as the user who’ll ultimately be running the load. I logged in as user GENNICK.

Step 3: Write the Preprocessing Script

Preprocessing scripts are new in Oracle Database 11g Release 2. Arup Nanda has written a nice article about them that you may also wish to read. Using the feature, you can completely transform a data file before it is read through an external table.

For our scenario, create a script containing the following command to return just the first record of a file. The file’s name will be passed as the first parameter, indicated by the $1 marker. Name the script bike_inventory_head.sh and place it in your scripts directory.
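A one-line script along the following lines does the job. The full path to head matters: preprocessing scripts execute without your login environment, so it’s safest not to rely on PATH. (The path /usr/bin/head is typical of Linux; adjust for your platform.)

  #!/bin/sh
  /usr/bin/head -1 $1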

Issue a chmod command to make the script executable:
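For example, using the hypothetical scripts path from Step 1:

  chmod u+x /home/oracle/etl/scripts/bike_inventory_head.sh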

Switch to your data directory and test the script as follows:
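For example, again using the hypothetical paths from Step 1:

  cd /home/oracle/etl/data
  ../scripts/bike_inventory_head.sh bike_inventory.dat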

You should get back the very first line from the data file. It’s a simple script. Debug it now if needed. It’s easier to test and debug these scripts from the command line than through an external table operation.

Step 4: Create the Target Table

You’ll want to create a target table if you’re following along by working the example. Create the table bike, as follows.
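Here’s a minimal version. (The bike_id and bike_description columns come from the scenario; quantity and inventoried are placeholder names for the third detail field and the header timestamp.)

  CREATE TABLE bike (
     bike_id           NUMBER,
     bike_description  VARCHAR2(50),
     quantity          NUMBER,
     inventoried       TIMESTAMP
  );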

Table bike is the target. The ETL procedure to follow reads from bike_inventory.dat, and loads the data into bike.

Step 5: Create the First External Table

Now we’re getting to the fun part. Create the following external table in the same schema as bike. The purpose of this first table is to give easy access to the header record. The one important detail to notice is the PREPROCESSOR directive specifying the script created earlier in Step 3.
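Here’s a sketch of the table, using the hypothetical directory object names from Step 2 and treating the header’s date and time as plain text to be converted later:

  CREATE TABLE bike_inventory_header (
     record_type  VARCHAR2(5),
     load_date    VARCHAR2(10),
     load_time    VARCHAR2(8)
  )
  ORGANIZATION EXTERNAL (
     TYPE ORACLE_LOADER
     DEFAULT DIRECTORY etl_data_dir
     ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        PREPROCESSOR etl_scripts_dir:'bike_inventory_head.sh'
        BADFILE etl_log_dir:'bike_inventory_header.bad'
        LOGFILE etl_log_dir:'bike_inventory_header.log'
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
     )
     LOCATION ('bike_inventory.dat')
  );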

Create the table, and then query it as in the following example. You should get one row back. That row should contain the header information.
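  SELECT * FROM bike_inventory_header;

  RECORD_TYPE  LOAD_DATE   LOAD_TIME
  -----------  ----------  ---------
  START        2011-04-02  09:35

(The values shown reflect the illustrative data file; output formatting will vary with your SQL*Plus settings.)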

What happened to all the other records in the input file? The answer is they aren’t there. As far as Oracle Database is concerned, the one row returned by the preprocessor script is all there is. Figure 2 walks you through the process.

Figure 2. Reading the header record.

If you aren’t on Oracle Database 11g Release 2 yet, you can still apply the technique of a preprocessing script. You’ll just need to somehow trigger execution of the script from outside the database, and you’ll need to write the one header row to a temporary file. You’ll have to do more work for the same effect. Oracle’s preprocessing feature saves work and eliminates the need for a temporary file. It’s a nice feature.

Interestingly, I at first missed the need for a preprocessing script. I mistakenly thought that I could simply use a WHERE ROWNUM = 1 predicate to read only the first row from the input file. My reward was an error due to the reject limit being exceeded. For example:
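Querying a header-shaped external table that lacks the PREPROCESSOR clause produces something like the following. (The exact error stack depends on your table definition and release.)

  SELECT * FROM bike_inventory_header WHERE ROWNUM = 1;

  ORA-29913: error in executing ODCIEXTTABLEFETCH callout
  ORA-30653: reject limit reached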

Removing the reject limit by adding REJECT LIMIT UNLIMITED to the end of the CREATE TABLE statement solved one problem only to create another: A look at my log file showed that the database engine was reading all the records from the input file. It offends my sense of elegance to read an entire data file just to get the first record.

Next I tried an OPEN, FETCH, CLOSE sequence using a cursor from PL/SQL. That generated the same results – more records read than I really wanted, errors thrown, my sense of elegance offended. Using the preprocessor script keeps everything neat and tidy. Only one record is read. Simple. Elegant.

Step 6: Create the Second External Table

One preliminary step remains before performing the actual load, and that is to create a second external table. This time, we are creating the external table through which to read the detail records. Following is the code. Notice in particular the LOAD WHEN clause.
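(As before, the directory object names and the quantity column are the hypothetical ones used throughout this article.)

  CREATE TABLE bike_inventory_detail (
     bike_id           NUMBER,
     bike_description  VARCHAR2(50),
     quantity          NUMBER
  )
  ORGANIZATION EXTERNAL (
     TYPE ORACLE_LOADER
     DEFAULT DIRECTORY etl_data_dir
     ACCESS PARAMETERS (
        RECORDS DELIMITED BY NEWLINE
        LOAD WHEN ((1:5) != "START" AND (1:3) != "END")
        BADFILE etl_log_dir:'bike_inventory_detail.bad'
        LOGFILE etl_log_dir:'bike_inventory_detail.log'
        FIELDS TERMINATED BY ','
        MISSING FIELD VALUES ARE NULL
     )
     LOCATION ('bike_inventory.dat')
  );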

The LOAD WHEN clause skips the header and footer records, which begin with “START” and “END” respectively. Query the table, and you’ll see the data from bike_inventory.dat. For example:
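  SELECT * FROM bike_inventory_detail;

  BIKE_ID  BIKE_DESCRIPTION        QUANTITY
  -------  ----------------------  --------
      101  Trek 5500                      5
      102  Cannondale F29                 3
      103  Gary Fisher Rumblefish         1

(Again, the rows reflect the illustrative data file.)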

All the pieces are in place. Now you can run the load.

Step 7: Run the Load

Remember the original requirement? It was to include the date and time from the header record with each new row inserted into the target table. You can implement the logic easily in PL/SQL. At a high level, the steps are simple:

  1. Query for the header record to get the date and time in text form.

  2. Convert the date and time into a value of type TIMESTAMP.

  3. Perform the load by executing an INSERT…SELECT statement.

  4. Include the TIMESTAMP value as one of the selected columns.

Following is a PL/SQL block implementing the above logic. I’ve kept the exception handling simple to focus on the core logic of the block. However, the exception handling is still sufficient for the task. You’ll see that I’ve specifically written a check to verify the assumption that the first record is a header record beginning with START. It’s a good habit to include code to verify assumptions.
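Here’s one way to write the block, assuming the hypothetical table and column names used so far and a YYYY-MM-DD HH24:MI layout for the header’s date and time; adjust the format mask to match your file:

  DECLARE
     header_rec   bike_inventory_header%ROWTYPE;
     load_tstamp  TIMESTAMP;
  BEGIN
     -- Step 1: read the single row exposed by the preprocessing script
     SELECT *
     INTO header_rec
     FROM bike_inventory_header;

     -- Verify the assumption that the record really is a header record
     IF header_rec.record_type <> 'START' THEN
        RAISE_APPLICATION_ERROR(-20001,
           'First record is not a START header record.');
     END IF;

     -- Step 2: convert the date and time text into a TIMESTAMP
     load_tstamp := TO_TIMESTAMP(
        header_rec.load_date || ' ' || header_rec.load_time,
        'YYYY-MM-DD HH24:MI');

     -- Steps 3 and 4: load the detail rows, supplying the timestamp
     INSERT INTO bike (bike_id, bike_description, quantity, inventoried)
     SELECT bike_id, bike_description, quantity, load_tstamp
     FROM bike_inventory_detail;

     COMMIT;
  EXCEPTION
     WHEN NO_DATA_FOUND THEN
        RAISE_APPLICATION_ERROR(-20002,
           'No header record found. Is the data file empty?');
  END;
  /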

Execute this PL/SQL block. Query the bike table. You should have data:
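  SELECT * FROM bike;

  BIKE_ID  BIKE_DESCRIPTION        QUANTITY  INVENTORIED
  -------  ----------------------  --------  ---------------------
      101  Trek 5500                      5  02-APR-11 09.35.00 AM
      102  Cannondale F29                 3  02-APR-11 09.35.00 AM
      103  Gary Fisher Rumblefish         1  02-APR-11 09.35.00 AM

(Timestamp display depends on your NLS settings.)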

It may seem a bit odd at first to have two external tables essentially reading from the same operating-system file. Don’t let that dissuade you. The approach works well in this case. It’s an elegant solution to the original requirement.

Subsequent loads are trivial. Once you have the external tables in place and the preprocessing script written and debugged, a new load is as easy as copying in a new data file and executing the PL/SQL block. If you like, you can convert the block into a stored procedure.

A SQL*Loader Solution

What? You just absolutely must use SQL*Loader?

Trust me. I understand.

Perhaps your site won’t allow preprocessing scripts due to the security issues. Or maybe you need to get the load done today, and you can’t get on your DBA’s priority list until next week. External tables are wonderful, but using them tends to require some amount of DBA support. If you’re not the DBA and you need to get some data loaded, sometimes you just do what you’ve got to do. With that in mind, I present very quickly one possible solution built around SQL*Loader.

Begin by clearing the data you’ve just loaded using the external table solution:
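  TRUNCATE TABLE bike;

(A DELETE followed by a COMMIT works just as well.)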

Then create the following table to hold the header row:
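The table mirrors the record layout of the header row. (The column names, as before, are my own hypothetical choices.)

  CREATE TABLE bike_load_header (
     record_type  VARCHAR2(5),
     load_date    VARCHAR2(10),
     load_time    VARCHAR2(8)
  );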

Now create the following PL/SQL package. The package caches the timestamp from the header record. SQL*Loader will invoke the package’s get_timestamp function for each detail row.
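A minimal version follows, again assuming the YYYY-MM-DD HH24:MI date and time layout used earlier:

  CREATE OR REPLACE PACKAGE bl03 AS
     FUNCTION get_timestamp RETURN TIMESTAMP;
  END bl03;
  /

  CREATE OR REPLACE PACKAGE BODY bl03 AS

     g_timestamp TIMESTAMP;  -- cached for the life of the session

     FUNCTION get_timestamp RETURN TIMESTAMP IS
     BEGIN
        RETURN g_timestamp;
     END get_timestamp;

  BEGIN
     -- Package initialization: runs once per session, upon first use.
     -- It reads the header row that SQL*Loader inserts into
     -- bike_load_header, and caches the converted timestamp.
     SELECT TO_TIMESTAMP(load_date || ' ' || load_time,
                         'YYYY-MM-DD HH24:MI')
     INTO g_timestamp
     FROM bike_load_header;
  END bl03;
  /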

Create the following control file to tie everything together. The WHEN clauses in the control file direct the header record into the bike_load_header table, and all the detail records into the bike table. Name the control file bike_load.ctl.
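Here’s a sketch. (The quantity and inventoried columns remain my hypothetical names; the SQL string in double quotes supplies the fourth column’s value.)

  LOAD DATA
  INFILE 'bike_inventory.dat'

  INTO TABLE bike_load_header
  REPLACE
  WHEN (1:5) = 'START'
  FIELDS TERMINATED BY ','
  (record_type, load_date, load_time)

  INTO TABLE bike
  APPEND
  WHEN (1:5) != 'START' AND (1:3) != 'END'
  FIELDS TERMINATED BY ','
  TRAILING NULLCOLS
  (bike_id POSITION(1),
   bike_description,
   quantity,
   inventoried "bl03.get_timestamp()")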

Finally, execute the load:
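For example, from the data directory (SQL*Loader will prompt for the password; the paths are the hypothetical ones from Step 1):

  cd /home/oracle/etl/data
  sqlldr gennick control=bike_load.ctl log=bike_load.log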

Here’s a short summary of how the load progresses:

  1. SQL*Loader reads the first record, which is the header record.

  2. The header record matches the first WHEN clause in the SQL*Loader control file, so an insert is generated into the bike_load_header table.

  3. The REPLACE option causes SQL*Loader to delete any header data from an earlier load.

  4. The insert completes, saving the header data in bike_load_header.

  5. Subsequent non-header records match the second WHEN clause, generating inserts into table bike.

  6. The value of the fourth column for table bike is derived from a call to the function bl03.get_timestamp.

  7. The first invocation of the function triggers execution of package bl03's initialization code, which queries table bike_load_header. The resulting timestamp is stored in a package variable and persists for the duration of the session.

  8. That timestamp is returned for each invocation of the function, and is thus supplied to each of the detail records in the input file.

  9. The final record is a footer record. The footer record fails both WHEN clauses, so it is essentially ignored.

The use of TRAILING NULLCOLS following the second WHEN clause prevents SQL*Loader from throwing an error because the data records have only three fields, whereas the SQL*Loader control file describes four. Remove the TRAILING NULLCOLS option, and every detail record will be perceived as too short and will throw an error.

The specification POSITION(1) in the second field listing is especially worth noticing. When you’re describing delimited fields, SQL*Loader moves left to right unless you specifically reset the position to the beginning. Without that use of POSITION(1), SQL*Loader would expect bike_id to be the fourth delimited column, bike_description to be the fifth delimited column, and so forth. Since the second field listing is a complete and alternate description of an input record, you must specify POSITION(1) so that SQL*Loader knows to back up and start counting again from the beginning of the record.

Bottom Line

I prefer the external table solution. It’s simple and elegant. It’s a forward-looking solution making use of features that Oracle is driving towards rather than away from. Yet it requires some DBA assistance to put all the pieces in place, and the security considerations surrounding preprocessing scripts might be off-putting in some environments.

If preprocessing scripts aren’t an option – whether from site policy or from running an older release of Oracle Database – then the SQL*Loader solution is a good hack that can get the job done. Or perhaps you’re just more comfortable using SQL*Loader.

An advantage in SQL*Loader’s favor is that it gives the option of loading to a remote database. Imagine, for example, that you’re running an Oracle Database instance through the Amazon Relational Database Service. You’ll have a listener endpoint through which to connect to the database, but you won’t have any server-level access at all. SQL*Loader can get the job done in that scenario.

An advantage of the external table approach is the ability to run a direct-path load. Simply add the APPEND hint, as follows, to the final INSERT statement in the PL/SQL block:
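For example, using the same hypothetical names as in the earlier block:

  INSERT /*+ APPEND */ INTO bike (bike_id, bike_description, quantity, inventoried)
  SELECT bike_id, bike_description, quantity, load_tstamp
  FROM bike_inventory_detail;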

Run the block. You’ll get a direct-path load. You can tell because you’ll need to commit before querying the target table from the same session.

A direct-path load isn’t possible in the SQL*Loader solution. That’s because SQL processing (i.e., the function call to get_timestamp) is involved during the load process.

More solutions are possible than I can show in this article. The SQL*Loader solution can be modified to use a trigger instead of a stored procedure. The external table solution can be rebuilt around a pipelined table function – an approach I plan to cover in a follow-up article. It really is like building with LEGO bricks. Put the pieces together in different ways. Explore your options. Ignore what others think. Choose a solution that works for you given your skill set and current working environment, and the amount and frequency of data to be loaded.

Acknowledgments: Credit and gratitude to Darl Kuhn, Tim Gorman, and Arup Nanda for reviewing an early draft of this article. Any errors I’ve managed to sneak by their careful scrutiny are entirely on my own head. Especial thanks to Vipul Jain for being the reader who set me down the path of exploring some fun solutions to an interesting and real-life problem.