Mimicking Magnetic Tape in SQL

The sequential nature of early data storage devices such as punched card and magnetic tape once forced programmers to devise algorithms that made the best of sequential access. These ways of doing data-processing have become so entrenched that they are still used in modern relational database systems. There is now a better way, as Joe explains.

1343-img172.gif

I keep telling people that they are writing magnetic tape and punch card programs in SQL. They reply that they do not know what a punch card is, and have never seen a magnetic tape drive. Therefore, they believe that their SQL is just fine.

Let’s ‘Wikipedia’ and ‘Google up’ pictures of punch cards and tape drives, so the kids can be grateful for what they have today. The physical media is not what is important; the consequences of the physical media are.

When you sit in a live theater, you cannot do a close-up, pan shot, zoom, dissolve or other effect that are common in movies today. Early silent films parked the camera in one position and mimicked a theater experience. This is a general systems principle that the new technology will first mimic the previous technology before it finds its own voice.

1343-img176.gif

Look at a deck of punch cards or a reel of tape. Their records are in a sequential file structure, necessitated by the physical media. Random access in a deck of punch cards is impossible; random access in tape is impractical. Whenever you see a table with an IDENTITY property, the programmer is mimicking that sequential physical ordering and not doing RDBMS modeling.

This means that sorted order is fundamental in sequential files. It also means that we process things one record at a time; we have innate concepts of first, last, current, prior and next records. The world is “left to right” and there is no higher level abstractions. Fields are all fixed length strings that are read by an application program to get their meaning – i.e. no data types, defaults or constraints in the data itself.

Another innate property of punch cards and tapes is that you can concatenate contiguous fields to create a new field. COBOL, the classic language for this file structure, has hierarchical sub-fields when you define your records in the DATA DIVISION of a program. I know most Microsoft programmers do not know or even read COBOL, so let me give a simple address of a US mailing address.

We have a field called “Address” at the highest level. It is a string in contiguous storage. We can access two sub-fields, named “address-line-1” and “address-line-2”, the first one is a 40 string of alpha characters. The second sub-field is made of sub-sub-fields that can also be accessed by name; FILLER is a special token that means it is ignored or replaced by a constant.

When you see "(CAST (terminal_nbr AS CHAR(5)) + CAST (transaction_seq AS CHAR(8))) AS sale_id” you know that they are still doing COBOL in SQL.

There was an excellent example of this mindset on the MS SQL Server forum recently. It was a table valued function which returned what is called a delta report. That is, a comparison of the change in a set of variables from one report period to another, usually annual. In English, these reports answer questions like “What is the change in sales for this year as compared to last year?”

The poster was having performance problems. This was no surprise. I am not going to post the full example; I do not need everything to make my points and a reduced model will serve. Here is my slimmed-down Sales table.

What we want to do is look at the sales in 2011 and 2010, and see how much things changed. The real table had a lot of other values, was comparing customers from this year against last year.

The first thing an old tape file programmer would do is draw a flowchart. Again, I am not sure that younger programmers have ever see a flowchart, but here is one.

The circle with a tab sticking out of it is a reel of tape. The triangle is a sequential merge operation. The rectangles are programs. The rectangle with a wavy bottom is a printout. The arrows show the flow of control and/or data. Got it?

1343-img168.gifHere is how it works:

  1. Mount the Master Sales on a tape drive. Assume it is sorted by (store_id, sales date) rather than (sales_date, store_id); this is important– very important, and will talk about it shortly.
  2. We have a program that read the master and extracts the 20110 data to a scratch tape, mounted on a second tape drive. We then rewind the master tape and signal the operto4r that we are ready or the second scratch tape.
  3. The operator then mounts a second scratch tape and the second process which tape drive that tape is on.
  4. The operator tells the merge process where the first tape, second tape and final scratch (merge) tapes are mounted.
  5.  The third scratch tape is read by a process that sums each pair of 2011-2010 values based on the dates.
  6.  A final process does a grand total by store_id and makes a printout

Here is an anorexic re-write of the poster’s SQL function. Again, the original code was much more complicated, was implemented as a table-valued function, had other design flaws and so forth. The parameters were a date range pair in the original code.

The CTE “This_Year” is the first scratch tape. The CTE ” Last_Year” is the second scratch tape. It is a direct translation of the flowchart from 1950’s into SQL. The tape merge process is translated directly into the FULL OUTER JOIN and it becomes a CTE. The main SELECT then returns the final process from the flowchart.

Yes, the CASE expressions should be COALESCE() expressions, but that is how it was done in the original because the programmer is still stuck in IF-THEN logic mindset. This lets him write SQL that looks as much like his procedural language as possible.

Do you see the magnetic tape mindset? Now let’s fix it. SQL is a data language, not a computational language, not a processing language. The first thing is that reports are done for fixed known periods in production work. It is suicide to allow date ranges as parameters because users will get “creative” and you cannot be sure they were creative in the same way. That is for ad hoc queries.

Getting back to the issue of sort order as a factor in the logic. Because the data in the Master Sales is sorted by (store_id, sales_date) we had to make two passes through the tape to get both years.

This assumes we have only two tape drives. If we had 3 or more tape drives, then we could have split out the data for 2011 and 2010 in one pass over the Master tape. In fact, you need three tape drives to do any real work. There was a horror store in a recent issue of COMPUTERWORLD’s “Shark Tank” column. The shop the writer worked in had only two old slow tape drives. They asked the boss for three newer tape drives. Instead the boss bought them two new tape drives that were twice as fast – 2 times 2 = 4 times faster, right?

So, first thing we do is create a table with our reporting periods; I am just doing the (2011-2010) delta.

Sneaky trick here. The two rows in the table model the two years in the 2011 report period. I happen to like the MySQL convention of using “yyyy-00-00” for a character-representation of a whole year; it makes sorting easier and it is language independent. Notice the prior year has a “report_delta” of plus or minus one. Instead of conditional logic, I use data in a data language.

Thinking in aggregates instead of sequences, we see that

Now, here is the replacement procedure, with a SQL mindset.

That is the whole thing! One SELECT, no CTEs and no danger of a bad report range. See why I spend so much time beating up people about tables are not files, rows are not records and columns are not fields; it really matters.