SQL Style Habits: Attack of the Skeuomorphs


I love the word “skeuomorphism” because it sounds so weird and has a lot of Scrabble points. The dictionary definition is “a derivative object that retains ornamental design cues from structures that were necessary in the original”, and the term is usually applied to physical design. For example, something was made in wood, but now we make it in plastic. Rather than use the properties of plastic, we keep the colors and woodgrain textures of the original wood.

However, things tend to find their own voice over time and evolve into a new creature. The first cellphone ring tones mimicked the traditional desk telephone bell; later you got to pick from a list and use the tones as a signal of the caller’s identity. “Chopin’s Funeral March” for your boss, something romantic for a lover and so forth. The problem is that the old skeuomorph too often hangs around as a vestigial organ. It is not just physical things, like the watch pocket on Levi’s jeans (http://www.levisguide.com/watchpocket/watchpocket.html). It includes pieces of your mindset that you have picked up, often without knowing what they are or why they exist. I now have to tell an old joke:

A young woman was preparing a ham by cutting off the protruding ham bone before putting it in the oven. Her friend asked her, “Why did you cut off the bone?”

And she replied, “I really don’t know but my mother always did, so I thought you were supposed to.”

Later when talking to her mother she asked her why she cut off the bone before baking it, and her mother replied, “I really don’t know, but that’s the way my mom always did it.”

A few weeks later while visiting her grandmother, the young woman asked, “Grandma, why is it that you cut off the ham bone before you bake it?”

Her grandmother replied, “When I was first married, we lived in a small apartment with a tiny stove. The oven was so small you had to cut off the ham bone to get it to fit and cook it a long time.”

Mainframes & Tapes

This rambling vocabulary lesson leads us to our “first small apartment” – mainframe computers and the FORTRAN I and COBOL programming languages. Most readers have never even seen a mainframe.


Fig 1: IBM System 360 Model 65 administered to by Mark 1 geeks.

Originally, this hardware used magnetic tape drives and punch cards. If you want to look at EBCDIC, ASCII, Fielddata, TTY and the several other competing binary encoding schemes that were in use in those days, get a copy of Coded Character Sets: History and Development by Charles E. Mackenzie (ISBN 0-201-14460-3). The important thing about a magnetic tape is that it is a sequential-access-only file structure, with records made of physically contiguous fields.

In order to know where the read/write head on the tape drive is located on the tape, the system has to count the physical records that have been read. The first tape drives mimicked punch card readers in the way they read in one direction only. Old T-SQL programmers will remember the early Sybase implementations whose cursors could read forward only.

When the read/write head is placed at the start of a record, it increments a counter and looks at a flag. The flag tells the system whether the record is active or not. Think about the alternatives. You could write over the record with zeroes, but that takes time. You could physically move the active records down toward the start of the tape, but that takes insanely more time. Tapes are not random-access media.

This is where the concept of a Master file or tape comes in. We have one tape that holds the current data. On a regular schedule, we take other tapes with new data, sort them in the same order as the master and merge them into a new Master. The old Master and the transaction tapes are then archived.

Tape is cheap, but it is bulky. Managing the tapes required a tape library program and human operators to mount and dismount the tapes from drives. The reels of tapes had physical labels, and the most common labeling system was based on the ISO-8601 ordinal date. But in those days we only used the year-within-century for the first two digits; this was part of the “Y2K problem” that scared us back then, but that is another story.


Fig 2: ‘Since punch cards have 80 physical columns in them, these records also had 80 columns.’

There were also header records on the tape. Vendors had various header record formats, but there was a standard (http://www.ecma-international.org/publications/files/ECMA-ST/Ecma-013.pdf). Since punch cards had 80 physical columns, these records also had 80 columns. They gave the size of the actual data records, how they were blocked, and other things to do with how to read the data off the tape. Today, we would call this meta data. But besides the physical layout descriptors, there were things like “creation date” and data source information.

Now go to any SQL forum and text-search the postings. You will find thousands of postings with DDL that includes columns named “createdby”, “createddate”, “modifiedby” and “modifieddate”, with that particular meta data on the end of the row declaration. It is the old mag tape header label written in a new language! Déjà vu!

The header records appeared only once on a tape, but these meta data values appear over and over on every row in the table. One of the main reasons for using databases (not just SQL) was to remove redundancy from the data; this just adds more redundancy. Now think: what happens to the audit trail when a row is deleted? What happens to it when a row is updated? The trail is destroyed. The audit data should be separated from the schema. Would you put the log file on the same disk drive as the database? Would an accountant let the same person approve and receive a payment?
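As a sketch (the table and column names here are purely illustrative), compare the tape-label habit with a separate audit table:

CREATE TABLE Orders
(order_id INTEGER NOT NULL PRIMARY KEY,
 order_amt DECIMAL(12,2) NOT NULL,
 created_by VARCHAR(30) NOT NULL,  -- the tape header label, row by row
 created_date DATETIME2(0) NOT NULL,
 modified_by VARCHAR(30),
 modified_date DATETIME2(0));

-- the audit trail lives apart from the data it describes, so a DELETE
-- or UPDATE on Orders cannot destroy it
CREATE TABLE Order_Audit
(order_id INTEGER NOT NULL,
 audit_action CHAR(6) NOT NULL
   CHECK (audit_action IN ('INSERT', 'UPDATE', 'DELETE')),
 audit_user VARCHAR(30) NOT NULL,
 audit_date DATETIME2(0) NOT NULL);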

Today, we have tiered architectures. We have front-end layers, presentation layers, a report tool, analysis & statistics, audit tools and other such things which reside inside their own fiefdoms (often their own servers), with well-defined interfaces among them.

But when you work with monolithic programs, there are no layers. Everything has to be in the file. This leads to a lot of redundancy and puts a burden on the programmer, who has to keep everything in sync in application code. In RDBMS, the schema is the unit of work. Tables in the same schema can reference each other via DRI. If I change a customer address, all the tables (present and future) will cascade the change, because they reference the same row in the referenced table. The principle is “one fact, one time, one way, one place”, as part of normalization.
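A minimal sketch of the principle, with hypothetical tables; the address is stored once, and every referencing row follows it:

CREATE TABLE Customers
(customer_id INTEGER NOT NULL PRIMARY KEY,
 customer_addr VARCHAR(100) NOT NULL);  -- one fact, one place

CREATE TABLE Orders
(order_id INTEGER NOT NULL PRIMARY KEY,
 customer_id INTEGER NOT NULL
   REFERENCES Customers (customer_id)
   ON UPDATE CASCADE);  -- a key change cascades; the address itself is never repeated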

The PRIMARY KEY is another skeuomorph from tape files. Obviously you can sort a tape in one, and only one, order. This fact was so ingrained in our mindset that when we built the first SQLs on top of sequential file systems, we preserved the old familiar model. Logically, all keys have equal “key-ness” to them (i.e. the NOT NULL and UNIQUE properties), so there is no reason to mark one of them as special. Except that it feels good to us.

Dr. Codd later corrected this in his relational models, but the damage was done in SQL. In particular, in SQL Server, the PRIMARY KEY defaults to a CLUSTERED index to mimic the original tape files.
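In T-SQL you can see (and decline) the skeuomorph right in the DDL; a sketch:

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY NONCLUSTERED,  -- opt out of the tape-sort default
 ssn CHAR(9) NOT NULL UNIQUE);  -- logically, every bit as much a key as emp_id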

This “mag tape mindset” was also the reason that GROUP BY was done by sorting to create the groupings, so that an ORDER BY clause was not needed in early T-SQL programs. Nobody had thought about building groupings with hashing on parallel processors.
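This is also why you cannot depend on grouped results arriving in sorted order today. If you want an order, you have to ask for it (illustrative names again):

SELECT dept_id, COUNT(*) AS emp_cnt
  FROM Personnel
 GROUP BY dept_id
 ORDER BY dept_id;  -- a hashing, parallel engine promises no order without this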

FORTRAN I

One of the first programming languages was FORTRAN (short for “Formula Translator”), which was designed by John Backus and first appeared in 1957 on IBM hardware. It was a simple procedural language with fewer than fifty statements and only two data types – INTEGER and FLOAT. There were no data declarations in the language at this point. Instead, the first time the compiler read a variable name, it determined the data type from the first letter. Names that began with I through N were INTEGERs (get it?) and all the others were floating point.

It is also worth noting that FORTRAN I allowed names of at most six characters, in uppercase letters only. No, really. I recently saw a posting on a SQL forum that still followed those limits.

The early versions of BASIC used $ for strings. And you probably have noticed that Sybase T-SQL uses @, #, @@, and ## for the same reason. It is now too late to change it, thanks to the “code museum” effect of legacy code.

The use of an initial letter or prefix was carried over from early operating systems that also used one-pass compilers. A file was referenced by a stylized name with a number (MT0, MT1, MT2, …) that identified the physical device, not the data on the device. Every time you see a FROM clause with one-letter aliases dealt out in alphabetical order, the programmer has mimicked the physical tape or disk drives that held files instead of tables.
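The habit is easy to spot. Here is a sketch of it, followed by aliases that name the data instead of the drives:

-- tape-drive thinking: one-letter aliases dealt out in alphabetical order
SELECT a.emp_name, b.dept_name
  FROM Personnel AS a, Departments AS b
 WHERE a.dept_id = b.dept_id;

-- aliases that abbreviate the tables, not the devices they were mounted on
SELECT P.emp_name, D.dept_name
  FROM Personnel AS P, Departments AS D
 WHERE P.dept_id = D.dept_id;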


Fig 3: ‘One day perhaps we’ll persist as skeuomorphs’

Those SQL programmers who code on auto-pilot keep this skeuomorph by using data types as prefixes, and carry the design flaw further. They tibble! They put an affix like “tbl-”, “-table” or “tb-” in a table name. Think how silly this is! SQL has one, and only one, data structure: the table. This is like prefixing every word with “word-” in your sentences.

But it does not end there. They will add data type affixes to column names. They prefix VIEWs with “vw_” (this is called “Volkswagen-ing”) and use “udf-” or “fn-” prefixes on user-defined functions. This last one is special, because the FN- prefix was required by FORTRAN (and later by BASIC) for in-line functions.

The final ghost of FORTRAN is in how we had to implement a function. Stealing a simple example in the spirit of the great Hugo Kornelis (the function below is a sketch, with illustrative names):

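-- declare a local variable, load it, then RETURN it
CREATE FUNCTION dbo.Triangle_Area (@base_len FLOAT, @height FLOAT)
RETURNS FLOAT
AS
BEGIN
DECLARE @result FLOAT;
SET @result = (@base_len * @height) / 2.0;
RETURN @result;
END;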
Did you notice the local variable @result? Why not just write:

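CREATE FUNCTION dbo.Triangle_Area (@base_len FLOAT, @height FLOAT)
RETURNS FLOAT
AS
BEGIN
RETURN (@base_len * @height) / 2.0;  -- no register-style local variable
END;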
and not bother to create and load a local variable? Other languages have always allowed this, and in particular declarative languages really like doing it. The answer is a skeuomorphism from IBM hardware. Early machines had registers, accumulators and other specialized hardware. The result of a function had to come back via a register in the first FORTRAN compilers because of hardware limits.

COBOL

Today, we need to name a thing for what it is by its fundamental nature. This is the ISO-11179 standard in ten words or less! A data element uses the syntax “[<role>_]<attribute>_<attribute property>” so that the name is complete in itself, not dependent on a context.

But in COBOL, all data is kept in strings that are stored as physically contiguous records. The fields that make up a record are sub-strings, usually nested in a hierarchy. For example, a customer address record is made up of house number, street name, city, state and postal code sub-fields. The sub-fields are always in the context of the “customer address” and are never considered alone. This is the nature of a hierarchy versus a set.

But in RDBMS, they are separate data elements, and I need to know whether this is “city_name”, “city_population”, “city_area” or whatever. I have to have what ISO calls an attribute property to name it correctly. If I use the same data element in multiple roles in the same scope, I need to qualify it. For example, “boss_emp_id” and “emp_id” are (possibly) different elements from the set “Personnel”, playing two roles in the data model.
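In DDL, the two roles look like this (a skeleton, for illustration only):

CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL,
 boss_emp_id INTEGER  -- the same attribute, playing a second role
   REFERENCES Personnel (emp_id));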

COBOL has no presentation layer, so the data has to be put in display format in the DATA DIVISION (the COBOL version of DDL) by using a PICTURE clause. This lets you put commas, dollar signs, decimal points and other punctuation marks in the strings. Ever wonder about the MONEY and SMALLMONEY data types in T-SQL? They do what PICTURE does! Likewise, the CONVERT() string function with temporal data types is another hidden COBOL skeuomorph.
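Compare a COBOL PICTURE with its T-SQL ghost (a sketch; style 101 gives the US mm/dd/yyyy format):

-- COBOL:  05 ORDER-DATE-DISPLAY  PIC 99/99/9999.
-- T-SQL:  display formatting smuggled into the data tier
SELECT CONVERT(CHAR(10), order_date, 101) AS order_date_usa
  FROM Orders;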

The main verb (the COBOL term for its operators) is MOVE. This shows just how much the physical model of data dominates the language. The strings are copied from field to field and get formatted by the destination PICTURE clauses. The language also likes to use flags to track changes to the data and computations from other levels of aggregation. In the old days, we did not have much storage, so an application was broken into a series of steps, each passing its output along to the next process. Transaction tapes would be sorted with a polyphase merge (look it up; it is a fun algorithm), edited, then merged into a new Master, and finally taken over to a printer to produce a hardcopy output.

None of the steps had direct contact with the previous or following step; they communicated by setting flags or doing totals that are passed along with the data.

You can find this same logic done in SQL.

Today, we could use a MERGE statement and get some safety, but the original UPDATE statement works just fine (sketched below, with illustrative names):

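-- post the day's transactions into the master totals
UPDATE Sales_Staff
   SET sales_amt_tot = sales_amt_tot
       + (SELECT COALESCE(SUM(D.sales_amt), 0.00)
            FROM Daily_Sales AS D
           WHERE D.emp_id = Sales_Staff.emp_id);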
The quota flag can be set after you have gotten the total, in the next process step:

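-- the next "tape step": a flag, frozen into the row
UPDATE Sales_Staff
   SET quota_flag = CASE WHEN sales_amt_tot >= sales_quota_amt
                         THEN 'Y' ELSE 'N' END;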
But the “SQL way” would be:

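-- no stored total and no stored flag; the view is computed when invoked
-- (assuming Daily_Sales keeps the transaction detail)
CREATE VIEW Sales_Report (emp_id, sales_amt_tot, sales_quota_amt)
AS
SELECT E.emp_id, COALESCE(SUM(D.sales_amt), 0.00), E.sales_quota_amt
  FROM Sales_Staff AS E
       LEFT OUTER JOIN Daily_Sales AS D
         ON D.emp_id = E.emp_id
 GROUP BY E.emp_id, E.sales_quota_amt;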
The quota flag is not in the VIEW. Let the presentation layer decide what the business rules should be. The view is not materialized until it is invoked, so it is always current and correct. Just like the ringtone on your cell phone is no longer hard-wired to a physical bell inside a fixed land line.


About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on newsgroups and forums, and he is famous for his dry wit. He is also interested in Science Fiction.