A Tale of Identifiers

Identifiers aren't locators, and they aren't pointers or links either. They are a logical concept in a relational database, and, unlike the more traditional methods of accessing data, don't derive from the way that data gets stored. Identifiers uniquely identify members of the set, and it should be possible to validate and verify them. Celko somehow involves watches and taxi cabs to illustrate the point.

1208-New_Punched%20Card2.jpg

Programmers still have problems with the concept of an identifier. Perhaps it is true that the more things change, the more they stay the same. So let me start with a bit of history and start with punch cards and magnetic tapes. These were the first way we had to store data and I am pretty sure that 95%+ my readership has never used them. You might want to look at this page to get an idea of what you missed. Or see an old science fiction movie.

Magnetic tapes followed punch cards. Tape was a big improvement. Tapes were denser, so you got more data on a reel of tape than in a deck of cards. If you ever dropped a deck of punch cards and dropped a reel of tape, you know which one you prefer. Tapes can rewind; punch cards do not. However the first tape drives did not read backwards; even if they had, the software was based on punch cards, so it assumed forward reading. Today, in T-SQL, cursors can be declared to read forward, too. The old mindset is hard to leave.

Tapes and punch cards have very different physical properties. But they have the same logical properties. They physically allow only sequential access. To get to a record, you read the media from left to right. There was a first record, a last record and the guys in the middle were located by counting records. The count was either absolute (n-th record) or relative (next, current, prior, etc). The fields in the records are also read from left to right. There is nothing set-oriented about this data model. Obviously, the record number depended on the order of insertion. We spent a lot of time sorting and merging the decks of punch cards and tapes on various fields to get them ready for processing. We worried about stable versus unstable sorts and a lot of other physical issues that modern programmers don’t have to consider.

1208-New_MagTapeHeaven2.jpg

Here is where we got the idea of a sort key versus an identifier. For example, a billing application would have account numbers to identify the customers, but the tape would be sorted on ZIP codes for printing the statements for mailing. But we needed to have the Master tape kept in account order for merging updates and inserts into it. It was also how we looked up account data. The result was that we did extractions from Master tapes for reporting.

When a record was deleted from a tape, it was marked with a flag at the start of the record. So there was a logical count of active records and a physical count of actual records for locating data on the tape. Eventually, the old Master tape would be merged with a Transaction tape and the deleted records never moved over.

This is where the idea of a PRIMARY KEY started. Using the sort order was the only practical way to get to the data. Think about random access on hundred a meters of magnetic tape; not a good idea.

Indexes

Along came spinning magnetic media in various forms and the world changed again. Disks, drum and other spinning magnetic media were originally known as random access storage devices. You could put a read-write head anywhere without doing a rewind! But we did not take advantage of this at first.

Sequential reads that let spinning media spin and held the read-write head still were faster and that was the way the software was written. But random access made sorting faster, so we were happy. Eventually, we invented ISAM (Indexed Sequential Access Method) and its descendants. The file is sorted, but we have an index that lets us place the read-write head directly on or near a particular record. The usual analogy is an unabridged dictionary with a thumb or notch index. The notch lets you open the book to any letter of the alphabet in one operation.

The index would have the track and sector where the data was stored so it could move the read-write head there. The track and sector were a physical locator, but the user never had to see them. But back in those days, we did have to maintain them and run housekeeping routines on them. You also had to explicitly invoke an index in your code; there were no optimizers to do your thinking for you.

Pointers and Linked Lists

A bit later we realized that the track and sector values, written as a “row id” could be used inside a record to point to the next record in a path. The “row id” is not an identifier any more than the record count on a magnetic tape was. It is a physical locator.

This lead to navigational databases. As Dr. Codd was to RDBMS, Charles Bachman was to navigational databases (he received the ACM Turing Award in 1973 and was elected a Distinguished Fellow of the British Computer Society in 1977). He and Codd had a famous “shoot out” in 1974 which featured a simple machine shop scheduling application. SQL did not exist, so the experimental Alpha language was used for the relational version and a CODASYL/COBOL product for the network version. The network version was complicated and had two bugs in it. This ended the navigational model in the popular mind. Bachman is largely forgotten by DB programmers these days, thanks to the domination of SQL and RDBMS, but his pioneering work in database systems is still the basis for many large scale commercial databases.

The navigational model lead to various access structures which all had different names in different products. The records located by pointers are usually called nodes, using terms from graph theory. The simplest connection between nodes is a single pointer that can be followed in one direction only. Remember a deck of punch cards? The terms most often used are (Parent, Child) or (Master, Detail). See how many times you can find a newbie on a forum who uses those terms for tables in SQL. This navigational view of data has no place in SQL, but it is hard to escape the old mindset.

A link was a two-way pointer chain. Remember the later tapes drives that could re-wind and read backwards? Ever hear a newbie invent the term “link table” because they don’t know how to model a relationship? Yep, the more things change, the more things stay the same.

One of the major problems with the navigational model is that you have to navigate it. That is, you need to have an explicit path, not a relationship. To use Chris Date’s classic “Parts & Suppliers” example, consider a DB where each supplier is the parent of a chain of parts. This design is fine if I always want to get a supplier then look at his parts. But if I want to find all the suppliers who stock #5 machine screws, I am in trouble. I have to scan the catalog of each supplier, one at a time. In SQL, I would have a relationship table for “Part_Sources” and query it.

Newbies use IDENTITY to build “mock-pointers” in SQL. Like pointers, IDENTITY is a physical property of a node, er, I mean table. The term IDENTITY column is technically wrong. A column has a data type, all data types in SQL are NULL-able, can be assigned values, have operators, etc. IDENTITY has none of this; it is like a pointer. The hardware picks a value, do you do not assign it. It is not NULL-able. It makes no sense to do math on it, even tho it looks like a numeric. And like a pointer or link, it can be attached to any kind of node, er, table. In one record, we point to an automobile, then to a squid and so forth. I like to think of IDENTITY as a kabbalah number. This is the mystic Hebrew number that God puts on everything in the universe. If you know that number, you have all kinds of magic powers over the object. You can change automobiles into squids!

Identifiers

Identifiers are a logical concept in RDBMS. They are a subset of attributes of an entity or relationship that identify individual member of the set. A VIN identifies an automobile, an ISBN identifies a book, a DUNS identifies a company and so forth. An identifier is specific to one kind of thing; you do not order a book with a VIN.

A common error is thinking that if something is unique (or declared as an IDENTITY property), then it is an identifier or key. Yes, an identifier has to be unique, but not the reverse. The number pi is a unique constant, but does not identify any entity. This error shows up in data element names in look-up tables with monstrosities like “postal_code_id” and worse. It is just a “postal_code”, like pi is a constant. Codes and identifiers are totally different kinds of attribute properties.

Identifiers can be multi-column and long. A three dimensional (x, y, z) coordinate system is multi-column. The International Bank Account number (IBAN) is long.

So what? We have 64 bit hardware and terabytes of fast cheap storage today. I have row constructors in SQL and good text editors. If I have an industry standard key, then I have to use it to get two things I want for data integrity.

Validation & Verification

 

Validation means I can look at a data element value and see that it has the correct form. For example, I know a ZIP code is a string of exactly five digits; a string with less or more digit is not valid; string with alphas or special characters is not valid. I also know the lowest ZIP code is in Holtsville, New York (00501) and the highest is Ketchikan, Alaska (99950).

Verification means there is a test for an attribute like weight (put it on a scale) or trusted source to verify the code or identifier. In the case of the ZIP code, the trusted source is the United States Postal Service (USPS). ZIP codes change, some of them in the proper range are not currently issued, and so forth. This is vital for data integrity and audits.

Many industry standards have check digits and regular expressions that also validate them at data entry time.

IDENTITY has neither validation or verification. It can be any positive or negative integer of any size. The same data put on two different machines will be assigned different IDENTITY values based on the hardware Not a lot of help here. Go to any SQL Server forum and you will find a posting about someone who used BCP to move data to a new machine, but forgot to protect the IDENTITY mock-pointers so he got the tables re-numbered as they were inserted. At least pointer chains were not exposed and we had utility programs in case the chain got broken.

Another problem is that the IDENTITY values are non-deterministic. Given this table:

These three insertions are logically identical,

But it does not work that way. Likewise, deleting and re-inserting a row should not change a proper attribute’s value, but it will not work with an IDENTITY column.

A Cautionary Tale

“A man with a watch knows what time it is. A man with two watches is never sure.” – Segal’s Law

1208-New_taxi_medallion%20small.jpg

Does this mean that proper identifiers will guarantee data integrity? Nope, no system is perfect. Let me tell you a tale about a taxi company and their motor pool. The taxi company had four identifiers for its motor pool. The industry standard one was the VIN (Vehicle Identification Number, ISO Standard 3779) since that is required by law for titles, tags, insurance and other things. It is 17 alphanumeric characters long. It is broken into fixed length fields that tell you the manufacturer, the plant, make, model and year of the vehicle. In the old days, the VIN was on the engine block where only a mechanic could see it. Today, it appears on a metal tag on the dash board and, for some cars, etched into all the window glass.

While having the VIN where it can be seen is good for theft prevention and detection, the truth is that most people have no idea what the VINs for their vehicles are. When a passenger steps into a taxi, he is not going to read and remember the VIN. But it is permanent and universal.

The license tag was another identifier. In particular, tags for taxis had a special format, so you could tell that they were a commercial vehicle. While not as permanent as the VIN, the tag would stay with the vehicle while it was used commercially and renewed with a sticker every year. If the vehicle was retired from commercial service, it needed a new tag.

The passenger sees a large three-digit fleet number on the doors of taxi, next to the company name. It is a lot easier to say “Metro Cab #025” instead of trying to rattle off all 17 characters of a VIN or the 6 to 8 characters of a commercial auto tag. The fleet numbers are pretty much sequential, perhaps with a gap when a vehicle is taken out of service. Gaps were not desirable. The sequence also made assigning routine maintenance easier — look at the calender date and the last two digits of your fleet number. When they match, get an oil change, rotate the tires and vacuum the back seat.

The fourth identifier is a city-issued medallion number. This method of raising revenue actually goes back to England and ferry boats. The city mints a fixed number of the medallions and then auctions them. In those days, they were literally a metal disk. The taxi driver has to display the medallion to lawfully operate in the city. Since there are only a fixed number of medallions, they become extremely valuable. In 2006, for example, a New York City medallion was worth about $300, 000.00 when it went for auction.

For the youngsters in my readership, Checker Cabs were an American company that made one model vehicle for decades — the American version of the famous London taxis. Besides having big rear seating, a huge trunk, a flat rear foot well and other features for it specialized work, the fenders and doors were designed to swap out with a few screws. The idea was that the motor pool cloud handle simple accidents and get the vehicle back in service.

Supposedly, each of these identifiers (VIN, auto tag, fleet number and medallion) should map to one and only one vehicle. Now here is where the story gets interesting. The motor pool supervisor realized that a certain percentage of the fleet is in the shop at all times. If you take the medallion off of a taxi in a service bay, you can loan it to your partner in crime, instead of putting in the safe.

The scam was to rotate auto tags and doors to give the appearance of a properly run fleet of taxis. It was done manually because this was before personal computers were common. It is a good programming problem since the maintenance logs have to balance and every vehicle had to look as if it were in service most of the time.

The scam fell apart when a taxi was pulled over by a traffic cop on a routine traffic violation. The officer noticed that the fleet numbers were different on the driver and passenger side doors. He then checked the medallion, auto tags and VIN against the city records. Nothing matched.

Conclusion

There are lots of morals here:

  1. Learn what an identifier is
  2. Un-learn what a locator/link/pointer is
  3. Think in sets not sequences
  4. Use industry standards to get validation and verification
  5. Don’t wear two watches unless you can keep them in synch with a radio time signal.