Validation, Verification, and Modification

Comments 1

Share to social media

Over the years, I’ve stressed that a schema should support validation and verification of its data, but I’ve never really given a good discussion of what those terms mean. I’d like to add another desirable property, modification, to this list. These are things that have to be built into the design when you start, and not tacked on as you go. The best trick is not to put bad data in your schema in the first place. Validation is done when you set up the schema. Verification is done after you have the schema and the data. Modification is done as the external world changes around you.

Validation

Validation simply says that the data element has the correct format. It doesn’t say anything about it being correct or even making sense. This is a ‘syntax versus semantics’ kind of thing. The sentence “green dreams swim furiously” is a valid English sentence because it has a subject and verb with appropriately placed adjectives and adverbs. And it is nonsense.

Likewise, ‘41BEV32K25U004671’ is a valid automobile vehicle identification number, but a change in one character and it will still have the correct format but represent a nonexistent entity. There is no such thing as a ‘Lamborghini Civic’ or a ‘Honda Diablo’ and there never will be. The first thing to establish is that the attribute actually exists as a property of a particular entity. This is not always obvious.

In short, validation includes the concept of a domain from which these valid values can be drawn. I’m sure that most of my readers have been in the trade long enough to have heard someone say “Oh yeah! I forgot about that!” in the middle of a project.

I once had to fill out a form that included a physical description where ‘bald’ or N/A were not options for describing your hair color. Then I had a friend who had Heterochromia iridium (two eyes of different colors) that really messed up the Human Resources department.

For the purposes of this article, I’m going to assume that the DDL (data definition language) has covered the domain. Now going to go on to the second phase of validation: is the format in which it is presented in the database correct? SQL has a lot of advantages over filesystems. Strongly typed languages guarantee that at least you must pick an appropriate data type. SQL goes one step further by allowing you to put constraints on the declarations so that you can get a proper domain for a data element. The various data types available to you in SQL will require slightly different tricks to give validations.

Numeric Data

A numeric value represents quantity, magnitude, position, sequence or a unit on a scale. It is not an identifier. It is not a string. You can do arithmetic on a numeric value, but it makes no sense to do such operations on an identifier or string. Would you take the square root of your credit card number and expect to get any meaningful data out of it? How about the third position in a waiting line divided by the seventh position in a waiting line?

Decimal numbers need to be declared within the precision and scale needed to represent your data accurately. One of ‘Mother Celko’s heuristics’ is that a number impresses the reader to the square of the number of decimal places. That means I might be doing work in which it is perfectly fine for me to represent Pi as 3.14, but it sure looks more scientific if you use 3.141592653 instead. Validation of computations is another issue in itself. For now, I just want to deal with the raw data.

Frankly, I would prefer not ever to use floating-point math again. I began my career as a Fortran programmer, so many decades ago, and I’m all too aware of the problems of doing floating-point math. However, the admonition about not allowing negative numbers in floating-point column is also worth considering.

Gerald Weinberg is one of the founders of modern software engineering (The Psychology of Computer Programming: Silver Anniversary Edition; ISBN-13: 978-0932633422). One piece of advice from him was to add ‘reasonableness checks’ to code, to prevent bad data. The concept of reasonableness must have a context. For example, it’s perfectly reasonable for the Dairy Queen chain to buy 5000 bananas. In fact,that might even be a little low for them, given the number of banana splits their stores make in a given day. However, I can say with absolute certainty that I will never buy or consume 5000 bananas in my life.

Weinberg recommended that part of the application code should ask the user if the value he has input is correct when it’s out of a particular range. Since this is where the application comes into the data, I don’t really care about it; I just want to make sure they don’t exclude valid extreme values in the database.

The most common validation failure is with integer data types. Look over your schema and you will see that most of the time these columns cannot be negative or must at least be one. We just don’t do this, in spite of the fact a simple CHECK(col >= 0) or CHECK(col > 0) constraint would prevent the problems. Such a constraint also passes information to the optimizer and could improve your query performance.

One clever solution to this is in Nicholas Wirth’s Modula-2 language; it has a CARDINAL data type, as well as an INTEGER data type. This has two major advantages: it guarantees that you will not get an unwanted negative value in the data field and it lets a compiler pick a different internal representation which might have some advantages.

DATE Data Types

SQL is the first language to have temporal data types. Validating a data type should not be a problem unless for some reason you try to represent it as character strings or numerics. The real problem with dates is that people use a local dialect to display them instead of the ANSI/ISO Standard ISO 8601 format. This is the only format allowed in standard SQL and for quite a few other ANSI/ISO compliant languages. Unfortunately, SQL Server has settings that allow you to display dates in the local dialect. The historical reasons for this have to do the fact that, at one point, COBOL was the only language that was used for commercial use and the only one that anyone really used to any serious work with dates. The bad news was that not having a native temporal data type of any sort, COBOL had to represent dates as strings (actually, everything in COBOL is a string). And you had to write your own routines for handling date arithmetic.

Data validation comes in two forms: internal to the database and external to the real world. An internal validation consists of checking to see that date values for an entity are in the correct temporal sequence. That means we don’t finish a project before we start it, we don’t get divorced before we get married, etc. External validation checks a date against the real world. If your company did not exist before a particular date, then you shouldn’t have any company data before that date. Yes, I know that sounds obvious, but you’d be surprised how easy it is to make a typo and suddenly have billings going back to 1417, not 2017, or insert an event with a due date a hundred years from now (well, that is possibly a very long-term lease, but it is usually a mistake).

At another level, did you show transactions on holidays when you were not open or show a date outside of a legally required waiting period? A simple ordinal date column in your calendar table can make some very complicated temporal computations much, much easier.

String Data

The simple situation is where the allowed string values are drawn from a limited set. This means they can be validated with a CHECK() constraint that uses an IN() predicate. The general rule of thumb is you use the IN() predicate when the list of valid values is short (whatever short means these days) and relatively static. For example, a list of the two letter state codes would meet these criteria, since there are less than 60 of them (50 states, possessions and territories, and military postal addresses).

The case of string data whose domain is more general or too large to put in a list, we would really like to enforce validation with a CHECK() constraint that uses a regular expression. In the original SQL Standards, this meant the LIKE predicate. Later, Standard SQL added the SIMILAR TO predicate, based on the POSIX version of the grep utility. The reason for this decision was that the US federal government was using POSIX at the time.

Verification

Verification the ability to take a valid value from the domain and compared to some external source or test it in some way to see that there actually is an entity that has that attribute. In the face of it, this requirement sounds silly, but it really needs to be checked. You need to know where to go for verification, and that it is appropriate for what you are doing.

Verification implies some trusted source or sources. One approach is to have a centralized registration authority and the other is to have multiple registered trusted sources.

VIN

Consider the Vehicle Identification Number (VIN). It is a 17 character string used by virtually all of the manufacturers of vehicles. Officially, its definition is controlled by ISO (International Organization for Standardization) as standard 3779:2009. The manufacturer provides a serial number within this string. The way you verify this string is to go out to the vehicle and look for it. The manufacturers put it somewhere on the physical vehicle such as stamped on a metal tag, etched in the glass, or provided in some way that a person can find it and read it directly. People in the automotive industry can determine the make and model from the VIN because it is a ‘smart’ identifier, which contains some meaning.

ISAN

There is another approach to take which will require a centralized authority. International Standard Audiovisual Number (ISAN) is a unique identifier for audiovisual works and related versions, similar to the ISBN for books. It was developed within an ISO TC46/SC9 working group. The ISAN code itself is a ‘dumb’ number, meaning that it does not include any signifying elements like the VIN. Its purpose is to uniquely identify the work, without descriptive information about the audiovisual content.

One of the main problems has been that there are so many audiovisual formats and packaging. The same work can be a movie, a TV show, an MP4 file, a comic book and whatever else we invent next week. For example, the Disney cartoon “Frozen” was dubbed in 41 different languages (http://www.globallanguageservices.co.uk/languages-of-frozen/). Decades ago, I did a consulting job with the Belgian National Radio and Television Network (BRTN, originally BRT and later VRT). The Belgians split on language lines continues with two separate state broadcasters providing two services each in their own language. The one benefit I observed was in the cafeteria; the Dutch speakers made sure that there was good beer and the French speakers made sure there was good food and wine. However, after that, having the two IT systems on completely different operating systems and software made data exchange between one side of the building and the other difficult. There was no standard way of identifying video content for the episodes in a purchased show. Since each side of the house had its own computer system, there was no way to find out if you were licensing the same product in Dutch or in French or some third language.

Structure of ISAN

The ISAN has nothing in and of itself that gives information about the work. It is a locator within a central depository that is used for verification. Root and episode portions identify series and the episode identifies works within series.

This basic form of the standard is a 64-bit string, 48 bits for the root and 16 bits for the episode, expressed as 16 hexadecimal digits. There are check digits for humans, for display of ISAN and data entry. The basic display structure is a prefix of ISAN and four units of four hexadecimal digits, separated by dashes.

The express purpose of this format is to provide a root capable of holding 281,474,976,710,656 identifiers, with up to 65,535 episode identifiers. Later standards add extensions for (root, episode, version). The version portion differentiates individual versions, using a 96-bit number, with 48 bits for root 16 bits for episode, 32 bits for version, expressed as 24 hexadecimal digits. The original ISAN with its check digit, X, is kept and a second check digit, Y, is added to the version string attached on the end of it. This means that the prefix of this string can stand as a unit in itself.

This standard was set up of the objective allowing 150 years of use. By virtue of being in ISO standard, it can predictably be used for legal purposes. Since it’s controlled outside of any of the organizations producing the audiovisual content, it will survive any changes in producing / owning organizations. This standard does not support things that are purely data or purely audio (some of these items are covered by other ISO registration systems).

Bit Chain

Bit chain technology is a new player in the game, Thanks to bit coins. The value can be run through an algorithm, and this process gives you both validation and verification, but there’s no need for a central authority. I am not going to even begin to discuss the mathematics of these systems, but they’re going to become more and more important the next few years.

Modification

Modification problems usually come with identifiers, rather than non-key attributes. Enlarging an existing fixed length column is always going to be slow and painful, but it is usually just one alter table statement. Think about what else must change. Besides the column, indexes, foreign key references and constraints will have to be redone. A recent column at spaghettidba.com is a good example of how to minimize the pain in SQL Server (http://spaghettidba.com/2017/11/15/how-to-enlarge-your-columns-with-no-downtime/)

His example is moving a column from SMALLINT to INTEGER. If you try to enlarge this column with a straight ALTER TABLE command, you will have to wait for SQL Server to go through all the rows and write the new data type. SMALLINT is stored in 2 bytes, while INTEGER requires 4 bytes, so SQL Server will have to enlarge each and every row to accommodate 2 extra bytes.

This operation requires some time and also causes a lot of page splits, especially if the pages are completely full. But even worse, the access to the table is prevented while the alter statement is running. This means you probably want to alter the table when everybody has gone home for the night.

To overcome these problems, the author shows you how to use Row Compression. This SQL Server feature used to be expensive, but it got cheap in SQL Server 2016 SP1 and later. The syntax is straightforward and the alter statement would look like this:

This time the command to alter the column completes instantly because the ALTER COLUMN statement is a metadata only change. The compressed SMALLINT data is decompressed as INTEGER data. You really need to read the original article to get all the caveats.

Identifiers

Standard identifiers are the most important things to modify. How you’re going to find your data from external and internal sources in the future. Locating everywhere they occur in your schema, changing all the forms and input screens, and all the other housekeeping you have to do is difficult. However, be grateful that standard identifiers do not change all that often. If you’re lucky, the new identifier can be computed from the old identifier. If you’re not so lucky, you may have to carry both the old and the new identifiers in the table and put in a lot of logic in the front end to make sure you’re using the right one. database people are lucky in some ways; the poor applications programmers have to change paper forms, reallocate files, redesign screens and make all kinds of changes to the physical storage of their data. In a well-designed schema, DRI actions, constraints and various tools let us work with the data, at a more abstract level.

ISBN

If you have a book sitting around, look on the back and you’ll see two International Standard Book Number codes (ISBN–13 and ISBN–10). The ISBN–10 is the original form of this coding that came out of the UK many years ago. It served the book industry quite well, however, the book industry decided to make it part of the EAN barcode system for retail merchandise. This encoding scheme uses 13 digits, which begins with three digits that define something about the merchandise. In the case of books, those three digits are 978. The original ISBN used a mod 11 check digit in the final position. This meant it was possible to get a letter X (roman numeral ten). The EAN system, however, uses a Luhn check digit, which is inferior to the original system. The rest of the digits encode the language, publisher, and then the book within publisher.

Modifying the original ISBN is straightforward. Remove the original check digit, affix 978 on the front, and then compute the new check digit. It’s actually pretty good little programming problem on how to handle strings in SQL. I can put the new ISBN in a computed column! A lot of programming languages do not let you do that.

UPC, EAN and GTIN

The first barcoded retail product was Wrigley’s chewing gum. It is in the Smithsonian Institute if you want to look at it. Today, there are virtually no retail goods without some kind of barcode on them anywhere on earth.

The UPC (Universal Product Code), EAN (European Article Number or now International Article Number) and the GTIN (Global Trade Item Number) are all part of a general barcode standard for retail. They are 12, 13 and 14 digits long. The UPC is used in the United States and Canada, and assumes that those are the two countries involved. The EAN is used internationally, and includes a code for country. These are all part of the GTIN System.

Bouncing around among various barcodes in this system is relatively simple. Yes, you have to watch out for barcodes that have been retired, or which changed product, but it is relatively certain that a given barcode will refer to one and only one product in a given timeframe. The worst case is having mixed identifiers that referred to the same product at the same time, or totally different products at the same time.

ASIN

Mixed identifiers are the worst possible situation. Everything potentially could have an industry-standard and local identifier. The only way I know to get around it is with a lookup table. Does anyone remember Borders books? They had their proprietary Book Inventory Systems (BIS). They were able to sell it to independent bookstores in various regions, but it is expensive to have to put your own pressure labels on books that already have an ISBN preprinted on them. And that’s one of the reasons that Borders went bankrupt.

Amazon has its own internal product identification system. The ASIN (Amazon Standard Identification Number). A retailer can either attach their product to an existing ASIN (of the same product from a different seller) or you can register your brand on Amazon and get new ASINs assigned. This helps both you and the marketplace keep track of inventory and prevents them from accepting orders for products that have run out. The catch here is that unlike a UPC, an ASIN is not necessarily unique to a product. The same product on Amazon US and Amazon UK could have different ASINs on these sites.

This identifier is 10 characters long and can be alphanumeric, not just digits. The only time an ASIN is universal is when it is matched to a book’s ISBN. Consequently, if you’re selling books, you don’t need to worry about ASIN being a different number. It is almost always the same as the book’s ISBN. This reflects where Amazon started its business.

Mother Celko’s Heuristics

Everything I’ve discussed so far is dependent on the availability of industry standards. But much as you want to avoid it, sometimes you have to invent your own encoding schemes. If you’re a big Donald Knuth fan, you might remember the satire that he wrote as a high school kid for Mad Magazine #33, “The The Potrzebie System of Weights and Measures” (https://en.wikipedia.org/wiki/Potrzebie) a parody of the metric system. Knuth’s parody is probably better designed than most of the homemade encodings I’ve seen over the years.

When you have to come up with something new, don’t just making a list and sequentially number it. Take the time to actually design what you’re doing. Some of the practical tips

  • Go back and look for an existing standard that’s close to what you want and see if you can make it fit. There might be a subset of a larger standard that you can use. When it comes time to measurements, you don’t need the whole range of the temperature scale. If you’re trying to record weather reports, the planet Earth has lows below freezing, but nowhere near absolute zero. Nor does it have anything much over 45°C, and nowhere i near the melting point of steel.
  • If you honestly have to build something from scratch, then look at the Unicode And ISO standards for encoding schemes. All the languages supported by Unicode must accept a basic subset of ASCII characters consisting of letters, digits, and a small set of punctuation marks. This is why the metric system, as well as every other ISO standard, can be written anywhere on earth. We don’t use special symbols.
  • It is also important to document things. Yes, yes, I know people have been yelling this at you for decades, but you might want to actually listen to them. Explain why you created this new system, and create a defining source for it. One of the classic horror stories from the Department of Justice involved in early word processing system. Most of their documents used ‘boilerplate’ paragraphs which the law clerks referred to by a simple code. Unfortunately, there was no central defining authority, so there were several versions of each boilerplate depending on which word processor you logged into. Can you see problems?

Take your time, and recite your goals as you work. You want to verify, validate and be able to modify whatever you do.

About the author

Joe Celko is one of the most widely read of all writers about SQL, and was the winner of the DBMS Magazine Reader's Choice Award four consecutive years. He is an independent consultant living in Austin, TX. He has taught SQL in the US, UK, the Nordic countries, South America and Africa.
He served 10 years on ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 Standards.
He has written over 800 columns in the computer trade and academic press, mostly dealing with data and databases. He is the author of eight books on SQL for Morgan-Kaufmann, including the best selling SQL FOR SMARTIES.
Joe is a well-known figure on Newsgroups and Forums, and he is famous for his his dry wit. He is also interested in Science Fiction.

Joe's contributions