Check your Digits

The most persistent struggle in data processing has been to ensure clean data. There are many ways that data can be incorrect and a database must check, as best it can, that the data is correct. The CHECK constraint is ideally suited for this sort of work, and the checking routine can become quite complex when dealing with check digits in data.

Charles Babbage, the father of the computer, observed in the mid-1800s that an inseparable part of the cost of obtaining data is the cost of verifying its accuracy. This was actually the motivation for his “difference engine”, the proto-computer. The navigation tables used by the British Navy were full of errors because they were computed by human beings and then set in type by human beings. Machinery would fix that forever.

Well, we know how well that worked out. One of the biggest problems in the world of data, if not the biggest, is still dirty data. The ideal is still that dirty data never gets into your database.

Statistics classifies errors as either Type I or Type II. A Type I error rejects truth and a Type II error accepts a falsehood. In a database, a Type I error would be a failure to get valid data into the schema. My favorite example is having my email address rejected because of an error in the code, so I cannot register at a website.

But a Type II error is harder to detect. Some of these errors do require that the data gets to the database in order to be checked against other internal data (“Mr. Celko, your checking account is over-drawn!”) or even checked against external data (“Mr. Celko, the FBI froze your account!”). Most commonly, the Type II error is a human error which can be detected at input time. Humans are still providing data.

F. J. Damerau (Damerau 1964) reported that four common input errors cause 80% of the total spelling errors:

  1. A single missing character
  2. A single extra character
  3. A single erroneous character
  4. Pairwise-transposed characters

The single erroneous, missing, or extra character explained 60% to 95% of all errors in his sample of 12,000 errors; pairwise transposes accounted for 10% to 20% of the total.

The first three categories can be expanded to more than a single character, but the single-character cases are by far the most common. In the last category, pairwise transposes (“ab” becomes “ba”) are far more common than jump transposes (transposes of pairs with one or more characters between them, as when “abc” becomes “cba”). This is because we use keyboards for data entry and our fingers can get ahead of each other.

This research dealt with typing text, but the principle applies to data entry as well. We have some advantages with structured data versus free text. Encoding schemes usually have a limited list of values, a fixed length, and/or a regular expression. You do not need the Oxford English Dictionary and Artificial Intelligence to fill out an on-line form. Well, maybe you do if it is government-related.

Check Digits

Check digits are an old technique to get the data to verify itself at data entry time, without having to go to a database or other external source. By applying a self-contained procedure to a numeric code, we can tell if that code is valid or not. Being valid is not the same as being correct; I can construct a valid ISBN (International Standard Book Number) for a book that does not really exist.

The original function was in bar codes and other machine-readable media, to be sure that the machinery got valid input. The distinction between error-detecting and error-correcting codes is worth mentioning. An error-detecting code will find that an encoding is invalid, but gives no help in finding the error itself. An error-correcting code will try to repair the problem. Error-correcting schemes for binary numbers play an important part in highly reliable computers, but require several extra digits on each computer word to work. If you would like to do research on error-correcting codes, some of the algorithms are:

Hamming codes
Fire codes
Bose-Chaudhuri-Hocquenghem (BCH) codes
Reed-Solomon (RS) codes
Goppa codes

On the other hand, error detection can be done with only one extra digit, and it is important to people who design codes for a database because check digits keep the data clean, without triggers or procedures, by simply excluding bad data. Ideally, we want the algorithms written in CHECK() clauses. That gives us declarative code that the optimizer can use, portable SQL, and the certainty that the validation is done one way, in one place, one time.

Classes of Algorithms

Check digits are used with tag numbers; a tag number is an identifier represented as a string and made up of digits. You do not do math on tag numbers as such, but you can take each digit in the string, cast it as an integer, do some math, and come up with a check digit. If the identifier has alphabetic characters, you need a way to cast them as integer values.
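One simple way to do that in SQL, assuming the convention that ‘A’ maps to 1 and ‘Z’ maps to 26, is to use the POSITION() function against an alphabet string; the identifier name and the position <n> here are placeholders:

  POSITION(SUBSTRING(identifier, <n>, 1)
           IN 'ABCDEFGHIJKLMNOPQRSTUVWXYZ')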

The most common check digit procedures come in a few broad classes. One class takes the individual digits, multiplies them by a constant value (called a weight) for each position, sums the results, divides the sum by another constant, and uses the remainder as the check digit. These are called weighted-sum algorithms.

Another approach is to use functions based on group theory, a branch of abstract algebra; these are called algebraic algorithms. A discussion of group theory is a little too complex to take up here, so I will do a little hand-waving when I get to the mathematics. Finally, you can use look-up tables for check digit functions that cannot be easily calculated.

The look-up tables can be almost anything, including functions that are tuned for the data in a particular application.

Weighted-Sum Algorithms

Weighted-sum algorithms are probably the most common class of check digit. They are easy to compute by hand, since they require no tables or complex arithmetic, which is why they were used first in manual systems.

To calculate a weighted-sum check digit (a worked example follows this list):

  1. Multiply each of the digits in the encoding by a weight. A weight is a positive integer value.
  2. Add the products of the above multiplications to get a sum; call it (s).
  3. Take that sum (s) and apply a function to it. The function is usually (s % n), where n is a prime number and n <= 10, but it can be more complicated. An exception in this step is to allow the letter ‘X’ (Roman numeral ten) as the result of a MOD(s, 11) function. This is a very strong check digit and was used in the old ISBN (International Standard Book Number).
  4. The check digit is concatenated to the encoding. It usually gets placed on the right-hand side, because machines and English speakers read left to right.
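As a quick worked example, with weights and a modulus chosen arbitrarily for illustration, take the information digits 482, the weights 5, 3, 2 and a modulus of 7:

  s = (4*5) + (8*3) + (2*2) = 48 and 48 % 7 = 6

so the check digit is 6 and the final encoding is 4826.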

This is one of the most popular check digit procedures. It is easy to implement in hardware or software. It will detect most of the single character and pairwise transpose errors. However, it is not perfect.

Consider the Bank Check Digit, whose weights are 3, 7, and 1, repeated as needed from left to right, with a (s % 10) function. This is used in the United States on personal checks, where the bank processing numbers have eight information digits. Look at the lower-left-hand corner of your checkbook, in the MICR (Magnetic Ink Character Recognition) numbers, for your bank’s code. The formula includes the check digit itself, so that the result should be a constant zero for correct numbers. Otherwise, you could use “10 - MOD(total, 10) = check digit” as your formula.
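As a sketch, here is what that rule can look like as a declarative constraint; the table and column names are mine, and I am assuming a SQL that supports SIMILAR TO and the comma form of SUBSTRING():

  CREATE TABLE Checking_Accounts
  (routing_nbr CHAR(9) NOT NULL,
   -- all nine characters must be digits
   CONSTRAINT routing_nbr_is_digits
     CHECK (routing_nbr SIMILAR TO '[0-9]{9}'),
   -- weights 3, 7, 1 repeat left to right; valid numbers total to zero MOD 10
   CONSTRAINT routing_nbr_check_digit
     CHECK (MOD(CAST(SUBSTRING(routing_nbr, 1, 1) AS INTEGER) * 3
              + CAST(SUBSTRING(routing_nbr, 2, 1) AS INTEGER) * 7
              + CAST(SUBSTRING(routing_nbr, 3, 1) AS INTEGER) * 1
              + CAST(SUBSTRING(routing_nbr, 4, 1) AS INTEGER) * 3
              + CAST(SUBSTRING(routing_nbr, 5, 1) AS INTEGER) * 7
              + CAST(SUBSTRING(routing_nbr, 6, 1) AS INTEGER) * 1
              + CAST(SUBSTRING(routing_nbr, 7, 1) AS INTEGER) * 3
              + CAST(SUBSTRING(routing_nbr, 8, 1) AS INTEGER) * 7
              + CAST(SUBSTRING(routing_nbr, 9, 1) AS INTEGER) * 1, 10) = 0));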

This scheme fails when the digits of a ‘pairwise transpose’ differ by 5. For example, imagine that we wanted to validate the number 1621, but we typed 6121 instead, swapping the first two digits.

Since (6-1) = 5, this algorithm cannot detect the problem. Here is the arithmetic:
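Using the 3, 7, 1 weights from the left and including the last digit in the sum:

  1621: (1*3) + (6*7) + (2*1) + (1*3) = 50 and 50 % 10 = 0
  6121: (6*3) + (1*7) + (2*1) + (1*3) = 30 and 30 % 10 = 0

Both totals are multiples of ten, so both strings pass the check and the transpose slips through.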

A better scheme is the IBM Check, or Luhn algorithm, whose weights alternate between 1 and f(x), where f(x) is defined by the look-up table given below or by the formula f(x) = IF (x < 9) THEN ((x + x) % 9) ELSE 9, where (x) is the value of the digit being weighted. It was patented by IBM scientist Hans Peter Luhn in 1960, and a form of it is widely used today, most notably in credit card numbers.
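Here is that look-up table, as it follows from the formula:

  x:    0  1  2  3  4  5  6  7  8  9
  f(x): 0  2  4  6  8  1  3  5  7  9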

The look-up table is usually faster than doing the arithmetic, since it is small and can take advantage of indexing and parallel processing. Obviously, the look-up table needs a row for each of the ten decimal digits.

DB2 has a special optimization that detects star schemas by looking for a large fact table with many smaller dimension tables referenced by it. This works nicely with look-up table joins of this kind.

Another popular version of the weighted-sum check digit is the Bull codes, which use the sum of two alternating sums, each with its own modulus less than 10. The two moduli have to be relatively prime to each other. The most popular pairs, in order of increasing error-detection ability, are (4, 5), (4, 7), (3, 7), (3, 5), (5, 6) and (3, 8).

For example, using the pair (4, 5) as weights and a modulus of 7, we could check the code 2345-1 with these calculations:

  ((2*4) + (3*5) + (4*4) + (5*5)) = 64 and 64 % 7 = 1

Power-Sum Check Digits

The weights can be defined as successive powers of a fixed base number, with a modulus applied to the sum to get a remainder. A prime number is the best choice for the modulus, but 10 is very common. The most common schemes use a base of 2 or 3 with a modulus of 7, 10, or 11. The combination of base 2 and modulus 11, with a separate symbol for a remainder of 10, is one of these types of check digit. For example, we could check the code 2350 with these calculations:
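Assuming a base of 2 and a modulus of 7, so that the three information digits get the weights 2, 4 and 8, the arithmetic is:

  (2*2) + (3*4) + (5*8) = 56 and 56 % 7 = 0

so the check digit is 0 and the full code is 2350.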

You can prove that any pair of adjacent weights (a, b) for which b = a + 2n, where n is an integer, suffers from the fault that it does not detect transposes of two digits that differ by five. If the digits x and y are swapped, the two weighted sums differ by (a - b) * (x - y) = (2 * n) * 5 = (10 * n); the difference between the two is a multiple of ten, thus they have the same remainder when divided by 10.

Fancy Check Digits

A very good, but somewhat complicated, scheme was proposed by J. Verhoeff in a tract from the MATHEMATICAL CENTRE in Amsterdam, Netherlands (Verhoeff 1969). It is based on the properties of multiplication in an algebraic structure known as the Dihedral Five Group (D5 in mathematician’s terms). The important thing to notice is that D5 multiplication does not always commute; for example, (8 ∘ 9) = 4 but (9 ∘ 8) = 1. This property is what lets it detect transposition errors that other methods miss.

If you are really interested, look up the Wikipedia article on the Verhoeff algorithm.

The old ISBN-10 check digit has been replaced by the weaker EAN-13 weighted-sum check in the ISBN-13 (now just called the ISBN). The change was made to get books into the same encoding scheme as the rest of retail sales. The bad news is that the old check digit was much stronger than the new one. The ISBN-10 was ten digits long, hence the name. Starting at the left side as position 1, each digit was multiplied by its position number and the products were added. The total was taken MOD 11 to become the tenth and final digit. If the remainder was 10, the character ‘X’ (Roman numeral ten) was used.
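For example, take the nine information digits 0-13-110362 (the hyphens are ignored):

  (0*1) + (1*2) + (3*3) + (1*4) + (1*5) + (0*6) + (3*7) + (6*8) + (2*9) = 107 and 107 % 11 = 8

so the check digit is 8 and the full ISBN-10 is 0-13-110362-8.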

Declarations, not Functions, not Procedures

After having given all of these algorithms, I have to tell you not to use them as procedural code in your schema. Convert them to constraints instead. This is an example of thinking in sets and not procedures. The conversion is fairly straightforward. First, make sure that a separate constraint checks that all the characters in the string are digits. The actual check digit validation then takes the n-th digit with SUBSTRING(instring, <n>, 1), casts it as an integer, and multiplies it by the appropriate weight. Do the additions and apply the MOD() function to see if the result is equal to the value of the last digit in the column.

Here is a simple example, using the position in the string as the weight and MOD 10. This is an actual algorithm which was later modified into the old ISBN algorithm I mentioned.
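A sketch of what that looks like, assuming a five-character code with four information digits and the check digit in position five; the table and column names are mine:

  CREATE TABLE Products
  (product_code CHAR(5) NOT NULL PRIMARY KEY,
   -- first constraint: every character is a digit
   CONSTRAINT product_code_is_digits
     CHECK (product_code SIMILAR TO '[0-9]{5}'),
   -- second constraint: position weights, MOD 10, compared to the last digit
   CONSTRAINT product_code_check_digit
     CHECK (MOD(CAST(SUBSTRING(product_code, 1, 1) AS INTEGER) * 1
              + CAST(SUBSTRING(product_code, 2, 1) AS INTEGER) * 2
              + CAST(SUBSTRING(product_code, 3, 1) AS INTEGER) * 3
              + CAST(SUBSTRING(product_code, 4, 1) AS INTEGER) * 4, 10)
            = CAST(SUBSTRING(product_code, 5, 1) AS INTEGER)));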

The reason for splitting the code into two constraints is to provide better error messages. That is how we think in SQL. Avoid procedural code in favor of declarative code.

As an exercise, try doing the Luhn algorithm with CASE expressions instead of the look-up table. The CHECK() constraint is a little long, but pretty fast.
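To get started on it, here is a sketch of the doubling rule as a CASE expression; card_nbr and <n> are placeholders, and you would repeat a term like this for each position that takes the f(x) weight:

  CASE WHEN CAST(SUBSTRING(card_nbr, <n>, 1) AS INTEGER) <= 4
       THEN CAST(SUBSTRING(card_nbr, <n>, 1) AS INTEGER) * 2
       ELSE CAST(SUBSTRING(card_nbr, <n>, 1) AS INTEGER) * 2 - 9
  END

Sum those terms with the unweighted digits and test the total MOD 10, just as before.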