BIT of a Problem

The BIT data type is an awkward fit for a SQL database. It doesn't have just two values, and it can do unexpected things in expressions. What is worse, it is a flag rather than a predicate, and so its overuse, along with bit masks, is a prime candidate for being listed as a 'SQL Code Smell'. Joe Celko makes the case.

Once upon a time, long, long time ago, programming languages were tied to the hardware. Obviously, assembly and machine languages were designed for one particular family of computers. But even the higher level languages still clung to the hardware. FORTRAN and COBOL standards were defined on the assumption that files were sequential, main storage was contiguous and all machine were binary. The hardware affected the design of the languages. My favorite was the three-way IF in FORTRAN that looked like this:

Depending on the signum (-1, 0, +1) of the numeric expression, control jumped to one of the three labels. Why did this exist as a separate command? Because one of the first FORTRAN compilers was written for an IBM machine that had that particular jump built into the hardware as a single command. Likewise, all of the weird pre- and post- increments in C were based on the DEC PDP-11 and VAX hardware.

The hardware used a lot of bit flags and bit masks for control inside the machine. Flags were also used in the file system to mark deleted records and other things. Storage was expensive and very slow by today’s standards. Bit flags were small and relatively fast to write.

Since this was the style of coding we saw, we tended to mimic that programming style in higher level code. The bad news is that even as our programming languages became more abstract, the programmer’s mind set did not.

When we got to SQL, we were in a very different game. The BIT data type in T-SQL was originally a bit in the way that computer people think of it; one or zero indicating an open or closed circuit. Code was written on the assumption that if a BIT column wasn’t zero, it was one and if it wasn’t one, it was zero.

Then BIT became a numeric data type. All data types in SQL are NULL-able. This change caught a lot of old T-SQL programmers off-guard. Unexpected NULLs showed up. In spite of being a numeric data type, you cannot do any math on it. Try this simple expression:

and you get “Msg 8117, Level 16, State 1: Operand data type bit is invalid for add operator.” instead of the expected answer one. Likewise, the expression “CAST (‘1’ AS BIT)& CAST (NULL AS BIT)” will return a NULL instead of an error message that bit operators are not defined on NULLs. Try to wrap your head around what a NULL bit would mean. A quantum state?

This is even worse with bit masks. They are defined by hardware-specific, physical implementations. Are you on a high-end or low-end machine? Does the machine have 8, 16, 32, 64, or 128 bit words? Twos complement or ones complement math? Can a single bit in the mask be NULL-able? Hey, the SQL Standards say nothing about hardware, so bits do not have to exist at all!

So, BIT is weird from a conceptual SQL viewpoint. It is simply too physical for an abstract language.

What about the host languages? Did you know that +1, -1, +0, and -0 are all used for BOOLEANs in various implementations of various languages? In C#, Boolean values are 0/1 for FALSE/TRUE, whereas VB.NET has Boolean values of 0/-1 for FALSE/TRUE and they are proprietary languages from the same vendor. That means all the host languages — present, future and not-yet-defined — can be different.

These conversion problems and overhead do not occur with other data types. What about coding with them inside the SQL schema and not worry about communicating to a host program?

There are usually two situations for using bits in practice.

Either the bits are individual attributes or they are used as a bit mask or vector to represent a single consolidated status. In the case of a single attribute, you find yourself quickly blocked. For example, doing a survey with “yes/no” questions starts off fine, then chokes when you need to indicate “No Answer given”, “Not Applicable”, “Contradicts a previous answer” and other values. Even the ISO sex codes have four values (0 = unknown, 1 = male, 2 = female, 9 = lawful person. such as a corporation).

In the second case, you are playing the child’s game “20 Questions” in SQL. Imagine you have six components for a loan approval, so you allocate bits in your schema. You have 64 possible vectors, but, say, only five of the permutations are valid(i.e. you cannot be rejected for bankruptcy and still have good credit). For your data integrity, you can:

  1. Ignore the problem. This is actually what most people who are less familiar with the repercussions will do. I have spent three decades cleaning up bad SQL and I see it all the time. I cannot recommend it.
  2. Write elaborate CHECK() constraints with user-defined functions or proprietary BIT-level library functions that cannot port and that run like cold glue.

Now we add a seventh condition to the vector. It just keeps getting worse.

Single Bit Flags

Scott Hanebutt posted a forum piece on a package that had been moved to SQL. I am going to paraphrase his problem; I just want you to know I did not invent this schema as a straw man. You have a table of product releases. Each product can have multiple releases. There is a flag on each release. If the product has multiple releases then master_release_flg = CAST(1 AS BIT) for the unique master release and CAST(0 AS BIT) for the individual releases. If an item does not have multiple releases then master_release_flg = CAST(0 AS BIT).

The original posting had some complicated reporting requirements, but let me ask that you perform these tasks on the basic data model.

  1. Write declarative DDL that enforces the flag constraint. No triggers, no procedural code.
  2. Write a VIEW that shows the master release for all the product.
  3. If there were two product releases, then deleting the current master will leave the other release alone and still flagged zero. That meets the flag definition and we are okay. But if there were more than two releases for product, then we need a rule for promotion. Again, fix this without procedural code.

The usual idiom for this sort of thing uses an integer to set up a successor. Let’s agree that zero is where the master releases start.

That meets criteria one; this is pure declarative code. The VIEW to show the Master products is easy.

It never needs updating, either. But you might want to close up gaps, just to make it pretty.

Bit Mask Coding

The most obvious problems with using an INTEGER (SMALLINT or BIGINT) as a bit mask are:

  1. The data is unreadable. Can you easily figure out what each bit means by looking at without a guide? Looking at “WHERE auction_status & 42 <> 0” is not very clear to a maintenance programmer. This is why we design encoding schemes like Dewey Decimal codes for libraries; they are easy to read and to use.
  2. Constraints are a bitch to write. The two choices are to use a lot of proprietary bit-wise and/or operators that are hard to optimize, or to a set of INTEGER values with a [NOT] IN() predicate.
  3. You are limited to two values per field. That is very restrictive; even the ISO sex code cannot fit into such a column. What is fun is using two or more fields for more values. You start off with {00, 01 ,10, 11} in contiguous positions. But when need more values, the next bit is not contiguous and the predicates are really horrible.

    Think about a “is_completed_flg” bit flag on an auction. Did it complete because a bid was accepted? Because the bid was withdrawn? Because the breserve price was not met? Because the item was withdrawn? Because it expired?

  4. There is no temporal element to the bit mask (or to single bit flags). For example, a flag “is_legal_adult_flg” does not tell you if the person is 18 or 80. Nor does it tell you exactly what kind of legal adult is involved. In Texas, you can be a stripper or porn star at 18 years of age, but cannot buy or serve a drink until you are 21. A DATE for the birth date (just 3 bytes) would hold complete fact and let us compute what we need to know; it would always be correct, too. How do you know a bit flag is still current?
  5. You will find out that using the flags will tend to split the status of an entity over multiple tables. Let me give an actual example in one of my consulting jobs. The client conducts on-line auctions. A request for bids is posted, then it moves thru a series of steps over time until it expires, is withdrawn or finds a winning bid. The auction can be in about a dozen states, but the state changes have an ordering. A bid cannot be rejected until it is made, and so forth.

    In the old system, it was necessary to go to the Invoices table, and see if a payment had been made on a bid. The date of the payment of the invoice sets one of many bit flags that were supposed to tell us the status of auctions. Now go over to shipments and see that the matching shipment left after the invoice was paid. You get the idea; tracking an auction involved between eight to ten tables.

    The bad news was that people who worked with one of the process steps did not know or much care about the big picture. Data integrity suffered. Trying to track an auction history was hard enough, but when a flag had not be set correctly, it became a nightmare.

  6. Bit flags invite redundancy. In the system I just mentioned, we had “is_active_flg” and “is_completed_flg” in in the same table. A completed auction is not active and vice verse. It is the same fact in two flags. Human psychology (and the English language) prefers to hear an affirmative wording (remember the old song “Yes, we have no bananas today!” ?).

    All of these bit flags, and sequence validation are being replaced by two sets of state transition tables, one for bids and one for shipments. For details on state transition constraints. The history of each auction is now in one place and has to follow business rules.

  7. By the time you disassemble a bit mask column, and throw out the fields you did not need performance is not going to be improved over simpler data types. Remember we have 64 bit machinery today and it does comparisons quite fast.
  8. Grouping and ordering on the individual fields is a real pain. Try it.
  9. 9) You you have to index the whole column, so unless you luck up and have them in the right order, you are stuck with table scans.
  10. Since a bit mask is not in First Normal Form (1NF), you have all the anomalies we wanted to avoid in RDBMS.

Predicates versus Flags

Don’t confuse flags with an attribute that happens to have just two values. The Rh blood factor which comes in “Rh+” and “Rh-” is one example. I can verify the value of this blood factor with a test; it is a property of an entity. A flag is “second-hand” knowledge.

SQL is a “predicate language” and not a “flag language” like assembler. We use predicates and declarative code to discover the current state of the database in queries and to maintain data integrity in DDL.

Let me go back to the example of finding personnel that are legal adults. With a flag, I look use something like “Personnel.legal_adult_flg = CAST(1 AS BIT)” as my test. I am assuming that the data is current. But if I used “WHERE DATEADD (YEAR, 18, Personnel.birth_date) <= CURRENT_TIMESTAMP”, I know my result set is right. And it is right every time I execute my query.

If the legal age changes to 21 for our purposes, as it did when the US raised the drinking age, then the code change is easy; I have two slightly different predicates for two kinds of legal age. Some insurance benefits can apply to children up to age 25, which gives us three kinds of legal age. The assembly language programer is busy trying to add more flags to his Personnel table. Then he has to update them with my predicates before every execution of his query. This is not saving you anything.