The Practical Problems of Determining Equality and Equivalence in SQL

In theory, it is easy to determine whether data is equal to, greater than, or less than a value. In practice, real-world data types can be surprisingly tricky, even for gauging simple equivalence. Because SQL cannot remain aloof from an application's data types, you need to be aware of the quirks and pitfalls of individual data types and how you can test for equality, equivalence, similarity, identity, and all that jazz.

If you had a really good abstract math course in college, you would have run into the concept of equivalence classes. Essentially, this says ‘Take a set of things, apply an operator of some kind on the set, and it gets partitioned into subsets that all share some property defined by the operator’.

Back in the early days of theoretical relational databases, we only had equi-joins. This was all very well for theoretical work, and we could use it to define various simple relational operators. We had unions, intersections, set differences, functional dependencies, multi-valued dependencies (MVDs), relational division, and flavors of outer joins. The problems, of course, only started when theory hit the real world, where we actually needed data instead of abstract symbols.

When we start looking at real-world data types (integers of any size, floats in various flavors, strings in various encodings, dates and some vendor extensions), we find out that there is more to life than just the ‘equals’ (=) sign. The way that data types are implemented can make it surprisingly tricky in certain circumstances to determine whether values are equal.

The SQL Standard has a very wide range of numeric types. The idea is that any host application language can find an SQL numeric type that matches each of its own. Remember that SQL is meant to be used with a host language and not by itself.

Numbers in SQL are classified as being either exact or approximate. An exact numeric value has a precision, p, and a scale, s. The precision is a positive integer that determines the number of significant digits in a particular radix. The standard says the radix can be either binary or decimal, so you need to know what your implementation does. The scale is a non-negative integer that tells you how many radix places the number has. This has to be represented in some way in a database.

Today, there are not many base-ten platforms, so you almost certainly have a binary machine. However, a number can have one of many binary representations: two's-complement, one's-complement, big-endian or little-endian, and various word sizes. Do we need to be aware of this? The proper mental model of numbers in SQL is not to worry about the “bits and bytes level” of the physical representation or the display format in the presentation layers, but to think in abstract terms. However, the implications of the way we represent data can bite us if we are unaware of them. Let me explain:

Data types and the unexpected

INTEGERS

The good old =, <, > and a few other simple operators work fine with integers. I am going to assume that everyone has found out about SQL’s INTEGER math by now, probably the hard way. If you have not tried writing “SELECT (5/2) AS test” in Management Studio (SSMS) and executing it, then you should do so, because you will not get 2.5.

You will certainly not get 2½ either. The reason that I mention the fractional form, instead of decimals, is historic. It actually took a while for decimals to come into common use. Early coins, paper money, and other commercial labels used fractions. Some of the early pocket calculators in the late 1960s actually had an option for using fractions instead of decimals. But that is a display issue and not a data issue.

The reason that you do not get the answer you expect is that SQL is a strongly typed language, a very strongly typed language. Other programming languages will change data types when a calculation cannot be done in the original data types of the expression. So in many languages you would get 2.5, because the compiler would cast the result to a higher data type, in this case from INTEGER to DECIMAL(p,s). The default casting could be determined by the language standard or a vendor implementation.
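To see this for yourself, here is a minimal sketch you can run in SSMS; casting one operand, or using a decimal literal, is one way (of several) to force a decimal result:

SELECT 5 / 2                       AS integer_division,   -- truncated to 2
       CAST(5 AS DECIMAL(9,2)) / 2 AS decimal_division,   -- a decimal result, 2.5
       5 / 2.0                     AS literal_division;   -- a decimal result, 2.5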

This is a tradition that goes way back to the time when teenagers wore flowers in their hair. PL/I was an old “super language” that IBM tried to push on its customers in the 1960s. It was a weird mix of FORTRAN, COBOL, and ALGOL. There were no reserved words in PL/I, and it attempted to compile anything you gave it by making corrections for you. If you think the assumptions made by spelling checkers are weird, you ain’t seen nothing yet. This was the era of peace, love and understanding in computing.

As with flower-power, it all came unstuck when reality impinged. Casting rules among the data types had no obvious pattern, and data types could be promoted without warning inside any data structure. An INTEGER array could suddenly become a DECIMAL(p,s) array, then get promoted to floating-point, then DOUBLE PRECISION and finally (yes, I am not kidding) COMPLEX numbers. Remember, this was supposed to be the super language that would replace everything. It didn’t.

DECIMAL(p, s)

DECIMAL(p,s) can also be written DEC(p,s). For example, DECIMAL(8,2) could be used to hold the number 123456.78, which has eight significant digits (precision) and two decimal places (scale).
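Here is a minimal sketch of that example, assuming SQL Server syntax; note that exceeding the declared precision is an error, not a silent truncation:

DECLARE @amount DECIMAL(8,2) = 123456.78;   -- eight significant digits, two after the decimal point
SELECT @amount AS amount;
-- DECLARE @too_big DECIMAL(8,2) = 1234567.89;   -- one digit too many: arithmetic overflow error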

The difference between NUMERIC(p,s) and DECIMAL(p,s) is subtle. NUMERIC(p,s) specifies the exact precision and scale to be used. DECIMAL(p,s) specifies the exact scale, but the precision is implementation-defined to be equal to or greater than the specified value. That means DECIMAL(p,s) can have some room for rounding and NUMERIC(p,s) does not.

Mainframe COBOL programmers can think of NUMERIC(p,s) as a PICTURE numeric type, whereas DECIMAL(p,s) is like BCD. “BCD” means “binary coded decimal”, which you will probably not see much of in the future. It was popular on hardware that had to produce numeric output but save space. Each digit was represented by a simple binary code for 0 to 9, along with information about the sign of the number. If the bit patterns did not align to word boundaries, then you got extra precision. Assembly language programs could break it down and put it into a COBOL program as character data. But it was also possible to do math with BCD.

The use of BCD is not common today, but it was popular on older mainframe business computers. I recommend using DECIMAL(p,s) because it might enjoy some extra precision. There is no difference between the two in SQL Server, so the best practice is to use DECIMAL(p,s): if you port your code to another product, or Microsoft changes the implementation, you will still be safe.

Money

The one data type you should never use is MONEY. This data type was added by Sybase for COBOL programmers using their new SQL Server. It does no display formatting for you: no currency signs, commas or periods. Under the covers it actually is a decimal number with four decimal places. Almost.

Unfortunately, it does not do correct rounding of computations. You can ‘Google’ this and get some examples (https://social.msdn.microsoft.com/Forums/sqlserver/en-US/de0e5cfe-b984-4700-b81f-a0478a65daf1/difference-between-numeric-and-money-data-type-in-sql-server?forum=transactsql). The problem, basically, is that the old Sybase engine did its rounding too soon. But today it is even worse: GAAP and EU regulations for commercial computations require more precision and correct rounding rules. Standard SQL never had anything like this, because we view data types as abstractions and not as COBOL picture data.
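Here is a minimal sketch of the early-rounding problem, assuming SQL Server; on the versions I have tried, the MONEY arithmetic comes back visibly off, while the DECIMAL version is off only by a rounding whisker:

DECLARE @m1 MONEY = 1.00, @m2 MONEY = 345.00;
DECLARE @d1 DECIMAL(19,4) = 1.00, @d2 DECIMAL(19,4) = 345.00;

SELECT @m1 / @m2 * @m2 AS money_result,    -- the intermediate quotient is rounded to four places
       @d1 / @d2 * @d2 AS decimal_result;  -- the intermediate quotient keeps more precision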

Approximate data types (REAL and FLOAT)

SQL originally had REAL, FLOAT and DOUBLE PRECISION data types for its approximate numeric types. Vendors had various representations for floating-point math. Some used base-10 models, some used binary. Word sizes varied with the hardware. In short, those were very confusing times.

Today, if you use one of these data types, you are going to get IEEE 754 and its later revisions. It is a large, complicated, and very elaborate standard. Bluntly, if you do not need floating-point math, do not use it. I do not think anybody doing commercial work will ever need floating-point; it was created for engineers and scientific work. Its purpose was to save memory and allow complex calculations.

Again, I have no desire to give a full lesson on floating-point numbers. I do think that every programmer ought to find a website with detailed information and spend about three days on it. Or you can buy a lot of beer for an old FORTRAN programmer and ask him to explain it.

The fundamental concept in floating-point math that is hard to understand is the idea of an epsilon. Equality is not exactly equality in this world. If two numbers differ by less than some value (the epsilon), then they are said to be equivalent. That means they will test as equal. This also means that if you add a very small floating-point number to a very large one, it will be treated like adding zero. This can drive you nuts: given a large set of small floating-point numbers, they can be lost in aggregations, and rounding error gets out of control. Software packages that are based on floating-point get around these problems in one of two ways: specialized hardware does the corrections, or a software library that you never see does it. But SQL is a database language, not a computational one, so I do not know of a single SQL product that has a full floating-point engine in it.
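As a minimal sketch in T-SQL, using floating-point literals and an arbitrary tolerance of 1E-9 chosen purely for illustration:

DECLARE @a FLOAT = 1E-1 + 2E-1,   -- 0.1 + 0.2 as floating-point literals
        @b FLOAT = 3E-1;          -- 0.3

SELECT CASE WHEN @a = @b THEN 'equal' ELSE 'not equal' END AS exact_test,
       CASE WHEN ABS(@a - @b) < 1E-9 THEN 'equivalent' ELSE 'not equivalent' END AS epsilon_test;

The exact test fails because the binary representations differ in the last few bits; the epsilon test treats the two values as equivalent.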

The IEEE standard also has things called “NaN”, which means “not a number” that have to have special rules. If you thought nulls were bad, you ain’t seen nothing yet. There are quiet NaNs and signaling NaNs. Quiet NaNs are used to propagate errors resulting from invalid operations or values, whereas signaling NaNs can support advanced features such as mixing numerical and symbolic computation or other extensions to basic floating-point arithmetic. For example, 0/0 is undefined as a real number, and so represented by NaN; the square root of a negative number is imaginary, and thus not representable as a real floating-point number, and so is represented by NaN; and NaNs may be used to represent missing values in computations. And we also have positive infinity and negative infinity.

The significand (also known as the mantissa or coefficient) is the first part of a floating-point number. It is followed by a base marker, which tells you whether this is base 10, base 2 or whatever (in the written notation it is base 10), and then by the exponent. The exponent says to multiply the significand by the base raised to that power.

The notation for floating-point constants appeared first in FORTRAN. It consists of “<mantissa> E <exponent>”, and it was later used in ALGOL 60 with a special scripted-10 symbol instead of the letter.

But here is the problem: we have dialects. Can the mantissa be a decimal number? If so, how many places? Or can it be missing completely (that was the ALGOL convention)? How many digits are allowed in the exponent?

This is why the ISNUMERIC() function in SQL Server confuses people who do not use floating-point. Quickly now: can you tell whether “E5” is a variable name or a number? No way! This is why I urge people to use a zero or a one as the first character of a floating-point constant.
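A minimal sketch of the ambiguity, assuming SQL Server; on the builds I have tried, the constant with a leading digit is recognized as numeric and the bare "E5" is not:

SELECT ISNUMERIC('1E5') AS with_leading_digit,   -- a valid FLOAT constant
       ISNUMERIC('E5')  AS bare_exponent;        -- looks like an identifier, not a number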

Character Data Types

Back when SQL started, we had EBCDIC and ASCII. Various other character encodings pretty much faded, or there were tools to convert them into one of those two standards. There was TTS for teletypes (used mostly by typesetting equipment), Fieldata (CDC computers), and various five- and seven-level paper tape codes (young people will want to ‘Google’ the phrase ‘paper tape’).

Today, character data means ASCII and Unicode. The old ASCII subset is still a good idea. The Unicode standard keeps a small set of ASCII characters (Latin letters, digits, and some punctuation marks) available everywhere, so that ISO standard codes can be ported across the human race. The design of your data encodings needs to take this into account: adding special symbols, non-Latin alphabets, and other things will simply destroy your portability.

Even when character strings were very simple, assessing equality, equivalence, and similarity was not.

Case-sensitivity

Depending on your programming language, upper and lower case letters may or may not be equivalent. (Trivia question: which writing systems have both upper and lower case? Answer: Latin, Greek, and Cyrillic. Arabic does not; since it is a connected script, each letter instead has an initial, medial, final, and standalone form.)
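In SQL Server, this choice is controlled by the collation. A minimal sketch, assuming two standard Microsoft collation names; your default collation may differ:

SELECT CASE WHEN 'Joe' = 'JOE' COLLATE Latin1_General_CI_AS
            THEN 'equal' ELSE 'not equal' END AS case_insensitive,   -- 'equal'
       CASE WHEN 'Joe' = 'JOE' COLLATE Latin1_General_CS_AS
            THEN 'equal' ELSE 'not equal' END AS case_sensitive;     -- 'not equal'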

Trailing blanks

Do trailing blanks count? In SQL, the shorter string in a comparison is padded with trailing blanks, and then the characters are matched position by position. And yes, case counts, subject to the case-sensitivity rules above. If you remember the xBase languages, their rule was to truncate the longer string and then compare position by position. Some products allowed you to pick whether or not to be case-sensitive.
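A minimal sketch of the padding rule in SQL Server; the trailing blanks are ignored by the comparison even though the strings have different lengths:

SELECT CASE WHEN 'Joe' = 'Joe   ' THEN 'equal' ELSE 'not equal' END AS padded_comparison,  -- 'equal'
       DATALENGTH('Joe') AS bytes_short,      -- 3
       DATALENGTH('Joe   ') AS bytes_long;    -- 6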

Diacritical marks

But it does not stop there! If you have seen Vietnamese, it is a Latin-based writing system that adds lots of diacritical marks to the basic letters. The Unicode standard specifies a normalization (not the same thing as normalization in SQL!) that defines the order in which those marks are added to the base letter. So before you can even do an equality test on a letter, you have to normalize it.
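A minimal sketch of why this matters, assuming SQL Server and Unicode strings: the same letter 'é' built two ways has different byte lengths, and whether the two forms compare as equal depends on your collation:

DECLARE @precomposed NVARCHAR(10) = NCHAR(233),            -- U+00E9, 'é' as one code point
        @decomposed  NVARCHAR(10) = N'e' + NCHAR(769);     -- 'e' plus U+0301, a combining acute accent

SELECT DATALENGTH(@precomposed) AS bytes_precomposed,      -- 2
       DATALENGTH(@decomposed)  AS bytes_decomposed,       -- 4
       CASE WHEN @precomposed = @decomposed
            THEN 'equal' ELSE 'not equal' END AS comparison;   -- depends on the collation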

Aggregation

We actually have two equivalence operations on strings. The obvious one is equality, with that special rule about trailing blanks. But there is also grouping. When you do a GROUP BY in SQL, all the NULLs get put into one group. We debated this in the committee in the very early days of the SQL-89 standard. As more features that depend on grouping rather than equality were added, we propagated this rule.

However, there is something else. The non-NULLs are grouped based on their equality test. So if I have {‘Joe’, ‘Joe ’, ‘Joe  ’} in my group, which one represents the group? The one with no trailing spaces, the one with a single trailing space, or the one with two trailing spaces? Should it be implementation-defined? Should it be random? Why?
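A minimal sketch, assuming SQL Server and a throwaway temporary table: the three spellings collapse into a single group, and which spelling comes back as the representative is up to the engine:

CREATE TABLE #Names (name VARCHAR(10) NOT NULL);
INSERT INTO #Names (name) VALUES ('Joe'), ('Joe '), ('Joe  ');

SELECT name, COUNT(*) AS members, DATALENGTH(name) AS bytes_in_representative
FROM #Names
GROUP BY name;   -- one group of three rows

DROP TABLE #Names;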

Semantic Equivalence

Finally, let us think about semantics: Is “Joe” equal, equivalent, or in some way similar to “Joseph”? Well, sort of. But frankly SQL is not meant for this kind of semantic work. This is why I tell people to buy mailing list software. People try to correct names, split out family and first names, put the address in the CASS format that the post office wants to see, and lots of other things. If you are paying yourself more than two dollars an hour, it is not worth it to try and write the stuff yourself. Oh, and after you have written it, do not forget you have to maintain it and update it.

Temporal Data

SQL was the first language to include temporal data types. In COBOL and FORTRAN, the programmer had to invent his own representation using the basic data types. COBOL used strings, and FORTRAN typically used integers. There were no particular standards for libraries to work with these representations, so every software package tended to reinvent the wheel.

I am always surprised that so many SQL programmers do not know that the term “field” means a component of a temporal value: year, month, day, hour, minute, and second. Their mindset is still locked back in filesystems, so they confuse a field with a column; they are nothing alike.

Temporal data comes in two basic types: dates and times. SQL Server uses the name “datetime2(n)” for what the ANSI/ISO standards call a TIMESTAMP. Time, by its nature, is a very different scale from anything we have looked at; it is an interval scale. There is no starting point and no ending point. The units of measure are weird, and it is a pain to do math with them. The only thing we agree on is a calendar that advances into the future (one direction only!), one day at a time.

For commercial programming, a date is usually a sufficiently fine level of resolution. This is why COBOL programs do not typically carry a time component, and that is actually correct. The first conceptual problem people have is that a date is not a point in time. It is an interval that starts at 00:00:00 HRS and runs up to, but does not include, 00:00:00 HRS of the following day. This is the ISO half-open interval model of time. And there is no such thing as 24:00:00 HRS; however, you will find software packages, such as DB2, that will accept it as input and then convert it to midnight of the following day. This does not stop us from saying “24:00 HRS” when we are being casual, though.
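A minimal sketch of the half-open model applied to a query, using an invented Orders table with a timestamp column; this is the safe way to ask for "everything that happened on that date":

SELECT order_nbr, order_timestamp
FROM Orders
WHERE order_timestamp >= '2023-01-01'
  AND order_timestamp <  '2023-01-02';   -- up to, but not including, the next midnight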

The problem with intervals is that their relationships go beyond simple equality (where both start and end times are identical). The only option in ANSI/ISO Standard SQL is the OVERLAPS (s1, t1, s2, t2) predicate. But intervals can also abut, precede, follow, or contain each other. Some of this can be important if you are trying to find out where someone or something was located at a particular point in time.
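SQL Server has no OVERLAPS predicate, so here is a minimal sketch of the usual workaround, assuming the half-open convention (start inclusive, end exclusive) and made-up dates:

DECLARE @s1 DATE = '2023-01-01', @t1 DATE = '2023-03-01',
        @s2 DATE = '2023-02-15', @t2 DATE = '2023-04-01';

SELECT CASE WHEN @s1 < @t2 AND @s2 < @t1
            THEN 'intervals overlap' ELSE 'intervals do not overlap' END AS overlap_test;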

My advice is to try to design your database so that temporal data is stored with just enough granularity for your needs (again, probably just dates), and to use UTC instead of local time. Depending on your particular SQL product, you may have to write supporting conversion tables for daylight saving time (DST) and local time zones. The ANSI/ISO standards do provide for local time zones, but SQL Server is still behind on this.
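A minimal sketch of that advice, with invented table and column names: keep the business fact at date granularity and capture the audit timestamp in UTC:

CREATE TABLE Orders_Example
(order_nbr    INTEGER      NOT NULL,
 order_date   DATE         NOT NULL,                           -- date granularity is enough here
 created_utc  DATETIME2(0) NOT NULL DEFAULT SYSUTCDATETIME()   -- UTC, not local time
);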

Conclusion

The purpose of this essay is to shake you up a little bit about the way that even the simple topic of data equivalence can get tricky in SQL, and to get you thinking about how data relates to the real world. I hope it worked. I also hope I got you to read some Google results, Wikipedia, the ANSI/ISO Standards, and BOL in between slinging code.