Discrete and Continuous data in SQL

Not all data is discrete; some data types represent a continuum. In SQL, we have to approximate them and live with the special problems of handling continuous data. We need to understand the problems associated with continuous data types, when these will happen, and how it affects constraints and the results of queries. Joe Celko explains.

Beware of Derf

Consider derf. Derf is a whole number between 5 and 6. It was created in the episode “iTwins” by the lead character of the children’s TV show, iCarly, Carly sees a video of Chuck beating up Spencer, her older brother. To get revenge on Chuck, she makes up the number Derf, and tricks him into thinking it’s part of the new International Math Laws that were recently passed by the U.N. Chuck uses the number during his math test and fails it. He loses his summer vacation and is sent to a math camp instead. Justice is served!

Besides the acting in the sketch, why is this funny? It is absurd. Even without a math degree, people understand that integers are “discrete”, a term in math which has a specific meaning, but roughly means that each element in a set is known exactly, can be counted and cannot split any further. There is no derf. There cannot ever be a derf. There is no place to put it!

Computer people live with a binary and discrete world-view. We like that model because it is simple and it has been good for us. We can live with binary integers and character strings. We can represent discrete things with encodings.

But not all data is discrete. One alternative to discrete sets is a continuum. The most most common continuum in a database is a timestamp and the second most common data type are the REAL, FLOAT and DOUBLE PRECISION family. But we cannot represent these continuous values in a computer. We have to approximate them and live with errors, model them within constructs of the language or use programming languages that manipulate them symbolically. o:p>

The symbolic solution is used in mathematical programming languages, but only for important irrational numbers, like Ï, e, and â2. They display the computations in symbolic form, so we do not much care about them in the database side of the house. SQL was not meant for graphics, heavy computations, natural language processing and lots of other things. This is not a problem any more than the fact that a screwdriver is not a saw.

The “constructs in the language” approach is familiar to old FORTRAN programmers. The early FORTRAN only has arrays and scalars. We got really good at using arrays for everything that was not a scalar numeric data element. For example, complex numbers were modeled as an 2 by (n) array; strings were done as one dimensional arrays and so forth. But this means that the basic operation had to be encapsulated as functions using basic operations.

In SQL, we also have only one construct – the table. We can use columns in a row in much the way that we used arrays. Vadim Tropashko has used this approach to model rational numbers in Oracle as (numerator, denominator) pairs and user-defined functions. But this still does not give us good performance, a standardized set of rules and standardized representation. That is why we have IEEE Standard floating point numbers today.

Let’s go back to Ï, which I rounded to 3.14 in middle school, then to 3.14159 in high school and finally to 3.141592653 when I got my first pocket calculator with a ten-digit display. Back in middle school, I knew (3.14 < Ï < 3.15). This interval contains Ï, but I do not exactly where it is. Later the interval became (3.14159 < Ï < 3.14160) , and continued to get smaller and smaller as I got tools with more and more decimal places. But I cannot ever get to Ï. so instead of thinking of Ï as a discrete point on a number line like 5 or 6, think of it as a “cloud of points” that has Ï inside it, somewhere.

The boundaries of the cloud are Ï Â± ε. As the value of ε approaches zero, I get better and better approximations in this series of infinitely nested sets. These sets are called open intervals because the end points are not in the set. This is the concept of a limit; I can get as close as I wish to Ï , but I cannot actually get to it. In the real world, a computer cannot have infinite precision. Or infinite storage, in spite of what your hardware vendors tell you.

How are these approximation modeled in my finite world? Let’s start with REAL, FLOAT and DOUBLE PRECISION data types, since they have established Standards (IEEE-754) that have been implemented in hardware today. You can get a detailed tutorial on floating point numbers at: (Complete) Tutorial to Understand IEEE Floating-Point Errors and I recommend you do this. There are also good tutorials you can find on the internet, too.

Back in the early days of computing, there were proprietary floating point from vendors. These formats all worked on the same principles. A number made up of a three parts; a base, an exponent, and significand (or the older term mantissa). The base is either two or ten in practice, but could be any integer in theory. The exponent is another integer; it represents the exponent to which the base is raised in this formula:

This is classic scientific notations. The IEEE Standards have several versions. But the most common ones are:

  • IEEE Short Real or single precision is 32 bits (1 bit for the sign, 8 bits for the exponent, and 23 bits for the significand.
  • IEEE Long Real or double precision is 64 bits (1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand.

If you want a mental model, imagine a set of points spread like unevenly spaced dust over a number line. This is how we get the range of values we have in floating point numbers. If we cannot hit one of these points, we try to “get close” and learn to live with errors and approximations. We try to minimize the errors, tho. If you have some time, google all of the tricks and algorithms we have to use. They are complicated! They are ugly!

But the complexity is hidden from you. You do not see the base and exponent being made equal under the covers, so you can do addition. You do not see the elaborate algorithms for multiplication and division. Under the covers, we have epsilon! When you compare floating point numbers, their epsilon neighborhood vary depending how close to zero they are.

Create a one real column non-keyed table and fill it with thousands of floating point numbers, both positive and negative. Then do a SUM() on it. Take the same data set, sort it in ascending order then sum it again. Sort it in descending order then sum it again. Finally, sort it is ascending order by absolute value, then sum it again. The four totals will all be a little different in the last digits because of floating point rounding errors accumulating. If you care, the sort-by-absolute-value set will produce the most accurate result. Languages that work with floats will do a lot of re-arranging of the computations to minimize errors. No SQL product I know does this; it would destroy performance. We did not mind a little extra time when we compile a FORTRAN programs because fast, correct answers are more important than fast compiles

To avoid these problems today, we are more likely to use DECIMAL(s, p) or NUMERIC(s, p) data types to get reasonably accurate results than to use REAL, FLOAT or DOUBLE PRECISION. In SQL Server, we treat DECIMAL(s, p) and NUMERIC(s, p) as synonyms, but in the ANSI/ISO standards treat NUMERIC(s, p) as having exactly p decimal places. This was to accommodate systems with COBOL-style numbers – character strings. But the DECIMAL(s, p) has has no less than p decimal places. This was to accommodate systems with BCD numbers. If you younger programmers have never seen BCD (Binary Coded Decimal), then Google it. This means that DECIMAL(s, p) can give better rounding than NUMERIC(s, p) because of the extra digits.

Another consideration is that when you model money, never use the old Sybase proprietary MONEY and SMALLMONEY data types.

  1. They are proprietary, so porting it is a pain. It is one of the many “Sybase Code Museum” features from decades ago. This was a way keep COBOL programmers happy; it makes SQL look like it has a PICTURE clause which stores a currency sign, commas and decimal points in the data.
  2. Writing code in dialect when you don’t need to make you sound like a hillbilly to people that speak the language. You are better off with DECIMAL(s, p) so you can use a properly sized column.
  3. It violates many current GAPP and EU rules about precision and rounding rules. You need to research just what the laws and regulations are in your industry.
  4. The MONEY data type has rounding errors. Using more than one operation (multiplication or division) on money columns will produce severe rounding errors. A simple way to visualize money arithmetic is to place a ROUND() function calls after every operation. For example,

    can be rewritten using money arithmetic as:

    Rounding to four decimal places might not seem an issue, until the numbers you are using are greater than 10, 000.

    Another example:

Time, the Other Continuum

The other continuum in SQL is the timestamp, which is the DATETIME2(n) in SQL Server. The nature of time is very different from real numbers. A dimensionless quantity is a quantity without an associated physical dimension. Time is a dimension that measures the movement of objects. Time moves and moves in one direction. There is a joke about a father showing his young son, who has only seen digital clocks and watches, an old pocket watch with a sweeping second hand.

  1. “What is it, Daddy?”
  2. “It’s a watch! This is how we used to tell time.”
  3. “HOW!? It is always changing!

Just as we had the IEEE standards to model real numbers, we have the ISO-8601 standards to model time and display it. The ISO model is based on half-open intervals, not open intervals like the reals. The temporal starting (earliest) point is in the interval, but the end point (latest) is not. The advantage is that you can abut two half-open intervals together and get a third half-open interval. Likewise if you remove a half-open interval from another half-open interval, you get one or two half-open intervals. This prevents an entity from being in two places at the same time.

In spite of the fact that we talk about “24 hour time” (or “military time”), there is no such thing as “24:00:00” on the clock. IBM and many software packages will let you write this, but they convert it to “00:00:00” of the next day for you.

The bad news is that our temporal scales are irregular. We have an assortment of various sized traditional units for durations. INTERVAL data types are used to represent temporal duration. They come in two basic types. Intervals that deal with the calendar and those that deal with the clock. The year-month intervals have an express or implied precision that includes no fields other than YEAR and MONTH, though it is not necessary to use both. The other class, called day-time intervals, has an express or implied interval precision that can include any fields other than YEAR or MONTH — that is, DAY, HOUR, MINUTE and SECOND (with decimal places).

The units of time in an SQL temporal value are called fields; do not confuse this with the term “fields” are use with non_RDBMS file systems. The fields in the interval have to be in high to low order without missing fields. You already know this system, since you have used it all of your life. Here is a quick table of the subset of the Common Era calendar used by the ANSI/ISO Standard.

Field Inclusive value limit
YEAR ‘0001’ to ‘9999’; follows the ISO-8601 Standard
MONTH ’01’ to ’12’; the value 12 might be rounded to one year
DAY ’01’ to ’31’; must be valid for month and year
HOUR ’00’ to ’23’; value 24 might be rounded to the day
MINUTE ’00’ to ’59’; watch for leap seconds!
SECOND ’00’ to ‘59.999..’ ; precision is implementation defined

Notice that YEAR starts at ‘0001’ and not ‘0000’; this is probably probably good enough for any application you will ever write. But if you ever work for a museum, the ANSI/ISO Standards can extend this range. But you will have write your own support routines.

We decided that the only display format allowed in ANSI/ISO Standard SQL is "yyyy-mm-dd", with the dashes. The month within the year and day within the month are in sorted order, so that this fixed length string is easy to use and language independent. Many SQL products have string functions to format dates in a local dialect, such as the old Sybase SQL Server CONVERT() that lives on today. The reason for them is COBOL; No, really! In the COBOL world, everything is a string and all data formatting is done in the application. Today, we use a tiered architecture with presentation layers. These three fields make up a DATE data type. The DATE data type in SQL Server is only 3 bytes. You can use the BETWEEN predicate and finally treat dates and times are if they are discrete.

The original Sybase SQL Server was based on the hardware clock ticks of a DEC PDP-11 computer, plus floating rounding errors, so we got weird DATETIME values. We also had problems with DATETIME rounding the fractional part to clock ticks to increments of .000, .003, or .007 seconds. The rules are that the fractional seconds in DATETIME values are rounded to increments of .000, .003, or .007 seconds, as shown in the following table.

From

To

Rounds to

xx:xx:xx.999

x:xx:xx.999

00:00:00.00

xx:xx:xx.995

xx:xx:xx.998

xx:xx:xx.997

xx:xx:xx.992

xx:xx:xx.994

xx:xx:xx.993

xx:xx:xx.990

xx:xx:xx.991

xx:xx:xx.990

This is really floating point errors transposed into temporal values! We then had all kinds of kludges to trim off the time part to get a date starting at ’00:00:00.000″ to use as a date in applications. My favorite kludge was :

This trick depends on the fact that the Sybase model using a floating point number with the date in mantissa and the time in the exponent; if you wanted to get the next day, use CEILING() instead of floor. I was working with minicomputers that had floating point hardware, so this was fast and easy. Today, you can use the simple, portable CAST(@test_datetime AS DATE) instead. The TIME data type is made up of HOUR, MINUTE, and SECOND. Technically, the only unit of time in the SI (“Système International d’Unités” aka, the Metric System) is the second.

This was strictly local time (or lawful) time taken from a local clock. We had no idea that getting a time signal from global satellites would be cheap and common enough to use in everyday hardware like cell phones, wrist watches or alarm clocks.

Coordinated Universal Time (UTC aka “Temps Universel Coordonné”) is the standard time from which all local times are derived. It is one of several closely related successors to Greenwich Mean Time (GMT), but do not worry about them unless you are doing physics. Yes, people still say GMT, but it has not exists for decades. Yes, there is a lot of French in ISO Standard names and the initials do not match the English.

Today, Standard SQL has built-in features for local times, time zones and UTC. SQL Server has added UTC time to local time, and the DATETIMEOFFSET  data type for the same functions. They let you define local time as offsets from UTC, but you have to handle Daylight Saving Time (DST) on your own. There is a good series of articles on this topic by  Aaron Bertrand  starting at: Handle conversion between time zones in SQL Server – part 1

But what about the data type itself? The SECOND field can have several decimal places for fractional seconds. The standard is silent on how many places, but the FIPS-127 Standards required at least five places and most modern machines use seven places. In the real world, we seldom need to keep data to the nanosecond, so this is usually just fine.

The exact decimals let us play with half-open intervals. One of the classic problems was to set up time slots with the old DATETIME data type and put events into them. Today, you can use a table of time slots set to one more decimal second of precision than your data. You can now use temporal math to add it to a DATE to TIME(1) get a full DATETIME2(0). Here is the basic skeleton.

Here is the basic query for filling a time slot from a DATETIME2(0) event.

Another classic T-SQL problem was a constraint to prevent a gap in a timeline of events:

As you can see, keeping valid time constraints is not easy!