On Handling Dates in SQL

The calendar is inherently complex by the very nature of the astronomy that underlies the year, and the conflicting historical conventions. The handling of dates in TSQL is even more complex because, when SQL Server was Sybase, it was forced by the lack of prevailing standards in SQL to create its own ways of processing and formatting dates and times. Joe Celko looks forward to a future when it is possible to write standard SQL date-processing code with SQL Server.

“I’ve been on a calendar, but I’ve never been on time.”
–Marilyn Monroe

The most international of all units of measurement on earth is the Common Era Calendar. This used to be known as the “Gregorian Calendar”, “Western Calendar” or “Christian Calendar “when it was issued by Pope Gregory in 1582. And the original company is still in business. There have been no upgrades since then, unless you want to count adding local names for the months. For example, Czech months  are of Slavic origin (Åíjen), whereas the Slovak months are of Latin origin (Október), but most languages use the Latin roots. But I am in database, so I only care about the subset of ISO-8601 display formats used in my language. The only use of the term ‘ field ‘ in SQL is for the parts of a timestamp (YEAR, MONTH, DAY, HOUR, MINUTE, SECOND ); for this article, all I care about is the DATE set of fields (YEAR, MONTH, DAY ). Do not panic, this subset is good enough for most commercial work. Those of you who deal with museums, talk to me off line.

Today, every SQL implementation has a DATE data type, but the functions available for them vary quite a bit. The most common ones are a constructor that builds a date from integers or strings; extractors to pull out the month, day, or year; and some display options to format output. Do not mimic 1970’s COBOL and format data for display, using a proprietary vendor feature.

In Standard SQL, the constructor is “CAST (<string expression> AS DATE)” for a string expression. Hey, we have this in T-SQL now! In Standard SQL, the only ISO-8601 format “yyyy-mm-dd” allowed for date values. Hey, we have this is how T-SQL displays by default in the new DATE, TIME and DATE TIME2(n)! Do you think that Microsoft is trying to catch up? The ANSI Standard also has “DATE (<string expression>)” option lacking in T-SQL, but that is simple syntactic sugar.

The ISO-8601 Standard has other date and time formats which are not ambiguous and could be added to the strings recognized by SQL. But we had to make a decision for the SQL Standard, so that we have one and only one way to work with dates. This avoids problems like confusing the British (“dd/mm/yy“), American (“mm/dd/yy“) and other national traditional shorthands. This format is also embedded in lots of other ISO Standards, hardware counters and anything that has to be sorted in temporal order. We picked the most readable, most universal one we had.

SQL Server still has the 1970’s Sybase/UNIX convert() string function. This is because when it was written, we did not know how to use tiered architectures. Client/Server was brand new! We had decades of monolithic COBOL as the only programming model (yes, I am that old). This means that all the display formatting is mixed in the the other code for data retrieval, computations, etc. To accommodate this, Sybase added convert() with local, non-standard displays for dates. There was no concept of independent presentation layers, separated from the database layer, doing local translations, computations and all the display work.

Sybase also added MONEY as a data type to put punctuation marks (currency symbol, comma, decimal point) in currency display. MONEY is another issue; it does not do correct math, does not port and has illegal precision for many GAAP and EU regulations. This was done with a PICTURE spec in COBOL (http://en.wikipedia.org/wiki/Picture_clause), if you want to see where Sybase got this.

The ANSI/ISO extractor for is the function EXTRACT(<extract field> FROM <extract source>)

The extract field options for dates are a little easier to see this in a table

Meaning of <primary date field>

Keyword

Year in Common Era Calendar (returns 0001-9999)

YEAR

Month within year (returns 01-12)

MONTH

Day within month (returns 01-31)

DAY

The obvious ordering is from most significant to least significant: YEAR,MONTH and DAY .

The date fields contain non-negative integer values, constrained by the natural rules for dates using the Common Era calendar. Please note a few things you might not think about if you are not a “data junkie” who stayed up to watch the leap-second one night.

There is no year zero. There are no negative years, either. The correct postfix is “CE”, and “BCE” for “Common Era” and “Before Common Era” dates, replacing AD (“Anno Domini ‘) and BC (“Before Christ).

Standard SQL also has an INTERVAL data type. This is a “mixed measurement”, made up of temporal fields, just as we Americans mix feet and inches. An INTERVAL is a period of time, such as “3 years 2 months” or “90 days”; in T-SQL we have to use a single unit of temporal measurement in a parameter in an old Sybase function call to DATEADD() , DATEDIFF() and so forth.

The reason for the Sybase model is historical. The temporal data types were not in the Standards and the Sybase clock was based on a floating point number in the hardware. Following the UNIX model, we used clock ticks on the old 16-bit hardware. It works well with C programs, too.

There are effectively two kinds of INTERVAL sub-types for using with dates (there are more for time fields , but I am not talking about it here).

  1. YEARMONTH contains one or both YEAR and MONTH fields.
  2. DAY ” contains a positive or negative number of days. This looks like what we do in T-SQL, but without the mixed units and different syntax. .

The reason for the two interval types is to avoid the ambiguity that would arise if a MONTH value was combined with a field of lower significance, e.g. DAY, given that different months contain differing numbers of days. For example, the interval “2 months and 10 days” could vary between 69 and 72 days in length, depending on the months involved. Therefore, to avoid unwanted variations in the downstream arithmetic etc., the variable length MONTH component may only exist at the lowest significance level in an INTERVAL .

The value of a MONTH field, which is not in the leading field position, is constrained between 0 and 11, in an INTERVAL (and not between 1 and 12 as in a DATE). You have to round up a dozen months to a year, just you do not describe yourself as “five foot, 1 fourteen inches” on an application.

You can assume that your SQL implementation has simple date arithmetic functions, although with different syntax from product to product. The basic functions you need are just those that work with dates:

  • A date plus or minus an interval of days yields a new date.
  •  A date minus a second date yields an interval of days.

Here is a table of the valid combinations of <date> and <interval> data types in Standard SQL:

ANSI/ISO Standard SQL Math

<date> – <date> = <interval>

<date> + <interval> = <date>

<interval> (* or /) <numeric> = <interval>

<interval> + <date> = <date>

<interval> + <interval> = <interval>

<numeric> * <interval> = <interval>

There are other rules, which deal with time zones and the relative precision of the two operands that are intuitively obvious. You can also “CAST (<string expression> AS INTERVAL <interval type>)” if you need it.

There should also be a function that returns the current date from the system clock. This function has a different name with each vendor: TODAY, SYSDATE, NOW(), CURRENT DATE and getdate() are some examples. The correct keyword is CURRENT_TIMESTAMP for the system timestamp and CURRENT_DATE for the system date. T-SQL now has CURRENT_TIMESTAMP to replace the 1970’s getdate() from Sybase. Microsoft has been bad about getting up to specs, so T-SQL will still generate the old getdate() function call. Try to avoid it so you will look like one of cool kids and hope Microsoft keeps making progress. You can also use CAST (CURRENT_TIMESTAMP AS DATE) for CURRENT_DATE .

Leap Year

You ought to know better, but I am posting a warning anyway. You might remember being told in grade school that there are 365 ¼ days per year and that the accumulation of the fractional day creates a leap year every four years. Once more, your teachers lied to you; there are really 365.2422 days per year. Every four years, the extra 0.2400 days accumulate and create an additional day; this gives us a leap year. Every 400 years the extra 0.0022 days accumulate enough to create an additional day and give us this special leap year. Since most of us are not over 400 years old, we did not have to worry about this until the year 2000. However, every 100 years the missing 0.01 days (i.e. 365.25 – 365.2422 rounded up) balances out and we do not have a leap year.

The correct test for leap years is:

People who did not know this algorithm wrote lots of programs. I do not mean COBOL legacy programs in your organization; I mean packaged programs for which you paid good money. The date functions in the first releases of Lotus, Excel and Quattro Pro did not handle the day 2000-02-29 correctly. Lotus simply made an error and the others followed suit to maintain “Lotus compatibility” in their products. Microsoft Excel for Windows Version 4 shows correctly that the next day after 2000-02-28 is 2000-03-01. However, it thought that the next day after 1900-02-28 is also February 29 instead of March 01. Microsoft Excel for Macintosh did not handle the years 1900-1903.

Do not worry about working with other products and Standard SQL. I just want to help you read Standard SQL, but we are not done yet. The ISO-8601 Standards include an ordinal date and a week date format.

Daily Exercise

If you do not like the way that Pope Gregory arranged the months, do not blame him. He inherited that stuff from the Romans. The simplest way to label the days of the year is to number the days from 1 to 365 (or 366 in a leap year) without any further aggregations. This is called an “Ordinal date” and it is formatted as "yyyy-ddd‘ where ddd is an integer string with leading zeros between ‘001’ to ‘366’ when displayed. It is not called the “Julian date”, contrary to popular belief. The Julian date is a special count of days used by astronomers based on the Julian Day Number (JDN) (http://en.wikipedia.org/wiki/Julian_date). The Julian day number zero begins with 4714-11-24 BCE, it and gets big in modern times. For example, the Julian day number for 2000-01-01 was 2,451,545.

You probably want to use a ‘Julianized’ date. This sequence starts with some date that makes sense for your enterprise, say a century or two. Obviously this numbering will make math easier. But good trick is the ‘Julianized’ business date. Instead of using a monotonic increasing sequence, use an increasing sequence (i.e. it has duplicated values). This is easier to explain with an example. Assume it is 2007 and our company takes Easter as a long holiday (this is a year with Catholic and Orthodox calendars agree).

To compute the business days from Thursday of this week to next Tuesdays:

Do not try to compute holidays with temporal math functions. Look at the formula for Easter; it is mess. And which Easter did you mean? If you are an American, do you remember all the holidays for the fifty states? When did we create President’s Day from Washington’s and Lincoln’s Birthday?

The DATE data type is only three bytes, and a century worth of dates is only 36,525 rows. This is cheap and small for today’s computers. The table is read-only forever and is indexed on the “cal_date” with other display formats as INCLUDE columns on the index. But more than that, the common Business Calendar prevents bad data. I worked with a Greek and a Russian Orthodox programmer who had gotten their Easter list from their priest. I had two matching answers, so I accepted the it. But other calendars in the company were done by Catholics. You can see the problems.

A Week Defense

In most SQL products there is a vendor function to return the day of the week from a date, which is sometimes called DOW() or WEEKDAY (). The underlying formula is called Zeller’s congruence (http://en.wikipedia.org/wiki/Zeller%27s_congruence) which has versions for both the old Julian and the Common Era calendars. This was what we used in the old days because the math was hidden in a procedure call in C. We all had a copy of “Standard C Date/Time Library” by Lance Latham, (ISBN : 087930-496-0 or ASIN: B00336ESWE now on Kindle).

The problem is that Zeller’s formula returns a number between zero and six – it has a MOD() function in it. Too bad this wrong. The  ISO Standard Day-of-Week is defined as 1 = Monday to 7 = Sunday. The Microsoft DATE PART() also returns a number from 1 to 7, but with the default of 1 = Sunday to 7 = Saturday. As a generalization, ISO prefers to use zero as a value for a missing or unknown value in an encoding scheme. I will explain why that is useful a little later in this article.

The ISO-8601 Standard includes a “week date” format based on the day-within-week within year that is popular in Nordic countries. This format is ‘yyyyWww-d‘ where yyyy is the year, W is a separator token, ww is (01-53) week number and d is (1-7) day of the week, as we just discussed. To make this easy to validate, use:

Weeks start with Monday. The first week of a year is the week that contains the first Thursday (and, hence, January 4th) of their year. There are several websites with calendars you can cut and paste, but you can start your search with: Week Numbers for 2014. Now look at this year and notice that 2014W52 ends in 2014, but 2015W01 overlaps into 2015.

Week

Start Date

End Date

1

2014-12-30

2014-01-05

2

2014-01-06

2014-01-12

3

2014-01-13

2014-01-19

4

2014-01-20

2014-01-26

5

2014-01-27

2014-02-02

6

2014-02-03

2014-02-09

7

2014-02-10

2014-02-16

8

2014-02-17

2014-02-23

9

2014-02-24

2014-03-02

10

2014-03-03

2014-03-09

11

2014-03-10

2014-03-16

12

2014-03-17

2014-03-23

13

2014-03-24

2014-03-30

14

2014-03-31

2014-04-06

15

2014-04-07

2014-04-13

16

2014-04-14

2014-04-20

17

2014-04-21

2014-04-27

18

2014-04-28

2014-05-04

19

2014-05-05

2014-05-11

20

2014-05-12

2014-05-18

21

2014-05-19

2014-05-25

22

2014-05-26

2014-06-01

23

2014-06-02

2014-06-08

24

2014-06-09

2014-06-15

25

2014-06-16

2014-06-22

26

2014-06-23

2014-06-29

27

2014-06-30

2014-07-06

28

2014-07-07

2014-07-13

29

2014-07-14

2014-07-20

30

2014-07-21

2014-07-27

31

2014-07-28

2014-08-03

32

2014-08-04

2014-08-10

33

2014-08-11

2014-08-17

34

2014-08-18

2014-08-24

35

2014-08-25

2014-08-31

36

2014-09-01

2014-09-07

37

2014-09-08

2014-09-14

38

2014-09-15

2014-09-21

39

2014-09-22

2014-09-28

40

2014-09-29

2014-10-05

41

2014-10-06

2014-10-12

42

2014-10-13

2014-10-19

43

2014-10-20

2014-10-26

44

2014-10-27

2014-11-02

45

2014-11-03

2014-11-09

46

2014-11-10

2014-11-16

47

2014-11-17

2014-11-23

48

2014-11-24

2014-11-30

49

2014-12-01

2014-12-07

50

2014-12-08

2014-12-14

51

2014-12-15

2014-12-21

52

2014-12-22

2014-29-28

1

2014-12-29

2015-01-04

Report Period Tables

Since SQL is a database language, we prefer to do look-ups and not calculations. They can be optimized while temporal math messes up optimization. A useful idiom is a report period calendar that everyone uses so there is no way to get disagreements in the DML. The report period table gives a name to a range of dates that is common to the entire enterprise and which applies to a particular report’s reporting periods.

These report periods can overlap, have gaps or no particular pattern at all. I like the MySQL convention of using double zeroes for whole months and years. That uses ‘yyyy-mm-00‘ for a month within a year and ‘yyyy-00-00‘ for the whole year. The patterns for using LIKE are '[12][0-9][0-9][0-9]-00-00' and ‘[12][0-9][0-9][0-9]-[01][0-9]-00' in a validation.

The advantages are that it will sort with the ISO-8601 data format required by Standard SQL and it is language independent. Using the Date_Ordering CHECK() constraint will help the optimizer as well as preventing unintended time travel.

Another useful trick is the ISO half-open interval model of time. Time moves in one direction, forward., into an unknown future. This is why an event will start before it ends and why we might not have a known ending date.

Unlike the IEEE 754 Floating Point Standards, we do not have a “plus and minus eternity” to correspond to their numeric  plus and minus infinities. But we can use a CAST (NULL AS DATE) in the end_date column and a non-NULL start_date. The trick is to remember to coalesce that NULL in queries:

Reading Assignment:

Here are a few books for a quick reading list.

  • Humanity’s Epic Struggle to Determine a True and Accurate Calendar: by David Ewing Duncan; ISBN-13: 978-0380975280.
  • The Oxford Companion to the Year: An Exploration of Calendar Customs and Time-Reckoning by Bonnie Blackburn; ISBN-13: 978-0192142313.
  • Mapping Time: The Calendar and Its History by E. G. Richards; ISBN-13: 978-0192862051.
  • The Seven Day Circle by E. Zerubavel; ISBN-13: 978-0029346808. This is the detailed history of the week and other units less than a month in many calendars.