Converting Manually-Entered Strings to Dates

Before computers, the British, US and Canadians all entered dates with the full month-name first, followed by the day number, followed by an ordinal suffix (1st, 2nd, 3rd and 4th etc) followed by the year. It was generally the full four-figure year. I know all this because I’ve just been converting many thousands of these dates from transcribed historical records.  They are the dates from a whole collection of the transcriptions of old newspaper stories going back over a couple of hundred years. At first glance, I reckoned that there would be a good chance that the Convert/try_convert function would deal with this. It does so up to a point. 

It doesn’t mind the month being spelled out in full, and is relaxed about the month preceding the day, but it shies like a frightened horse at the suffix. We can produce the format, we just can’t read it.

Aha! you think, just remove these ordinal suffixes. Not so fast. try to remove ‘st ‘  using the Replace() function and ‘August’ becomes ‘Augu’. You spot that some dates just don’t have a day, because they apply to the whole month. Also, you find that due to typing errors, and the randomness of human behaviour at times, stuff creeps in to manually-entered dates such as dots instead of spaces and appendages to the year. Suddenly, you are going for a drive down complexity avenue, a route with little hope of a U turn. You are faced with either correcting the source (sorry, too many) or producing a string created from the data elements surgically extracted from the slightly erratic version, that can then be passed to the Convert function to deal with. 

One of the joys of having  26,000 date strings to deal with is that you have plenty of test data for a function.  Curiously, the combination of PatIndex() and Substring() can be quite powerful. The trick is to provide defaults. You have to ensure that PatIndex always succeeds, by providing a default that passes the test attached to the end of the string. You need to add a space to the beginning to deal with untrimmed strings. 

It’s time to create a simple test harness and work away until  all 26,000 dates can be parsed. There were a very few that had to be corrected at source, but otherwise it went well.

Having go this far, we can create a function. I hate doing this until I’m confident that something works well because functions, I believe, can only be debugged in a test harness. Once you’ve glued it into a function, you’re more limited.

As you’ll see from the examples, this is pretty lenient about the format of the date. It had to be, because correcting all the dates at source wasn’t an option.

With this done, search results could be retrieved on date ranges, which is a great bonus. 

I’m much happier with regex queries than the SQL wildcard convention. However, with some ingenuity they can be made to work. By providing defaults, you can cut down the complexity of the logic but there is always a danger of introducing errors that in some circumstances would be damaging. In my case, this would only apply to a faulty reading of the year and a default year of 2020 would, in this case, retrieve the worst critters. 

Testing  of this has to be done by eyeball. I use random samples of 200 records. it could be done with a function that uses an entirely different algorithm to parse a date that is liable to manual errors,  but for the life of me I can’t think of one so liberal!