Simple SQL: Handling Location Datatypes

How do you record locations in SQL? Most relational database systems support spatial and geographical data, generally using the round-earth system based on the SQL specification of the Open Geospatial Consortium (OGC). However, this is not the only approach, as Joe Celko explains.

Where am I?

One of the big problems in data processing is in defining locations precisely. We need, for example to know where people live and where organizations operate. You might want to send them mail or packages; it could be that you need some idea of where they are so you can do GIS analysis on them. As an IT professional, there are some basic methods that you probably ought to know, even if this is not a major concern for you. Let’s do a quick overview.

Longitude and Latitude

Most of us are used to locating positions on earth with (longitude, latitude) pairs. This geographic coordinate system was printed in every globe we grew up with in every school room, and is still the most useful way of recording a location. The longitude specifies an East-West position relative from the prime Meridian, in terms of lines running from North pole to South pole. Technically, you are supposed to use the Greek letter lambda (λ) for a longitude, But I will be honest, I have never seen anyone do that. For historical reasons, the Prime Meridian (zero degrees longitude) is the one that passes through the Royal Observatory at Greenwich, England. This is where we got Greenwich Mean Time (GMT), GMT is a time zone officially used in many European and African countries, which is often confused with Universal Coordinated Time (UTC) which is a time standard that is the basis for civil time and time zones worldwide. But that’s another story.

The longitude of other places is measured as the angle east or west from the Prime Meridian, ranging from 0° at the Prime Meridian to +180° eastward and −180° westward. A location’s North –South position along a meridian is given by its latitude, which is approximately the angle between the 0° at the Equator to ±90° (North or South) at the poles. Lines of constant latitude, or parallels, run east–west as circles parallel to the equator. Latitude is used together with longitude to specify the precise location of features on the surface of the Earth.

But this traditional ‘round-earth’ coordinate system has problems. The first problem that all terrestrial coordinate systems have to work with the earth which is not really round. it is kind of an oblate spheroid and lumpy. Then, as you get closer to either the North or South Pole, the math involved uses spherical trigonometry. Spherical trig in a computer will involve floating-point numbers, and there will be some rounding errors that accumulate. In short, the math actually it is pretty ugly, which is why we make the computer do it. Or we just do not go to the North Pole very often.

In SQL Server, there is a geography CLR spatial data type, called geography that uses this round-earth coordinate system to store ellipsoidal (round-earth) data. It stores latitude and longitude coordinates.

And provides methods based on the Open Geospatial Consortium (OGC). We can show how to use these to calculate the distance between places. We’ll chose some locations in Ireland for our example.

Giving

the distance from Shannon to Kilkenny is 67.8293 miles

We convert meters to miles simply by multiplying by 0.000621371.
We can use this method to find out some of the longest distances in Ireland

Longitude: The True Story of a Lone Genius Who Solved the Greatest Scientific Problem of His Time by Dava Sobel is a great book that details how we learned to compute longitude by using highly accurate timepieces designed and constructed by John ‘longitude’ Harrison. It was also made into a film called ‘Longitude’ that is on Youtube. Latitude is much easier; you get out a lookup table and check the position of the sun and the stars. Today, you probably have an app for this on your cell phone.

Hierarchical Triangular Mesh (HTM)

An alternative system to the usual (longitude, latitude) pair is the Hierarchical Triangular Mesh or HTM for short. While (longitude, latitude) pairs are based on establishing a point in a two dimensional co-ordinate system on the surface of the earth, HTM is based on dividing the surface into almost-equal sized triangles with a unique identifier to locate something by a containing polygon. This approach is really good for satellites, which can look down on the earth, while the old system was something you could do if you are moving on the surface.

If you have seen a geodesic dome or Buckminster Fuller’s maps, you have some feel for this approach. HTM start with an octahedron at level zero. To map the globe into eight triangles, align it so that the world in first cut into a Northern and Southern hemisphere. Now slice it along the Prime Meridian, and then at right angles to both those cuts. In each hemisphere number the spherical triangles from 0 to 3, prefixing them with either N or S hemispheres.

A triangle on a plane always has exactly 180°, but on the surface of a sphere and other positively curved surfaces, it is always greater than 180° and less than 180° on negatively curved surfaces. If you want a quick mind tool, think that a positively curved surface has too much in the middle. A negatively curved surface is like a horse saddle or the bell of a trumpet; the middle of the surface is “too small” and curves the shape.

The eight spherical triangles are labeled N0 to N3 and S0 to S3 and are called “level 0 trixels” in the system. Each trixel can be split into four smaller trixels recursively. Put a point at the middle of each edge of the triangle. Use those three points to make an embedded triangle with great circle arc segments. This will divide the original triangle into four more spherical triangles at the next level down. Trixel division is recursive and smaller and smaller trixels and goes to any level you desire.

To name the new trixels, take the name of the containing trixel and append a digit from 0 to 3 to it, using a counterclockwise pattern. The (n = {0, 1, 2} point on the corner of the next level is opposite the same number on the corner of the previous level. The center triangle always gets a 3.

The triangles are close to the same size at each level. As they get smaller, the difference also decreases. At level seven approximately three-quarters of the trixels are slightly smaller than the average size, and a quarter are larger. The difference is because the three corner trixels (0,1,2) are smaller than trixel 3 in the center one. The ratio of the maximum over the minimum areas at higher depths is about two.

Smaller trixels have longer names and the length of the name gives its level.

The name can be used to compute the exact location of the three vertex vectors of the given trixel. They can be easily encoded into an integer identifier by assigning two bits to each level and encoding ‘N’ as 11 and ‘S’ as 10. For example, ‘N01’ encodes as binary 110001 or hexadecimal 0x31. The HTM ID (HtmID) is the number of a trixel (and its center point) as a unique 64 bit string (not all strings are valid trixels).

While the recursion can go to infinite, the smallest valid HtmID is 8 levels but it is easy to go to 31 levels in represent in 64 bits. Level 25 is good enough for most applications, since it is about 0.6 meters on the surface of the Earth or 0.02 arc-seconds. Level 26 is about 30 centimeters (one foot) on the earth’s surface. If you want to get any closer than that to an object, you are probably trying to put a bullet in it.

The What3Words System

This is the proprietary system of a British startup. The What3Words grid system divides the world into a grid of approximately 57 trillion 3 meter by 3 meter squares. Then each square is given a unique 3 word triplet as their address. The advantage is that this triplet is less ambiguous than a string of digits.

The company has a free mobile app or online map. It can also be embedded any other app, platform or website. The company is still growing, but their most unusual customer is probably Mongol Post, the Mongolian postal service. The country is not very populated, the roads often lack well-known names and the street addresses are unreliable. Now factor in the problems of having about a quarter of the mongols are nomadic. Now, instead of an address each 9-square-meter plot in Mongolia will receive its own three-word identifier.

The simple phrase is easy to remember and less likely to have same errors as a string of digits. But even if the speaker or typist makes an error, they can go to a spelling checker and try again. This is not really possible with numeric encodings. The products started with English, quickly followed by French, Spanish, Portuguese and Russian, and then Swahili, Turkish, Swedish and Italian. They now have Mongolian, Finnish, Arabic and Polish.

Here’s a quick sample of some famous locations on earth:

  • President of the United States => Engine.Doors.Cubs
  • British Prime Minister => Chief.Score.Locked.
  • Statue of Liberty => Planet.Inches.Most.
  • Great Pyramid at Giza => Dreaming.Munch.Magnetic.
  • Eiffel Tower => Graphics.Dads.Inched.
  • Golden Gate Bridge => Super.Skirt.Letter.
  • North Pole => [Not found].

The What3Words capability is now integrated within the ArcGIS platform as a locator and it is also available for QGIS. There is a web service that provides the mapping between Latitude/Longitude and What3Words location.

Standard Address Number (SAN) and Global Location Numbers (GLN)

The book business has had encoding schemes for organizations, book and journal publishers, wholesalers and distributors, book retailers and college bookstores, libraries, schools, universities, paper and cloth manufacturers, printers and binders, as well as anyone else involved in the manufacturing of books and journals. The best-known code is the International Standard Book Number (ISBN), which identifies books. It used to be a 10 digit system with a modulus-eleven check digit (before you ask, if the remainder is 10 you use X, the Roman numeral for the final position in a string). This old ISBN with a 13 digit string that has to begin with 978; it is part of the European Article Number (EAN), also called International Article Numbers, barcode system and you can see it by picking up any book you have got sitting around right now. The check digit changed, but the old ISBN is embedded in it.

The book industry also created the Standard Address Number (SAN) for use with various electronic interchange systems. The SAN consists of six digits, plus a modulus-eleven check digit, displayed with a hyphen following the third digit (XXX-XXXX) to facilitate transcription and look up. Bowker, the assigning agency, make sure they are unique and addresses are current. The SAN is not reassigned and stays constant., even if the unit involved moves or changes its name.

Today, the old SAN is being replaced with the Global Location Number (GLN, defined by ISO/IEC 6523) which can be derived from the SAN. It is made up of 13 digits and is sometime referred to as an EAN location number. If you already have a SAN, you can upgrade for free. The first few digits of the company prefix, followed by the location reference and finally ending with a check digit. It is now being used for more than just the book trade.

CASS and LACS Standards

CASS stands for the Coding Accuracy Support System (CASS). it is a tool created and used by the USPS to ensure the accuracy of software that taps into their database by standardizing the text of mailing addresses. They also certifies services to correct , validate and update mailing addresses, check spelling, proper abbreviations for postal units, ZIP Codes, and some other things.

Internally, the system uses DPV and LACS when checking addresses. DPV means Delivery Point Validation; this is what got a letter on your doorstep; a human did it. LACS means Locatable Address Conversion System; it is also called the 911 (pronounced nine-one-one, to avoid confusion with nine-eleven) because it is used in the US for emergency services to find a location can be found. This database is concerned with current addresses and not current residents because it has to be able to send out emergency services. The addresses originally could have been assigned by local governments, or even local customs, so finding anybody in an emergency was difficult.

In the national standards, building numbers originate from central locations, even and odd street numbers are on opposite sides of the street, street numbers increase in the uniform fashion, architectural features have uniform two-letter codes, compass points are defined and so forth. You could actually write a regular expression for standardized address now! I am old enough to remember when the post office would accept addresses with a state name on the last line and the name of the ranch on the second line (i.e. “King Ranch, Texas” is a classic).

It is important to note that LACS is not the same as NCOA. NCOA stands for National Change of Address, and it is the database that tracks forwarded addresses. A forwarding address it gets printed on a sticky label and put on the mail, but only for a short time. Unlike NCOA, LACS does not track the “who” of the address, just the “where.” Also unlike NCOA, LACS is very reliable.

Address validation can be done on international addresses as well, and in these cases, addresses are checked against the primary database in that country (Royal Mail in the UK, Deutsche Post in Germany, La Poste in France etc). In the US, that authority on mailing addresses is he United States Postal Service, which means that the most accurate way to validate a US address is to use a CASS certified provider.

Summary

It is easy to get trapped into one tool or model of data. I know it is a cliche, but everything looks like nail when all you have is a hammer. the longitude/latitude system is great f you are on the surface of the planet; HTM is great if you are a satellite in orbit; those three-word names are great if you are delivering packages by Yak in Mongolia. The nice part of having several systems is that if they measure the same attribute (such as geographical location), then they ought to be convertible from one to any other. The dangers are that if you store data in more than scale, then (1) you become the man with two or more wristwatches who is never quite sure what time it is (2) you waste time and resources doing conversions as you try to synch various tables n the same schema. Pick one scale and make it the official one from which all others are derived; this is what VIEWs and computed columns are meant to do.

References