The rapid pace of the world around us often makes us take many small miracles for granted. For example, I can go on Amazon Japan, order some gadget, and have it delivered to my door within days (eventually, within minutes). We only get to see the delivery guy (or girl, if you’re lucky) dropping off the package, but a lot happens in the background:
In this article, I’d like to focus on the third step: interpreting an address. This isn’t restricted to delivery companies: many applications need to find out exactly where something is. This particular task is known as geocoding, and it is part of a broader problem area known as geolocation.
But let’s get back to the specific problem and formulate it. We have a string that contains an address - nothing but the address. For example:
Level 7 3 Thomas Holt Drive North Ryde NSW 2113, Australia
We’d like to interpret that address. More specifically, we’d like to determine:
If that seems too easy, then this is the same address:
L7, 3 Thomas Holt Dr., Nth. Ryde, NSW 2113
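To get a feel for what it takes to recognise these two strings as the same address, here is a minimal sketch of abbreviation expansion in Python. The abbreviation table, the `L<number>` rule for levels, and the function name `normalize_address` are my own illustrative assumptions, hand-picked to fit this one example; they are not taken from any postal standard.

```python
import re

# Hypothetical, hand-picked abbreviation table (illustration only).
# A real table would be much larger and context-sensitive:
# "DR" can also mean "Doctor", for instance.
ABBREVIATIONS = {
    "DR": "DRIVE",
    "NTH": "NORTH",
}

# Assumed convention: "L7" is shorthand for "Level 7".
LEVEL_PATTERN = re.compile(r"^L(\d+)$")


def normalize_address(raw: str) -> str:
    """Return a crude canonical form of a single-line address."""
    # Upper-case the text and treat commas and periods as whitespace.
    cleaned = re.sub(r"[.,]", " ", raw.upper())

    tokens = []
    for token in cleaned.split():
        level = LEVEL_PATTERN.match(token)
        if level:
            # Expand "L7" into two tokens: "LEVEL", "7".
            tokens.extend(["LEVEL", level.group(1)])
        else:
            # Expand known abbreviations, keep everything else as-is.
            tokens.append(ABBREVIATIONS.get(token, token))

    # Drop a trailing country name so both variants line up.
    if tokens and tokens[-1] == "AUSTRALIA":
        tokens.pop()

    return " ".join(tokens)


long_form = "Level 7 3 Thomas Holt Drive North Ryde NSW 2113, Australia"
short_form = "L7, 3 Thomas Holt Dr., Nth. Ryde, NSW 2113"

print(normalize_address(long_form))
print(normalize_address(short_form))
print(normalize_address(long_form) == normalize_address(short_form))
```

Even this toy version only makes the two strings comparable; it says nothing yet about which tokens are the street, the suburb, or the postcode.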
The online services tend to give the best results, but it’s not practical to use them if, for example, you want to parse several million addresses quickly and cheaply. In that case, you’d have to reinvent the wheel.
Luckily, you don’t need to start from scratch. Many postal services maintain standards documents that dictate exactly how addresses need to be formatted (for example, USPS Publication 28). Such a document is a goldmine of information that contains goodies like:
By treating these documents as specifications, it’s possible to write a fairly robust parser for a particular country. Furthermore, the specifications of different countries share many common points. For example:
That’s all for today, but in future articles, I’ll be looking at the actual mechanics of parsing an address, as well as other essential supplementary materials. Stay tuned!