December 04, 2013

Parsing postal addresses

The rapid pace of the world around us often makes us take many small miracles for granted. For example, I can go on Amazon Japan, order some gadget, and have it delivered to my door within days (eventually, within minutes). We only get to see the delivery guy (or girl, if you’re lucky) dropping off the package, but there are many things happening in the background:

  1. Amazon packages the item and prints my address on the outside.
  2. Amazon entrusts the package to some domestic delivery company (in Japan, typically Yamato or Sagawa).
  3. The delivery company interprets the address: in particular, it looks for the postcode, prefecture, city and street address.
  4. The delivery company, which has depots all around Japan, sends the package to the depot closest to me, based on the above.
  5. A driver from the depot goes out to the address and delivers the package.

In this article, I’d like to focus on the third step: interpreting an address. This isn’t something that is restricted to delivery companies: there are many applications that involve finding out where exactly something is. This particular task is known as geocoding and it is part of a broader problem area known as geolocation.

But let’s get back to the specific problem and formulate it. We have a string that contains an address - nothing but the address. For example:

Level 7
3 Thomas Holt Drive
North Ryde
NSW 2113, Australia

We’d like to interpret that address. More specifically, we’d like to determine:

  • the street number (3)
  • the street name (Thomas Holt Drive)
  • the city (North Ryde)
  • the state (New South Wales)
  • the postcode (2113)
  • the country (Australia)
  • any other relevant information (such as “Level 7” or PO box numbers)
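To make the target concrete, here is a minimal sketch of the structure a parser might return for the address above. The class name `ParsedAddress` and its field names are my own choices for illustration, not part of any standard or library.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ParsedAddress:
    """Hypothetical container for the components listed above."""
    street_number: Optional[str] = None
    street_name: Optional[str] = None
    city: Optional[str] = None
    state: Optional[str] = None
    postcode: Optional[str] = None
    country: Optional[str] = None
    extra: Optional[str] = None  # e.g. "Level 7" or a PO box number


# What we would like a parser to produce for the example address.
expected = ParsedAddress(
    street_number="3",
    street_name="Thomas Holt Drive",
    city="North Ryde",
    state="New South Wales",
    postcode="2113",
    country="Australia",
    extra="Level 7",
)
```

Every field is optional because real addresses routinely omit components, and the numbers are kept as strings because postcodes and street numbers are not always numeric.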

If that seems too easy, then this is the same address:

L7, 3 Thomas Holt Dr., Nth. Ryde, NSW 2113

It turns out that this is a fairly well-discussed topic on StackOverflow. Here’s a brief summary of the answers:

Online services tend to give the best results, but they aren't practical if, for example, you want to parse several million addresses quickly and cheaply. In that case, you'd have to reinvent the wheel.

Luckily, you don’t need to start from scratch. Many postal services maintain standards documents that dictate exactly how addresses need to be formatted (for example, USPS Publication 28). Publication 28 is a goldmine of information that contains goodies like:

  • Acceptable abbreviations for US street types and states
  • The expected order of address components
  • Mappings between English and Spanish address terms (for addresses in the US Territory of Puerto Rico).
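As a rough illustration of how such an abbreviation table might be used, the sketch below expands an abbreviated street type at the end of a street line. The table is a tiny hand-picked subset, and the names `STREET_SUFFIXES` and `expand_street_suffix` are hypothetical; a real parser would load the full list from the standard.

```python
# A small, hand-picked subset of street type abbreviations (assumption:
# a real table would be taken in full from the postal standard).
STREET_SUFFIXES = {
    "AVE": "AVENUE",
    "BLVD": "BOULEVARD",
    "DR": "DRIVE",
    "RD": "ROAD",
    "ST": "STREET",
}


def expand_street_suffix(street: str) -> str:
    """Expand an abbreviated street type if it is the last token."""
    tokens = street.split()
    if not tokens:
        return street
    # Ignore a trailing period and letter case when looking up the token.
    last = tokens[-1].rstrip(".").upper()
    if last in STREET_SUFFIXES:
        tokens[-1] = STREET_SUFFIXES[last].title()
    return " ".join(tokens)


print(expand_street_suffix("3 Thomas Holt Dr."))  # 3 Thomas Holt Drive
```

Only the last token is examined, which is a deliberate simplification: it avoids turning the "St" in "St Kilda Road" into "Street", but it will miss street types that are followed by a direction or unit.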

By treating these documents as specifications, it’s possible to write a fairly robust parser for a particular country. Furthermore, the specifications of different countries share many common points. For example:

  • Addresses in English-speaking countries generally follow the same order (street, city, state, postcode, country)
  • Street types are largely shared across English-speaking countries, from the usual “street”, “road” and “avenue” to the more exotic “boulevard” and “esplanade”.
  • British and Canadian postcodes follow a similar pattern
  • … and many more.
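To illustrate the postcode point, here is a hedged sketch that tells British and Canadian postcodes apart with two regular expressions. Both patterns are simplified approximations of my own: they check only the overall letter and digit shape and do not exclude the letters that the real schemes never use. The function name `classify_postcode` is hypothetical.

```python
import re
from typing import Optional

# Simplified shapes only; the real rules restrict which letters may appear.
UK_POSTCODE = re.compile(r"[A-Z]{1,2}[0-9][A-Z0-9]? ?[0-9][A-Z]{2}")
CA_POSTCODE = re.compile(r"[A-Z][0-9][A-Z] ?[0-9][A-Z][0-9]")


def classify_postcode(text: str) -> Optional[str]:
    """Return "GB", "CA" or None depending on the shape of the postcode."""
    candidate = text.strip().upper()
    if UK_POSTCODE.fullmatch(candidate):
        return "GB"
    if CA_POSTCODE.fullmatch(candidate):
        return "CA"
    return None


print(classify_postcode("SW1A 1AA"))  # GB
print(classify_postcode("K1A 0B1"))   # CA
print(classify_postcode("2113"))      # None
```

Both formats mix letters and digits in two short groups, which is why one small function can handle them side by side.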

That’s all for today, but in future articles, I’ll be looking at the actual mechanics of parsing an address, as well as other essential supplementary materials. Stay tuned!