July 23, 2009

A lesson in address storage

It's been a while since I've posted a blog entry because to be entirely honest, I've not had the opportunity to work on anything I considered blog-worthy.

Lately though, during my time on StackOverflow.com, I've noticed a number of questions regarding storage of address information - particularly with regards to storage of international addresses.

I've done a fairly significant amount of research on this and wanted to put together a single repository of go-to places for information so that I don't see the same questions come around time and again and I have to repeat the same information.

Firstly, if I see another database or suggestion of a schema that uses the format:

  • Address1
  • Address2
  • Address3
  • City
  • State
  • Zip
  • Country

I'm going to scream. Not only is this is a very U.S. centric way of storing addresses, but while it may be simple to accept the data it is all but impossible to QA it! I'm currently writing software for a logistics company that ships all over the world.  The database I've inherited stores addresses in this format and it's caused me so many headaches with data quality and verification that I want to slit my wrists on a daily basis.  Not to mention that most of the address data comes in from other companies through some quasi-EDI process that we have to be able to accept and enter into our system, which means that there is virtually no possibility of assuring the quality of the data to any reasonable level.

When I see databases designed like this I'm all but certain the designer never went through a relational database design course and has never heard of the concept of normalization. Either that or when they see address information they panic and throw everything they know about data normalization out of the window.

Let's take the following street address as example:

10 Maple Street

This is not one piece of information and thus it shouldn't be stored in one field. There are three distinct pieces of information: the street number, the street name and the street type. If you've have to parse this address from a postal database for the country you reside in on-the-fly, you're going to have performance problems if the data is all stored in a single field.

Let's take for instance, the Canada Post database, largely because that's the one I'm currently working with and because it's the smallest postal database I'm likely to be working with given that the population size of Canada is far smaller than those of either the U.S. or the UK. The performance impact of the issues I will address here will be greatly magnified in the larger address databases of those countries.

So firstly we have to parse out the pieces of address information, but how do we know which is which? It's easy to say grab the street number in that format, but what if we have:

1A-10 Maple St.

or

1A-10 Saint-Gerome St.

or

1A-10 Rue du Sainte-Gerome

or

10 Rue du Sainte-Gerome Apt 1A

Then you get into crazy addresses like

10 Rue du Saint-Gerome Apt 1A Rural Route 5

How are we going to parse addresses for quality control purposes in those instances? It quickly becomes a nightmare. Especially as you can't guarantee that information is going to be entered correctly - i.e. there could easily be typographic errors and spelling mistakes. There could also be information transposed in legitimate alternative formats.

For instance:

1A-10 Rue due Saint-Gerome RR#5

could easily be the same address as

10 Rue du Sainte-Gerome Apt 1A Rural Route 5

Notice the mis-spelling of Saint/Sainte in these two addresses, both are legitimite ways of spelling Saint(e). This could be spelled out long hand or short hand St. Ste. and then you have plurals too:

  • Saint
  • Saints
  • Sainte
  • Saintes
  • St.
  • Ste.
  • Sts.
  • Stes.

In addition to this, is St. Saint or Street? Is it actually Saint or does it take the feminine form Sainte? Should it be singular or plural?

Beware of other words that span gender and pluralization: Grand for instance can be spelled:

  • Grand
  • Grande
  • Grands
  • Grandes

In the Western world, we often steal names from other languages, consequently in the US you also have Spanish names to contend with so the rules for these also need to be taken into account.

If we had to parse the database for duplicate addresses, if the address was all stored in a single field "StreetAddress1" how on earth would we know? The simple answer is we wouldn't.

The fact is, that generic storage and QA of address information in a database is not a trivial matter, no matter how you look at it. This data should be normalized in the same way as any other data. A single piece of information should reside in a single field!

Once we've separated the data into distinct fields, the task of quality control becomes much simpler.

  • Street Number: 10
  • Street Type: Rue
  • (du) is a filler word that can safely be ignored
  • Street Name: Sainte-Gerome
  • Address Type: Rural Route
  • Address Type ID: 5

Now it's much easier to parse the information to check for correctness. If Saint appears in the street name, no matter how it's formed - we can validate it against the postal database using a simple regular expression comparison and replacement to the correct any spelling mistakes.

It's also much easier to check for typographic errors using distance algorithms and probability to correct spelling mistakes of a less specific nature. For instance, if the user types in Misisorga because they can't spell Mississauga, we could use an implementation of the Levenshtein algorithm to tell that the user probably meant Mississauga.

We can also test for probability of incorrect street names, numbers, postal codes.  For instance let's say I enter the address:

  • Street Number: 10
  • Street Name: Mississauga
  • Street Type: St.
  • Street Direction: S
  • City: Mississauga
  • Province: On
  • Postal Code: L5H 1L4

There is no such address according to the Canada Post database. We could just return an exception stating that this address is invalid, but that doesn't help the user any, what's wrong with the address? The postal code is incorrect for that street address, and should be: L5M 1K1. Of course, it might not be an incorrect postal code. So why not offer some alternatives to the user to make their life easier?

Assuming that the street address line is correct, the postal code is incorrect and should be L5M 1K1. But what if we make the assumption that the postal code is correct? There is also a 10 Queen St. W that matches this postal code, so maybe the user meant West instead of South. So there are equal possibilities that the postal code or the street direction could be right or wrong.

We could complicate matters a little further by analysing on a more detailed level, for instance S is a single keystroke that could be a simple typo, however the Levenshtein distance between the two postal codes (a distance of 4, rather than a distance of 1 between the S and W) shows that the user has provided far more than a typo. So in this instance, I would be inclined to think that the error in this address is more likely to be a typographic error (S instead of W) rather than an incorrect entry of a postal code.

During the course of my research, I've found that the most flexible generic format for address data is this:

  • Street Number [Int]
  • Street Number Suffix [VarChar] - A~Z 1/3 1/2 2/3 3/4 etc
  • Street Name [VarChar]
  • Street Type [VarChar] - Street, Road, Place etc. (I've found 262 unique street types in the English speaking world so far... and still finding them)
  • Street Direction [VarChar] - N, NE, E, SE, S, SW, W, NW
  • Address Type [VarChar] - For example Apartment, Suite, Office, Floor, Building etc.
  • Address Type Identifier [VarChar] - For instance the apartment number, suite, office or floor number or building identifier.
  • Minor Municipality (Village/Hamlet) [VarChar]
  • Major Municipality (Town/City) [VarChar]
  • Governing District (Province, State, County) [VarChar]
  • Postal Area (Postal Code/Zip/Postcode)[VarChar]
  • Country [VarChar]

Another caveat to be aware of is that most postal systems allow for alternate names - for instance, here in Mississauga we have a sub-municipality called Streetsville. Everyone that lives there refers to it as Streetsville and use Streetsville as the City/Town. It's actually Mississauga according to Canada Post, but Streetsville is a perfectly acceptable location to be used as part of the address, so this needs to be considered also.

150 Queen St S
Streetsville
ON L5M 1K8

and

150 Queen St S
Mississauga
ON L5M 1K8

As you might guess are exactly the same address, and when QAing our database, we need to be able to determine that.

Another thing you need to be aware of is streets that use street types as names. In Toronto for instance, there is a street called Avenue Road, and in Calgary there's a street called Centre Avenue. If you have all your street information in one field, there is no way to determine which word is the street name and which word is the street type. If you have the fields separated, it makes life much easier.

The key thing to note about different address systems around the world is that most of them store very similar pieces of information. It might look different when you see it printed out on an envelope, but the pieces of information when broken down are the same pieces. They're just laid out differently on the page - that is just formatting. Remember it is also far easier to store a simple address in a complex table than it is to store a complex address in a simple table. It is also far more of a pain to interpret programmatically.

Okay, this post is getting on to be long enough, so I'll end it there, and will post more on this topic later. This should be enough to scare you into not combining data fields in addresses for now, and that was my goal.