July 23, 2009

A lesson in address storage

It's been a while since I've posted a blog entry because to be entirely honest, I've not had the opportunity to work on anything I considered blog-worthy.

Lately though, during my time on StackOverflow.com, I've noticed a number of questions regarding storage of address information - particularly with regards to storage of international addresses.

I've done a fairly significant amount of research on this and wanted to put together a single repository of go-to places for information so that I don't see the same questions come around time and again and I have to repeat the same information.

Firstly, if I see another database or suggestion of a schema that uses the format:

  • Address1
  • Address2
  • Address3
  • City
  • State
  • Zip
  • Country

I'm going to scream. Not only is this is a very U.S. centric way of storing addresses, but while it may be simple to accept the data it is all but impossible to QA it! I'm currently writing software for a logistics company that ships all over the world.  The database I've inherited stores addresses in this format and it's caused me so many headaches with data quality and verification that I want to slit my wrists on a daily basis.  Not to mention that most of the address data comes in from other companies through some quasi-EDI process that we have to be able to accept and enter into our system, which means that there is virtually no possibility of assuring the quality of the data to any reasonable level.

When I see databases designed like this I'm all but certain the designer never went through a relational database design course and has never heard of the concept of normalization. Either that or when they see address information they panic and throw everything they know about data normalization out of the window.

Let's take the following street address as example:

10 Maple Street

This is not one piece of information and thus it shouldn't be stored in one field. There are three distinct pieces of information: the street number, the street name and the street type. If you've have to parse this address from a postal database for the country you reside in on-the-fly, you're going to have performance problems if the data is all stored in a single field.

Let's take for instance, the Canada Post database, largely because that's the one I'm currently working with and because it's the smallest postal database I'm likely to be working with given that the population size of Canada is far smaller than those of either the U.S. or the UK. The performance impact of the issues I will address here will be greatly magnified in the larger address databases of those countries.

So firstly we have to parse out the pieces of address information, but how do we know which is which? It's easy to say grab the street number in that format, but what if we have:

1A-10 Maple St.

or

1A-10 Saint-Gerome St.

or

1A-10 Rue du Sainte-Gerome

or

10 Rue du Sainte-Gerome Apt 1A

Then you get into crazy addresses like

10 Rue du Saint-Gerome Apt 1A Rural Route 5

How are we going to parse addresses for quality control purposes in those instances? It quickly becomes a nightmare. Especially as you can't guarantee that information is going to be entered correctly - i.e. there could easily be typographic errors and spelling mistakes. There could also be information transposed in legitimate alternative formats.

For instance:

1A-10 Rue due Saint-Gerome RR#5

could easily be the same address as

10 Rue du Sainte-Gerome Apt 1A Rural Route 5

Notice the mis-spelling of Saint/Sainte in these two addresses, both are legitimite ways of spelling Saint(e). This could be spelled out long hand or short hand St. Ste. and then you have plurals too:

  • Saint
  • Saints
  • Sainte
  • Saintes
  • St.
  • Ste.
  • Sts.
  • Stes.

In addition to this, is St. Saint or Street? Is it actually Saint or does it take the feminine form Sainte? Should it be singular or plural?

Beware of other words that span gender and pluralization: Grand for instance can be spelled:

  • Grand
  • Grande
  • Grands
  • Grandes

In the Western world, we often steal names from other languages, consequently in the US you also have Spanish names to contend with so the rules for these also need to be taken into account.

If we had to parse the database for duplicate addresses, if the address was all stored in a single field "StreetAddress1" how on earth would we know? The simple answer is we wouldn't.

The fact is, that generic storage and QA of address information in a database is not a trivial matter, no matter how you look at it. This data should be normalized in the same way as any other data. A single piece of information should reside in a single field!

Once we've separated the data into distinct fields, the task of quality control becomes much simpler.

  • Street Number: 10
  • Street Type: Rue
  • (du) is a filler word that can safely be ignored
  • Street Name: Sainte-Gerome
  • Address Type: Rural Route
  • Address Type ID: 5

Now it's much easier to parse the information to check for correctness. If Saint appears in the street name, no matter how it's formed - we can validate it against the postal database using a simple regular expression comparison and replacement to the correct any spelling mistakes.

It's also much easier to check for typographic errors using distance algorithms and probability to correct spelling mistakes of a less specific nature. For instance, if the user types in Misisorga because they can't spell Mississauga, we could use an implementation of the Levenshtein algorithm to tell that the user probably meant Mississauga.

We can also test for probability of incorrect street names, numbers, postal codes.  For instance let's say I enter the address:

  • Street Number: 10
  • Street Name: Mississauga
  • Street Type: St.
  • Street Direction: S
  • City: Mississauga
  • Province: On
  • Postal Code: L5H 1L4

There is no such address according to the Canada Post database. We could just return an exception stating that this address is invalid, but that doesn't help the user any, what's wrong with the address? The postal code is incorrect for that street address, and should be: L5M 1K1. Of course, it might not be an incorrect postal code. So why not offer some alternatives to the user to make their life easier?

Assuming that the street address line is correct, the postal code is incorrect and should be L5M 1K1. But what if we make the assumption that the postal code is correct? There is also a 10 Queen St. W that matches this postal code, so maybe the user meant West instead of South. So there are equal possibilities that the postal code or the street direction could be right or wrong.

We could complicate matters a little further by analysing on a more detailed level, for instance S is a single keystroke that could be a simple typo, however the Levenshtein distance between the two postal codes (a distance of 4, rather than a distance of 1 between the S and W) shows that the user has provided far more than a typo. So in this instance, I would be inclined to think that the error in this address is more likely to be a typographic error (S instead of W) rather than an incorrect entry of a postal code.

During the course of my research, I've found that the most flexible generic format for address data is this:

  • Street Number [Int]
  • Street Number Suffix [VarChar] - A~Z 1/3 1/2 2/3 3/4 etc
  • Street Name [VarChar]
  • Street Type [VarChar] - Street, Road, Place etc. (I've found 262 unique street types in the English speaking world so far... and still finding them)
  • Street Direction [VarChar] - N, NE, E, SE, S, SW, W, NW
  • Address Type [VarChar] - For example Apartment, Suite, Office, Floor, Building etc.
  • Address Type Identifier [VarChar] - For instance the apartment number, suite, office or floor number or building identifier.
  • Minor Municipality (Village/Hamlet) [VarChar]
  • Major Municipality (Town/City) [VarChar]
  • Governing District (Province, State, County) [VarChar]
  • Postal Area (Postal Code/Zip/Postcode)[VarChar]
  • Country [VarChar]

Another caveat to be aware of is that most postal systems allow for alternate names - for instance, here in Mississauga we have a sub-municipality called Streetsville. Everyone that lives there refers to it as Streetsville and use Streetsville as the City/Town. It's actually Mississauga according to Canada Post, but Streetsville is a perfectly acceptable location to be used as part of the address, so this needs to be considered also.

150 Queen St S
Streetsville
ON L5M 1K8

and

150 Queen St S
Mississauga
ON L5M 1K8

As you might guess are exactly the same address, and when QAing our database, we need to be able to determine that.

Another thing you need to be aware of is streets that use street types as names. In Toronto for instance, there is a street called Avenue Road, and in Calgary there's a street called Centre Avenue. If you have all your street information in one field, there is no way to determine which word is the street name and which word is the street type. If you have the fields separated, it makes life much easier.

The key thing to note about different address systems around the world is that most of them store very similar pieces of information. It might look different when you see it printed out on an envelope, but the pieces of information when broken down are the same pieces. They're just laid out differently on the page - that is just formatting. Remember it is also far easier to store a simple address in a complex table than it is to store a complex address in a simple table. It is also far more of a pain to interpret programmatically.

Okay, this post is getting on to be long enough, so I'll end it there, and will post more on this topic later. This should be enough to scare you into not combining data fields in addresses for now, and that was my goal.

12 comments:

  1. After a short search through google, this article caught my eye, especially since it is so new. I have been looking for a better way of storing address that are going to comply better with the mass amount of EDI imports our company does. This seems like an excellent approach for a database design, however do our users really want to split what used to be 5 fields into a possible 12? Maybe over time the user interface can get smarter and more suggestive based on other fields already input.


    I am really interested in your design idea for only a US address.

    ReplyDelete
  2. Well, much like any other postal system, if you want to maintain data integrity, you need to train your users to enter data correctly and give them every opportunity to enter that data correctly. The fact is that right now, the users are lazy, they're used to being able to enter the data in whatever form they like. Meanwhile you're pulling your hair out trying to validate that data - and there's only so many smarts you can add to the system.

    Regular expressions, common formats and distance algorithms can only take you so far - especially if you want to get any kind of performance out of the system. You've got 50 states - 52 if you count Puerto Rico and the Virgin Islands, and then Military Zones too that are part of the U.S. postal system.

    You can continue to allow them to enter data into 5 fields but how many ways can you write the address line?

    Let's take something simple like an apartment in a building on a rural route:

    12A-150 Hwy 12 E Rural Route 1

    Okay, I lied, it's not simple at all, how many ways can this be written legitimately.

    150 Hwy 12 E, Apt 12A, RR#1
    Apt 12A, 150 Hwy 12 East, RR#1
    Apartment 12A, 150 Hwy 12 E, Rural Route 1
    RR1, 150 Hwy 12 East, Apt 12A

    and as many variations as you can think of - and users will enter them however they see fit at the time.

    How do you prevent duplicate addresses in a scenario that allows free text? You can't. There's no simple or smart way of guessing what the user means if you have to programmatically pull apart a sentence to figure it out.

    As humans, it's very easy for us to pull the pieces out and separate them into logical pieces but try converting what you can do in your head into a programmatic expression to separate the pieces into separate fields, check for spelling mistakes, suggest corrections etc.

    The simplest way is to split the pieces into separate fields and train the users not to be so lazy. It doesn't take much longer to hit the tab key between fields - sure it may be a little inconvenient at first but if your input form is designed properly it should be painless enough. It will also allow you to make the system more user friendly too, like suggest spelling corrections and zip code corrections.

    If I put my UX hat on, I can completely understand the user's viewpoint of only wanting 5 fields - but we can make the system far more robust, more user friendly and more useful to the user if they concede to splitting the fields up.

    There's no way we could've checked a zip code against a street address and suggested that the street number they've entered should probably be 100 instead of 10 - or the last 4 of the zip code they've entered is incorrect for the street name and suggest a correction - or the city name isn't spelled the way they entered it for the state/zip and we can actually tell them how it should be spelled rather than spit out a clueless error that the city is invalid and make them go hunting for the answer.

    So the extra (minor amount of pain) is worth the huge potential gains with little extra processing. It also makes it far easier for us to maintain the integrity of the address database and don't end up with many entry variations for a single address.

    ReplyDelete
  3. Your system is fairly flexible, but I'm not sure it's totally there. The subject is pretty complex.

    For instance, you have Address Type and Address Type Identifier, so your system can handle "1234 Main Street, Suite 100". But you might have multiple buildings sharing a street number, and each building might have a suite 100. In addition, within one suite, there could be mail stops (if it's a large enough building to need that). So you might have an address of the form: "1234 Main Street, Building A, Suite 100, Mail Stop 3". In other words, for a given street number, you kinda need multiple pairs.

    Also, regarding the general issue of splitting things up into finer-grained fields, you suggest an expanded set of fields (including street number, street name, street type) and then say, "Now it's much easier to parse the information to check for correctness." I don't mean to be picky, but *of course* it's easier to parse, because you've sort of punted on the problem of parsing. Parsing is the process of taking unstructured data and recovering the structure. You've basically said "my data is parsed by definition". I think it's important to recognize that that complexity exists somewhere. Maybe you're designing a database schema AND designing the UI by which data can be entered (as would often be the case in, say, web development), so you have the ability to move that complexity into the user experience. Might even be a good idea. Sadly, you don't always have the ability to dictate that the data you receive should be more structured.

    ReplyDelete
  4. Thanks for your comment Logan, this is an address format I haven't come across before and I'm not sure how common it is. It is certainly something I will have to consider for inclusion though. I appreciate your drawing my attention to it.

    With regards to the the comment you make about "parsing". In the context of finer granularity - "parsing" takes on a different role, instead of parsing out data format, I'm parsing for data correctness - i.e. spelling and validity. For instance, does the postal code tie up with the street address? is the street name spelled correctly? Do I have other possible matches within a degree of certainty? i.e. if the user has entered the name phonetically, there's a chance I have something that is within a degree of certainty in the database already.

    I appreciate your input though, thank you.

    ReplyDelete
    Replies
    1. Why is the ball in your court to ensure the address is correct in the first place? If people don't know how to write their own address, they ought to be asking for something to go wrong. Not saying this is a pointless task, but it does seem like overkill. You've got a lot of work on your hands if you intend to conquer the mistakes of address input in one country alone.

      Delete
    2. It's not always about the user's needs... sometimes a system isn't there for the user's benefit at all but for the needs of the owner of the system.

      Delete
  5. I came across this today http://www.oasis-open.org/committees/ciq/ciq.html#6 and thought it might be helpful to you reseach and future proposal.

    ReplyDelete
  6. You may want to look a the ADIS standard from IdeaAlliance. A number of highly experienced address parsing experts worked for a long time to generate this standard and it is a very complete way to parse addresses. Plus it works with UPU approved methods to put the addresses back together in ways that each local postal authority supports. I realize that this is a postal centric view of addresses, but that is what people are using them for in many cases. If you tag a Lat/Long on to them as well you can handle the physical locations that dont map well to an address.

    http://www.idealliance.org/industry_resources/production_and_supply_chain/adis

    ReplyDelete
  7. Great insights!!

    Thanks for posting your ideas around flexible address storage. It is both an interesting and frustrating topic.

    For your recommended 12 columns, would you mind elaborating on:

    - recommended field lengths
    - do you recommend foreign key relationships for columns like Country, Governing District?
    - could you post some sample data of sufficient variance to exercise each column (over the data set)?

    ReplyDelete
  8. I think with address data you have to tailor the storage strategy to the lowest common denominator of the input you're going to receive. In the case of users entering their address details on a website, that's more than likely going to be some semi-illiterate, fat American dude bashing his chubby fingers against the keyboard. You might as well just accept the fact that you're rarely going to have clean data :)

    ReplyDelete
  9. In the US often we must specify both the County and State. I'd suggest that "Governing District" become two variables to allow that.

    ReplyDelete
  10. Thankyou! Well it is 2012 now and the mentioned field design is pretty much the same but geocoders became stronger and stronger. What we do on top is to get a normalized address from google (100.000/day) or yahoo (unlimited) geo api.
    Plus that we have the geonames.org in our database.

    ReplyDelete

There was an error in this gadget