Archive for February, 2012|Monthly archive page

More on the place matcher

I recently wrote a long comment about developing a place database, and I think it’s worth repeating here:

There are a lot of online gazetteers: http://www.alexandria.ucsb.edu/~lhill/dgie/DGIE_website/gaz_links.htm lists several.

I looked at a number of these when creating the place database for WeRelate.org, which is now available as a free download:

http://github.com/DallanQ/Places

It includes both current and historical places along with alternate names. Many places list both their historical and modern jurisdictional hierarchies, and many include coordinates.

* Geonames: GeoNames.org Lots of places, modern only (or mostly). Most places are geographic features like lakes and rivers, and places are in a flat hierarchy; that is, cities in England did not list the county they are in. Having a hierarchy is pretty important: how do you know which Sutton in England to match when the user says “Sutton, Bedfordshire, England”? There are a dozen different Suttons in their database for England, and you don’t have any way to determine which Sutton is in Bedfordshire except by calculating the shortest distance from each Sutton to the centroid listed for Bedfordshire, which is not very reliable. Because of the lack of hierarchy, I ended up not using this resource. I wasn’t aware that they had included historical support, though it still appears to be in the very early stages. They’ve added an “isHistorical” flag for names that are no longer used, and are considering adding fromPeriod and toPeriod. Until they add jurisdictional hierarchies to their database, though, they won’t have even scratched the surface of historical issues.

* Getty Thesaurus of Geographic Names: http://www.getty.edu/research/tools/vocabularies/tgn/ Smaller than Geonames, with around 1.7M names for 992K distinct places. Mostly modern, though it has more historical places than Geonames. Most places are geographic features, and places are in a hierarchy(!). The data is compiled from about a dozen different sources: mainly NGA/NIMA, but also Rand McNally, Encyclopedia Britannica, and the Domesday Book. It generally lists places under the jurisdictional hierarchy they appeared in about 12 years ago. I got permission to include their populated places and political jurisdictions in the WeRelate place database. More information: http://www.getty.edu/research/tools/vocabularies/tgn/about.html and http://www.getty.edu/research/tools/vocabularies/tgn/faq.html

* Alexandria Digital Library Gazetteer: http://www.alexandria.ucsb.edu/gazetteer/ContentStandard/version3.2/GCS3.2-guide.htm I obtained a license to this as well, but after reviewing it, it seemed similar to Getty so I did not use it.

* Family History Library Catalog: The only resource I was able to find with historical places. Most (but not all) places are listed according to the jurisdictions they were in just prior to WWI. There are some duplicates: some places listed under Galicia are repeated under Poland, for example. I crawled the FHLC place database back in 2005 and included it in the WeRelate place database.

* Wikipedia: Both current and historical places. A terrific source of information, but difficult to extract. I extracted tens of thousands of places (certainly not all of them, but the ones that had decent templates for extraction) back in 2005 and included them in the WeRelate place database. A side benefit of incorporating Wikipedia is that the database includes links back to the Wikipedia articles, which often have helpful historical information. (The links aren’t included in the extract on GitHub, though; I’ll fix this shortly.)

* Freebase.com: http://www.freebase.com/view/location An updated database of places they’ve extracted from Wikipedia. It includes about 80,000 current and historical places. I’d love to integrate this into the WeRelate place database, though it will be a big project (see below).

* OpenStreetMap: http://www.openstreetmap.org/ Has coordinate information for modern places, and places are arranged into a hierarchy(!). I’d like to use this to fill in missing coordinates in the place database at WeRelate.org.

* Statoids.com: http://statoids.com/ Not a place database per se, but a fantastic source of information on how jurisdictions have changed over time. I used this, Wikipedia, and Encyclopedia Britannica when compiling the WeRelate place database (see below).
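To make the hierarchy point from the Geonames entry concrete, here is a minimal sketch of how a parent link on each place lets you disambiguate “Sutton, Bedfordshire, England”. The `Place` class, the `match_place` function, and the sample rows are all made up for illustration; this is not the WeRelate schema or its matcher.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Place:
    id: int
    name: str
    parent_id: Optional[int]  # None for a top-level place


# Hypothetical sample rows, not taken from any real gazetteer extract.
PLACES = [
    Place(1, "England", None),
    Place(2, "Bedfordshire", 1),
    Place(3, "Surrey", 1),
    Place(4, "Sutton", 2),
    Place(5, "Sutton", 3),
]


def ancestor_names(place, by_id):
    """Lower-cased names of the place's ancestors, nearest first."""
    names = []
    current = by_id.get(place.parent_id)
    while current is not None:
        names.append(current.name.lower())
        current = by_id.get(current.parent_id)
    return names


def match_place(text, places):
    """Return every place whose own name matches the first level of the
    text and whose ancestors contain the remaining levels in order.
    Levels the user left out (e.g. the county) are allowed."""
    by_id = {p.id: p for p in places}
    levels = [part.strip().lower() for part in text.split(",") if part.strip()]
    if not levels:
        return []
    matches = []
    for place in places:
        if place.name.lower() != levels[0]:
            continue
        remaining = iter(ancestor_names(place, by_id))
        # "level in remaining" consumes the iterator, so order is enforced.
        if all(level in remaining for level in levels[1:]):
            matches.append(place)
    return matches
```

With a flat list of names, both Suttons would be equally good candidates. With the parent links, the full string picks out one of them, while a bare “Sutton” or “Sutton, England” still returns both and has to be resolved some other way.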

The big challenge when creating a place database is not getting the data; as you can see, there are many sources for that. It’s merging data together from multiple sources *without creating duplicates*. You want to say that City X in Historical Province Y from the FHLC is the same as City X’ in Modern State Z in Wikipedia. Merging duplicate places is generally harder than merging duplicate people, because place names can change dramatically after wars. Even merging Getty and Wikipedia was challenging, because of the changes European countries have made to their jurisdictional hierarchies over the past 10 years due to the EU. I spent months merging Getty, FHLC, and Wikipedia together, and WeRelate users have spent the past seven years continuing to clean it up and organize it better afterward. If you’re going to try to create your own current+historical place database, take the merge time into account. Or just use the free one I posted on GitHub.
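As a rough illustration of the merge problem, one common first pass is to flag two records from different sources as *candidate* duplicates when they share any name (primary or alternate) and, if both have coordinates, lie close together. The sketch below is an assumption about how such a filter might look, not a description of how the WeRelate merge was actually done; the `Record` class, the 10 km threshold, and the sample records are all hypothetical.

```python
import math
import unicodedata
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    source: str
    name: str
    alt_names: List[str] = field(default_factory=list)
    lat: Optional[float] = None
    lon: Optional[float] = None


def normalize(name):
    """Strip combining accents and case so 'Lwów' and 'lwow' compare equal."""
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold().strip()


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def is_candidate_duplicate(a, b, max_km=10.0):
    """True if the records share a name and are not known to be far apart."""
    names_a = {normalize(n) for n in [a.name, *a.alt_names]}
    names_b = {normalize(n) for n in [b.name, *b.alt_names]}
    if not names_a & names_b:
        return False
    if None in (a.lat, a.lon, b.lat, b.lon):
        return True  # nothing to check against; leave it for human review
    return haversine_km(a.lat, a.lon, b.lat, b.lon) <= max_km


# Hypothetical records for illustration only.
historical = Record("FHLC", "Lemberg", lat=49.84, lon=24.03)
modern = Record("Wikipedia", "Lviv", ["Lemberg", "Lwów"], 49.8397, 24.0297)
elsewhere = Record("Getty", "Lemberg", lat=49.0, lon=8.0)

print(is_candidate_duplicate(historical, modern))
print(is_candidate_duplicate(historical, elsewhere))
```

A filter like this only narrows the list. It misses renamed places that have no alternate name recorded, and it says nothing about reconciling the historical and modern jurisdictions above each place, which is where most of the merge time goes.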

I recently matched 7.5M places appearing in the 7K GEDCOM files submitted to WeRelate over the past five years to see what kinds of problems were occurring most frequently:

* We don’t have comprehensive coverage for US townships. This is on my short-list of things to add.
* We still have duplicate places in Eastern Europe due to FHLC having duplicates that were not caught.
* We still don’t have all of the historical and modern places in Europe merged (though many have been merged).
* We don’t have all of the historical jurisdictions listed.
* We’re missing some places (though not that many).
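If you want to run a similar check on your own files, a simple starting point is to pull the place strings out of each GEDCOM (they appear on lines with the `PLAC` tag) and count them, so the most frequent unmatched places get looked at first. This is a minimal sketch; the `place_counts` function and the sample lines are hypothetical, and it ignores details such as line continuations.

```python
from collections import Counter


def place_counts(gedcom_lines):
    """Count how often each PLAC value appears in an iterable of GEDCOM lines."""
    counts = Counter()
    for line in gedcom_lines:
        # A place line looks like: "2 PLAC Sutton, Bedfordshire, England"
        parts = line.strip().split(" ", 2)
        if len(parts) == 3 and parts[1] == "PLAC":
            counts[parts[2].strip()] += 1
    return counts


# Hypothetical sample lines for illustration only.
sample = [
    "0 @I1@ INDI",
    "1 BIRT",
    "2 PLAC Sutton, Bedfordshire, England",
    "1 DEAT",
    "2 PLAC Sutton, Bedfordshire, England",
    "0 @I2@ INDI",
    "1 BIRT",
    "2 PLAC Springfield",
]

for place, count in place_counts(sample).most_common():
    print(count, place)
```

Feeding each distinct string to a matcher and keeping the ones that fail gives a list you can sort by count and review by hand.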

I just posted this a couple of weeks ago, so there may still be some rough edges. I know of at least one other organization that’s using it already, and I’m talking with several other organizations that are interested. I’m making it freely available so that others don’t have to go through the pain that I did.
