I’m launching a new website today: Genealogy Gophers. The goal is to be Google Books for genealogy:
- Completely free. The site is supported by ads and “Google Consumer Surveys” – people are asked to answer a few market-research questions once a day in order to read and download the books.
- Only genealogy-relevant books: 40,000 and growing. These out-of-copyright books have been obtained from FamilySearch, the Allen County Public Library, the Mid-Continent Public Library, and several other libraries.
- Advanced search technology allows searches for people: the names, dates, places, and relatives associated with them, not just words.
- Search results include snippets from the book pages so it’s easy to quickly scan the results and find the most-relevant ones.
The website is in beta. I soft-launched it at RootsTech and got some terrific feedback. We were able to find information on ancestors of 30-40% of the people who visited the booth. In a few cases it was information they’d never known before! I’ve spent the last two weeks improving the site and am finally ready to officially launch. Over the coming months I’ll continue to improve the search algorithms and add another 60,000 books.
I’m hoping this is a big hit. Free access to records in exchange for answering a few survey questions could open a new avenue for more subscription-free genealogy websites.
Here’s the press release. My good friend Bob Sherwin wrote it!
I saw nearly two dozen vendors at RootsTech last week, demoing websites or mobile apps or both. Most of them need to build the same basic framework: import a GEDCOM file, allow someone to add to the tree, list the people in the tree, create a pedigree view, etc. Once they have this basic framework in place, then they can start to work on whatever it is that makes their offering unique.
Passport (http://passportjs.org/) is an authentication library for node.js. Passport is similar to everyauth, but in my opinion it’s cleaner. The author, Jared Hanson, just added an authentication strategy for FamilySearch (https://github.com/jaredhanson/passport-familysearch).
What’s amazing, and the reason that I love being a developer, is that Jared added this out of the goodness of his heart after seeing a tweet that I sent to Tim Shadel about his pull request (https://github.com/bnoguchi/everyauth/pull/85) to add support for FamilySearch to everyauth. Tim’s pull request still hasn’t been integrated into everyauth, but Jared added support for FamilySearch into Passport in under a week.
I recently wrote a long comment concerning developing a place database and I think it’s worth repeating here:
There are a lot of online gazetteers: http://www.alexandria.ucsb.edu/~lhill/dgie/DGIE_website/gaz_links.htm lists several.
I looked at a number of these when creating the place database for WeRelate.org, which is now available as a free download:
It includes both current and historical places and alternate names; many places list both their historical and modern jurisdictional hierarchies, and many include coordinates.
* Geonames: GeoNames.org Lots of places, modern only (or mostly); most places are geographic features like lakes and rivers, and places are in a flat hierarchy — that is, cities in England do not list the county they are in. Having a hierarchy is pretty important: how do you know which Sutton in England to match when the user says “Sutton, Bedfordshire, England”? There are a dozen different Suttons in their database for England, and you don’t have any way to determine which one is in Bedfordshire, except by calculating the shortest distance from each Sutton to the centroid listed for Bedfordshire – not very reliable. Because of the lack of hierarchy, I ended up not using this resource. I wasn’t aware that they had added historical support, though it appears to still be in the very early stages. They’ve added an “isHistorical” flag for names that are no longer used, and are considering adding fromPeriod and toPeriod. Until they add jurisdictional hierarchies to their database, though, they won’t have even scratched the surface of historical issues.
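To make the flat-hierarchy problem concrete, here’s a small sketch of the centroid-distance fallback described above. The coordinates and the centroid here are rough, made-up values for illustration, not real gazetteer data:

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometers between two (lat, lon) points."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

# Hypothetical flat-hierarchy records: several Suttons with coordinates only,
# and no parent county to disambiguate them.
suttons = [
    {"id": 1, "lat": 52.11, "lon": -0.22},  # near Bedfordshire
    {"id": 2, "lat": 51.36, "lon": -0.19},  # Greater London area
    {"id": 3, "lat": 53.12, "lon": -2.79},  # Cheshire area
]
bedfordshire_centroid = (52.08, -0.42)  # approximate

# Without a hierarchy, the best we can do is pick the candidate nearest the
# county centroid -- which fails for places that sit near county borders.
best = min(
    suttons,
    key=lambda s: haversine_km(s["lat"], s["lon"], *bedfordshire_centroid),
)
print(best["id"])
```

Centroid distance happens to pick the right Sutton here, but a Sutton just across a county border could easily be closer to the centroid than the one the user meant.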
* Getty Thesaurus of Geographic Names: http://www.getty.edu/research/tools/vocabularies/tgn/ Smaller than Geonames, around 1.7M names for 992K distinct places, mostly modern, though more historical places than Geonames, most places are geographic features, places are in a hierarchy(!), data compiled from about a dozen different sources: mainly NGA/NIMA but also Rand McNally, Encyclopedia Britannica, Domesday book, generally lists places under the jurisdictional hierarchy they appeared in about 12 years ago. I got permission to include their populated places and political jurisdictions into the WeRelate place database. More information: http://www.getty.edu/research/tools/vocabularies/tgn/about.html and http://www.getty.edu/research/tools/vocabularies/tgn/faq.html
* Alexandria Digital Library Gazetteer: http://www.alexandria.ucsb.edu/gazetteer/ContentStandard/version3.2/GCS3.2-guide.htm I obtained a license to this as well, but after reviewing it, it seemed similar to Getty so I did not use it.
* Family History Library Catalog: The only resource I was able to find with historical places. Most (but not all) places are listed according to the jurisdictions they were in just prior to WWI. There are some duplicates: some places listed under Galicia are repeated under Poland, for example. I crawled the FHLC place database back in 2005 and included it in the WeRelate place database.
* Wikipedia: Both current and historical places. A terrific source of information, but difficult to extract. I extracted tens of thousands of places (certainly not all of them, but the ones that had decent templates for extraction) back in 2005 and included them in the WeRelate place database. A side benefit of incorporating Wikipedia is that the database includes links back to the Wikipedia articles, which often have helpful historical information. (Though the links aren’t included in the extract on GitHub; I’ll fix this shortly.)
* Freebase.com: http://www.freebase.com/view/location An updated database of places they’ve extracted from Wikipedia. Includes about 80,000 current and historical places. I’d love to integrate this into the WeRelate place database, though it will be a big project (see below).
* OpenStreetMap: http://www.openstreetmap.org/ has coordinate information for modern places, and places are arranged into a hierarchy(!). I’d like to use this to fill in missing coordinates in the place database at WeRelate.org.
* Statoids.com: http://statoids.com/ Not a place database per se, but a fantastic source of information on how jurisdictions have changed over time. I used this along with Wikipedia and Encyclopedia Britannica when compiling the WeRelate place database (see below).
The big challenge when creating a place database is not getting the data — as you can see, there are many sources for that. It’s merging data together from multiple sources *without creating duplicates*. You want to say that City X in Historical Province Y from the FHLC is the same as City X’ in Modern State Z in Wikipedia. Merging duplicate places is generally harder than merging duplicate people, because place names can change dramatically after wars. Even merging Getty and Wikipedia was challenging, because of the changes European countries have made to their jurisdictional hierarchies over the past 10 years due to the EU. I spent months merging Getty, FHLC, and Wikipedia together, and WeRelate users have spent the past seven years continuing to clean it up and organize it better afterward. If you’re going to try to create your own current+historical place database, take the merge-time into account. Or just use the free one I posted on github.
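A sketch of the kind of matching heuristic involved might look like the following. This is illustrative only, not the actual WeRelate merge code, and the records are made up; it shows both a case the heuristic handles and the kind of post-war name change that defeats it:

```python
def normalize(name):
    """Crude normalization for matching: lowercase, keep letters/digits/spaces."""
    return "".join(ch for ch in name.lower() if ch.isalnum() or ch == " ").strip()

def same_place(a, b):
    """Heuristic match: same normalized name plus at least one shared
    ancestor jurisdiction (historical or modern) between the hierarchies."""
    if normalize(a["name"]) != normalize(b["name"]):
        return False
    return bool({normalize(p) for p in a["hierarchy"]} &
                {normalize(p) for p in b["hierarchy"]})

# Historical (FHLC-style) record vs. modern (Wikipedia-style) record:
# the shared ancestor "England" lets the heuristic link them.
fhlc_sutton = {"name": "Sutton", "hierarchy": ["Bedfordshire", "England"]}
wiki_sutton = {"name": "Sutton",
               "hierarchy": ["Central Bedfordshire", "England", "United Kingdom"]}

# A renamed place whose jurisdictions also changed defeats the heuristic
# entirely -- this is where months of manual merge work come in.
fhlc_krakau = {"name": "Krakau", "hierarchy": ["Galicia", "Austria"]}
wiki_krakow = {"name": "Kraków", "hierarchy": ["Lesser Poland", "Poland"]}

print(same_place(fhlc_sutton, wiki_sutton))   # the easy case matches
print(same_place(fhlc_krakau, wiki_krakow))   # the hard case does not
```

The second pair is exactly the situation described above: neither the name nor any jurisdiction survives the border change, so automated matching has nothing to anchor on.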
I recently matched 7.5M places appearing in the 7,000 GEDCOMs submitted to WeRelate over the past five years to see what kinds of problems were occurring most frequently:
* We don’t have comprehensive coverage for US townships. This is on my short-list of things to add.
* We still have duplicate places in Eastern Europe due to FHLC having duplicates that were not caught.
* We still don’t have all of the historical and modern places in Europe merged (though many have been merged).
* We don’t have all of the historical jurisdictions listed.
* We’re missing some places (though not that many).
I just posted this a couple of weeks ago, so there may still be some rough edges. I know of at least one other organization that’s using it already, and I’m talking with several other organizations who are interested. I’m making it freely available so that others don’t have to go through the pain that I did.
I’ve developed an open-source place matcher for genealogy. It takes place texts provided in GEDCOM files and matches them to standardized place names. It’s pretty good. I’ll be giving a talk on it at RootsTech next week. The source code and database are freely available on GitHub.
Thanks go to Ryan Knight for creating a web application to demonstrate it.
I’ve developed an open-source name-variants database. We’re currently using it at WeRelate.org. This is a better algorithm than Soundex for matching variant names like Ann, Anna, Annah, Anne, Annie, etc. It results in a 28% reduction in missed variants compared to Soundex, based upon a set of 100,000 pairs of names provided by Ancestry.com. I’ll be giving a talk on it at RootsTech next week. The source code and database are freely available on GitHub.
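For context on why a variants database beats Soundex, here’s a standard American Soundex implementation alongside a tiny curated variant group. The variant set is illustrative, not the actual database. Soundex happily groups Ann/Anna/Annah/Anne/Annie, but it misses traditional nicknames like Nancy (historically a nickname for Ann), which a curated list catches:

```python
def soundex(name):
    """Standard American Soundex: first letter plus three digits."""
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    result = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        if ch in "hw":
            continue            # h and w are skipped without resetting prev
        code = codes.get(ch, "")
        if code and code != prev:
            result += code
        prev = code             # vowels reset prev (their code is "")
    return (result + "000")[:4]

# Illustrative curated variant group -- catches what Soundex misses.
ANN_VARIANTS = {"ann", "anna", "annah", "anne", "annie", "nancy"}

print(soundex("Ann"), soundex("Annie"))   # Soundex groups these together
print(soundex("Nancy"))                   # but codes this variant differently
print("nancy" in ANN_VARIANTS)            # the variants table catches it
```

Soundex can only group names that happen to share consonant codes; variants that arose as nicknames or translations need a learned or curated table.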
I just posted the first cut of an open-source GEDCOM parser. The parser parses GEDCOM files into a de facto object model, which is able to represent nearly all of the tag sequences found in real-world GEDCOM files. The object model includes common custom tags; other tags are represented as extensions. The object model has a JSON representation, and the toolkit includes a GEDCOM exporter. This makes it possible for anyone to read a GEDCOM file, manipulate its contents, save it to JSON, and export it back to a GEDCOM file, without loss of information for the vast majority of GEDCOMs.
Ryan Knight has created a demo server to show it off.
For more information, see the GitHub repo.
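To give a flavor of what such a parser does — this is a minimal sketch, not the actual object model from the repo — GEDCOM’s level-numbered lines map naturally onto a tree, which then serializes to JSON for free:

```python
import json

def parse_gedcom(lines):
    """Parse GEDCOM lines ("LEVEL [@XREF@] TAG [VALUE]") into a nested tree."""
    root = {"tag": "ROOT", "children": []}
    stack = [root]  # after popping, stack[level] is the parent for that level
    for raw in lines:
        parts = raw.strip().split(" ", 2)
        level = int(parts[0])
        if parts[1].startswith("@"):          # record line, e.g. "0 @I1@ INDI"
            node = {"xref": parts[1], "tag": parts[2], "children": []}
        else:                                  # ordinary line, e.g. "1 NAME ..."
            node = {"tag": parts[1],
                    "value": parts[2] if len(parts) > 2 else "",
                    "children": []}
        del stack[level + 1:]                  # pop back to this line's parent
        stack[level]["children"].append(node)
        stack.append(node)
    return root

sample = [
    "0 @I1@ INDI",
    "1 NAME John /Smith/",
    "1 BIRT",
    "2 DATE 1 JAN 1850",
]
tree = parse_gedcom(sample)
print(json.dumps(tree, indent=2))  # JSON view of the parsed structure
```

A real parser also has to handle CONC/CONT continuation lines, character encodings, and malformed files, which is where most of the effort goes.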
Read the article. Not the quote I would have chosen, but it emphasizes WeRelate’s focus on sources. WeRelate received a lot of new traffic and activity because of it.
Evidence and sources make genealogy both more fun and more accurate: finding your ancestors in sources is fun, and recording your sources makes your work verifiable. I believe future genealogy programs should focus more on finding and recording evidence.
Doing genealogy would be less expensive and more fun if more people got involved.
Recently an LDS Church leader encouraged more youth to work on genealogy, because they’re more familiar with the technology that is now practically required to do genealogy. But we don’t need to encourage youth to get on Facebook or play FarmVille, and even many older, non-technical people play social games. Why? Because they’re fun.
A big problem for people starting to do genealogy is that they don’t know how to begin. In games this is called “onboarding” – what happens during the first few minutes of play. Games focus on onboarding and on directing the player to increasingly challenging experiences that make playing the game fun. The current flock of genealogy programs are largely dressed-up database managers. They don’t tell you what to do next, and they don’t reward you when you do it. We need to make doing genealogy more like playing a game. The process of looking for a record, attaching it to the tree, getting rewarded, and getting direction on the next record to look for needs to be a core part of the experience.