Wednesday, October 3, 2012

Building a Structured Transcription Tool with FreeUKGen

I'm currently working with FreeUKGen--the charity behind the genealogy database FreeBMD--to build a general-purpose, open-source tool for crowdsourced transcription of structured manuscript data into a searchable database.

We're basing our system on Scribe, the tool the Citizen Science Alliance developed for What's the Score at the Bodleian, which grew out of their experience building OldWeather and other citizen science sites.

We are building the following systems:
  1. A new tool for loading image sets into the Scribe system and attaching them to data-entry templates. 
  2. Modifications to the Scribe system to handle our volunteer organization's workflow, plus some usability enhancements.
  3. A publicly-accessible search-and-display website to mine the database created through data entry. 
  4. A reporting, monitoring, and coordinating system for our volunteer supervisors. 
We also plan to add support for geocoding during transcription and GIS support within the search-and-display system. Initial development of item 1 is mostly finished, and we are now moving on to items 2 and 3.

Although this tool is focused on parish registers and census forms, we intend to create a general-purpose system for any tabular or structured data. Scribe's data-entry templates are defined in its database, and different templates can be assigned to different images or sets of images. As a result, we can use a simple template for a 1750 register of burials or a much more complex template for an 1881 census form. Since each transcribed record is linked to the section of the page image it represents, we can display the facsimile of a record alongside its transcript in a list of search results, or get fancy and pre-populate a transcriber's form with frequently repeated information like months or birthplaces.
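
To make the template idea concrete, here is a minimal sketch in Python of how templates, images, and transcribed records might relate. It is an illustration only, not Scribe's actual schema or code: every class, field, and function name below (Template, ImageSet, Record, prefill, and the sample field names) is a hypothetical stand-in.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Template:
    """A data-entry template: a named list of fields (hypothetical model)."""
    name: str
    fields: List[str]


@dataclass
class Image:
    """A page image; it may carry its own template override."""
    filename: str
    template: Optional[Template] = None


@dataclass
class ImageSet:
    """A set of images sharing a default template."""
    title: str
    default_template: Template
    images: List[Image] = field(default_factory=list)

    def template_for(self, image: Image) -> Template:
        # A template assigned to a single image wins over the set default.
        return image.template or self.default_template


@dataclass
class Record:
    """One transcribed row, linked to the region of the page it came from."""
    image: Image
    region: Tuple[int, int, int, int]  # x, y, width, height in pixels
    values: Dict[str, str]


def prefill(previous: Record, repeated: List[str]) -> Dict[str, str]:
    """Copy frequently repeated values from the previous record into a new form."""
    return {k: v for k, v in previous.values.items() if k in repeated}


# Sample data; the field names are invented for illustration.
burials_1750 = Template("burials_1750", ["burial_date", "forename", "surname"])
census_1881 = Template(
    "census_1881",
    ["address", "forename", "surname", "relation", "age", "occupation", "birthplace"],
)

page1 = Image("register_p1.jpg")
page2 = Image("census_p1.jpg", template=census_1881)
register = ImageSet("Parish register", burials_1750, [page1, page2])

first = Record(page2, (40, 120, 900, 32),
               {"forename": "Ann", "surname": "Lee", "birthplace": "Leeds"})
next_form = prefill(first, ["birthplace"])
```

Because each Record keeps both its image and its region, a search-results page could crop that region from the page image and show it next to the transcript.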

Under the guidance of Ben Laurie, the trustee directing the project, we are committed to open source and open data.  We're releasing the source code under an Apache license and planning to build API access to the full set of record data.

We feel that the more the merrier in an open-source project, so we're looking for collaborators, whether they contribute code, funding, or advice.  We are especially interested in collaborators from archives, libraries, and the genealogy world.

2 comments:

Justin said...

BTW, I'm excited about what you're doing. Do you have a link to a working version?

Ben W. Brumfield said...

Thanks, Justin. I should have an invitation for the rootsdev group put together this afternoon, as soon as I get my ducks in a row license-wise.

Do you think that a demo on a rootsdev Google Hangout would be a good idea?