Collaborative Manuscript Transcription: July 2007

Wednesday, July 25, 2007

Money: Possible Revenue Sources

Let's return to the subject of money. If I thought the market for collaborative manuscript transcription software were a lucrative one, I might launch a business based on selling FromThePage (If you wonder why I keep referring to different software products, it's because I'm trying out different names to see how they feel. See my naming post for more details.)

I sincerely doubt that the market could sustain even a single salary, but it's still a useful exercise to list revenue opportunities for a hypothetical "Renan Systems, Inc."

Fee-for-hosting (including microsponsorships)
This is the standard model for the companies selling what we used to call Application Service Providers and now call something high-falutin' like "Software as a Service." In this model, a user with deep pockets pays a monthly fee to use the software. Their data is stored by the host (Today's Name, Inc.), and in addition to whatever access they have, the public can see whatever the client wants them to see.

This is the most likely model I see for FromThePage. Work owners would upload their documents and be charged some monthly or yearly rate. Neither scribes nor viewers would pay anything — all checks would come from the work owner.

There's one exception to this single-payer-per-work rule, and it's a pretty neat one. One of the panels at South by Southwest this year discussed microsponsorships (see podcast here, beginning at around 8 minutes). This is an idea that's been used to allow an audience to fund an independent content provider: you like X's podcasts, so you donate $25 and your name appears beside your favorite podcast. The $25 goes to X, presumably to support X's work.

The nature of family history suggests microsponsorships as an excellent way for a work owner to fund a site. The people involved in a family history project have amazingly diverse sets of skills and resources. One person may have loads of interpretive context but poor computer skills and a dial-up connection. Another may have loads of time to invest, but no money. And often there is a family member with great interest but little free time, who's willing to write a check to see a project through. Microsponsorship allows that last person to enable the previous two, advancing the project for them all.

Donations
Hey, it works for Wikipedia!

License to other hosting providers
Perhaps I don't want to get into the hosting business at all. After all, technical support and billing are hassles, and I've never had to deal with accounting before. If there were commercial value to FromThePage, another hosting company might buy the software as an add-on.

Where this really does become a possiblity is for the big genealogy sites to add FromThePage to their existing hosting services. I can see licensing the software to one of them simultaneously with hosting my own.

Affiliate sales
The only candidate for this is publish-on-demand printing, a service I'd like to offer anyway. For finished manuscript transcriptions, in addition to a PDF download, I'd like to offer the ability to print the transcription and have it bound and shipped. Plenty of services exist to do this already given a PDF as an input, so I can't imagine it would be too hard to hook into one of them.

Advertising
Ugh. It's hard to see much commercial value to an advertiser, unless the big genealogy sites have really impressive, context-sensitive APIs. And besides, ugh.

Thursday, July 12, 2007

Feature: Subect Graphs

Inspired by Bill Turkel's bibliography clusters, I spent the last week playing with GraphViz. Here is a relatedness graph generated by the subject page for my great grandfather:

One big problem with this graph is that it shows too many subjects. Once I add subject categorization, a filtered graph becomes a way of answering interesting questions. Viewing the graph for "cold" and only displaying subjects categorized as "activities" could tell you what sort of things took place on the farm in winter weather, for example.

Another issue is that the graph doesn't effectively display information. Nodes vary based on size and distance. Size is merely a function of the label length, so is meaningless. Distance from the subject is the result of the "relatedness" between the node and the subject, measured by page occurrences. This latter metric is what I'm trying to calculate, but it's not presented very well. I'll probably try to reinforce the relatedness by varying color intensity using the Graphviz color schemes, or suppress the label length by forcing nodes into a fixed size or shape.

Of course, common page links are only one way to relate subjects. It's easier, if less interesting, to graph the direct link between one subject article and another. Given more data, I could also show relationships between users based on the pages/articles/works they'd edited in common, or the relationships between works based on their shared editors. A final thing to explore is the Graphviz ImageMap output format, which apparently results in clickable nodes.

I'll put up a technical post on the subject once I split the dot-generation out into Rails' Builder — currently it's just a mess of string concatenation.

Feature: Transcription Versions

Page/Subject Article Versions
Last week I added versioning to articles and pages. The goal was to allow access to previous edits via a version page akin to the MediaWiki history tab.

Gavin Robinson suggested a system of review and approval before transcription changes go live, but I really think that this doesn't fit well with my user model. For one thing, I don't expect the same kinds of vandalism problems you see in Wikipedia to affect FromThePage works much, since the editors are specifically authorized by the work owner. For another, I can't imagine the solo-owner/scribe would tolerate having to submit and approve each one of their edits for long. Finally, since this is designed for a loosely-coupled, non-institutional user community, I simply can't assume that the work owner will check the site each day to review and approve changes. Projects must be able to keep their momentum without intervention by the work owner for months at a time.

His concerns are quite valid, however. Perhaps an alternative approach to transcription quality is to develop a few more owner tools, like a bulk review/revert feature for contributions made since a certain date or by a certain user.

Work Versions
Later, I'll put up a technical post on how I accomplished this with Rails after_save callbacks, but for now I'd like to talk about "versions" of a perpetually-editable work. What exactly does this mean? If a user prints out or downloads a transcription between one change and the next, how do you indicate that?

To address this, I decided to add the concept of a work's "transcription version". This is an additional attribute of the work itself, and every time an edit is made to any one of the work's pages, the work itself has its transcription version incremented. By recording the transcription version of the work in the page version record as well, I should be able to reconstruct the exact state of the digital work from a number added to an offline copy of the work.

I decided on transcription_version as an attribute name because comments and perhaps subject articles may change independently of the work's transcribed text. A printout that includes commentary needs a comment_version as well as a transcription_version. The two attributes seem orthogonal, because two transcription-only prints of the same work shouldn't appear different because a user has made an unprinted annotation.

Sunday, July 1, 2007

UI and Other Fun Stuff

Sara and I spent about an hour yesterday talking about the Renan System's UI on our drive to Houston. For someone who is frightened and confused by user interface design, this turned out to be surprisingly pleasant. She's recommended a two-column design, since the transcription page is necessarily broken into one column for the transcription form and one column for the manuscript page image. In coding things like article links, I find that this layout is generalizable given the structure of most of my data: ordinarily, text (the "important" stuff) lives in the leftmost pane, while images, links, and status panels live in the right.

Forcing myself to think about what to put in the right pane turned into a brainstorming session. When a user views an article, it seems obvious that a list of pages that link to that article should be visible. But thinking "what else goes in the article's 'fun stuff' pane?" made me realize that there were all sorts of ways to slice the data. For example, I could display a list of articles mentioned most frequently in pages that mentioned the viewed article. Rank this by number of occurrences, or rank it by percentage of occurances, and you get very different results: "What does it mean if 'Franklin' only occurs with 'Edna', but never in any other context?" I could also display timelines of article links, or which scribe had contributed the most to the article.

Going through the same process for the work page, the page list page, the scribe or viewer dashboards and such has generated a month's worth of feature ideas. Maybe there's something to this UI design after all!

Collaborative Manuscript Transcription

Wednesday, July 25, 2007

Money: Possible Revenue Sources

Thursday, July 12, 2007

Feature: Subect Graphs

Feature: Transcription Versions

Sunday, July 1, 2007

UI and Other Fun Stuff

New Blog Posts are at FromThePage

Posts from the FromThePage Blog

Pages

Upcoming Conference Schedule

Past Conference Talks

Blog Archive

Subjects

Papers

Transcription Systems

Digital Family History