Weston Ruter | Open Scriptures

BibleTech:2011 Talk: “Distributed model for interlinked text development”

Here is the talk proposal I submitted for BibleTech:2011, which will be held March 25-26 in Seattle:

Open Scriptures API: Distributed model for interlinked text development

As part of the ongoing effort at Open Scriptures to produce an open platform for developing scriptural data and its applications, work continues on the most fundamental layer: the representation of the scriptural text. This talk will look at a normalized way to store text in a relational database as tokens and structures, and at how such a text can be versioned as it undergoes continuous improvement (e.g. while making a translation or creating a critical text). Because the text can exist in multiple revisions or editions, it can also be branched to introduce variant or alternate readings, and it can be forked to create a new derivative work linked back to the original. (Much of this applies lessons from the Git distributed version control system.) In addition to links between editions of a text, the talk will examine the representation of links between different texts (interlinearization), first at the level of a single scripture server and then in the context of a distributed network of scripture servers, all of which hold texts that may be under development. It will also consider how the interconnections between those texts can be maintained to enable scriptural linked data.

This is a continuation of last year’s talk, Open Scriptures API: Unified Web Service for Scriptural Linked Data.
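To make the idea of storing a text "as tokens and structures" a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not the Open Scriptures schema: the table names, column names, and helper functions are all assumptions made for illustration, and a fork simply copies its tokens into a new edition that points back at its parent, which is far cruder than what Git does.

```python
import sqlite3

# Hypothetical schema: every name here is illustrative, not the project's own.
SCHEMA = """
CREATE TABLE edition (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    parent_id INTEGER REFERENCES edition(id)  -- NULL for a root text, set for a revision or fork
);
CREATE TABLE token (
    id         INTEGER PRIMARY KEY,
    edition_id INTEGER NOT NULL REFERENCES edition(id),
    position   INTEGER NOT NULL,  -- order of the token within its edition
    data       TEXT NOT NULL      -- the word or punctuation mark itself
);
CREATE TABLE structure (
    id             INTEGER PRIMARY KEY,
    edition_id     INTEGER NOT NULL REFERENCES edition(id),
    kind           TEXT NOT NULL,     -- e.g. 'verse', 'paragraph', 'quote'
    ref            TEXT NOT NULL,     -- an identifier such as 'Gen.1.1'
    start_position INTEGER NOT NULL,  -- first token covered (inclusive)
    end_position   INTEGER NOT NULL   -- last token covered (inclusive)
);
"""


def load_edition(conn, name, words, structures, parent_id=None):
    """Store one edition: its tokens in order, plus structures that point at token ranges."""
    cur = conn.execute(
        "INSERT INTO edition (name, parent_id) VALUES (?, ?)", (name, parent_id)
    )
    edition_id = cur.lastrowid
    conn.executemany(
        "INSERT INTO token (edition_id, position, data) VALUES (?, ?, ?)",
        [(edition_id, i, word) for i, word in enumerate(words)],
    )
    conn.executemany(
        "INSERT INTO structure (edition_id, kind, ref, start_position, end_position)"
        " VALUES (?, ?, ?, ?, ?)",
        [(edition_id, *s) for s in structures],
    )
    return edition_id


def structure_text(conn, edition_id, ref):
    """Rebuild the text of one structure by joining the tokens it covers."""
    row = conn.execute(
        "SELECT start_position, end_position FROM structure"
        " WHERE edition_id = ? AND ref = ?",
        (edition_id, ref),
    ).fetchone()
    if row is None:
        raise KeyError(ref)
    rows = conn.execute(
        "SELECT data FROM token WHERE edition_id = ? AND position BETWEEN ? AND ?"
        " ORDER BY position",
        (edition_id, row[0], row[1]),
    )
    return " ".join(r[0] for r in rows)


def lineage(conn, edition_id):
    """Follow parent links from an edition back to the root text."""
    names = []
    while edition_id is not None:
        name, edition_id = conn.execute(
            "SELECT name, parent_id FROM edition WHERE id = ?", (edition_id,)
        ).fetchone()
        names.append(name)
    return names


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    verse = [("verse", "Gen.1.1", 0, 4)]
    base = load_edition(conn, "base", "In the beginning God created".split(), verse)
    fork = load_edition(conn, "fork", "In the beginning God made".split(), verse,
                        parent_id=base)
    print(structure_text(conn, fork, "Gen.1.1"))
    print(lineage(conn, fork))
```

Because verses and other structures only point at token positions, a derivative edition can change a word without touching the original, and the parent link keeps the two connected.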

Linked Data as the Core Product

There’s been some recent buzz surrounding Bob Pritchett’s talk Network Effects Support Premium Pricing at the O’Reilly TOC Conference. The exciting thing about Bob’s message is that it is already at the core of what Open Scriptures is about. As Sean Boisen quotes from the Huffington Post:

When you purchase a book from [Logos], you’re not just getting a static ebook, you’re buying into a dynamic, integrated online application environment that becomes richer with each new publication, and with each new member to their community.

Further, Antoine Wright writes in The Network Effects of Bible Software:

Wish that MMM could take credit for this line of thought, but really, this is where mobile and web are going. The idea is that the effects of mature networks and platforms are going to turn traditional models of software ownership on its head. Those companies who lead or adapt quickly to this trend will find the business side of the connected economy easier to deal with. Those who wish to lock people into the former model will have a harder time growing marketshare, and might find their content – while the same as a network/platform – diminished in value because it cannot be extended by the user or user communities to draw even more relevance and value from it. Get your networks/platforms/apps ready, things are changing.

So the key offering of Logos Bible Software is not simply a collection of electronic resources, but rather it is the network of resources; that is, their key product is Linked Data, which is more than the sum of its parts. This is why Logos has the edge on all of the free resources out there: they have created links between these free resources, and it is these proprietary links that they sell as their product. Each link from a data point becomes a connection, a bridge that opens it up to a network of related information which adds an immeasurable amount of value to the data. So like Logos, the Open Scriptures project seeks to forge links between resources to create a network of data. Our aim diverges from Logos in that we want to make these links openly available for free under the Creative Commons Share-Alike license, and to provide a standardized API with which to develop applications on top of the linked scriptural data. Our goal is not to destroy Logos’ core product of linked data, but to provide a core subset of linked scriptural data that can be used to power applications of scripture, and to do so embracing the Web as the open platform.

“An Economic Argument for Free Primary Data”

Our colleague Efraim Feinstein at the Open Siddur Project wrote an excellent blog post on “An Economic Argument for Free Primary Data”. Here are the introductory paragraphs:

There are two principles on which the success of data on the contemporary web rests: the web makes content available, and it adds value to that content by linking it to other related information.

When considering bringing old content online, both of these aspects are important. A first level of digitization involves simply making data available. Google Books and Hebrewbooks.org work at this level, providing PDFs and/or OCR-ed transcriptions of the material. A second level of digitization involves semantic linkage of the data, both internal to the site and external to the site. The Open Siddur Project, Tagged Tanakh and Open Scriptures digitize at the semantic level. This second-level digitization is required to do all of the cool things we expect to be able to do with online texts: click on a word and find its definition or grammatical form, find the source of a passage in one text in another text, find how the text has evolved historically, etc. Even the simplest form of a link: a reference from another site, requires some kind of internal division.

Digitization that takes advantage of the web therefore requires a number of steps: (1) getting the basic text online, (2) getting it in an addressable form (to make it more like typed text, instead of a picture of a page), (3) assuring the text’s accuracy, and (4) marking it up for semantic linkage. Some of these steps, or parts of them can be done automatically, but, overall, they require some degree of intelligent input. Even step 1, which is primarily mechanical in nature, requires design of the procedures.

I hope that this outline of the required steps to getting a text online suggests that the most expensive part of making content available is human labor — it takes time to do it, and it takes even more time to do it right.

Continue reading the rest of the post!

Stand-off Markup in OSIS

I have been researching stand-off markup and discussing (and discussing) its applicability on osis-users as a normalized way for OSIS to represent scripture in a single consistent XML structure. I was initially introduced to stand-off markup by Efraim Feinstein of the Open Siddur project and then further by James Tauber (1, 2, 3, 4, 5). I just came across the following quote in TEI P5: Non-hierarchical Structures:

It has been noted that stand-off markup has several advantages over embedded annotations. In particular, it is possible to produce annotations of a text even when the source document is read-only. Furthermore, annotation files can be distributed without distributing the source text. Further advantages mentioned in the literature are that discontinuous segments of text can be combined in a single annotation, that independent parallel coders can produce independent annotations, and that different annotation files can contain different layers of information. Lastly, it has also been noted that this approach is elegant.

But there are also several drawbacks. First, new stand-off annotated layers require a separate interpretation, and the layers — although separate — depend on each other. Moreover, although all of the information of the multiple hierarchies is included, the information may be difficult to access using generic methods.

In the current OSIS schema, one structure (e.g. verse, quote, or paragraph) has to be chosen as primary, leaving the others to be represented with milestoned elements (actually, only verses can currently be chosen as primary); stand-off markup would allow overlapping hierarchies of verses, quotes, and paragraphs to all be on equal footing without having to choose one as primary. Is this good for OSIS? Please join in on the discussions at osis-users!
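To illustrate the overlap problem described above, here is a small Python sketch of stand-off annotation. It is not OSIS or TEI syntax: the `Span` class, the helper functions, the sample sentence, and the labels `v1`, `v2`, `q1`, and `p1` are all invented for the example. The text is kept as a flat list of tokens, and each verse, quote, or paragraph is recorded separately as a range of token indexes.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Span:
    kind: str   # 'verse', 'quote', 'paragraph', ...
    label: str  # hypothetical identifier for the annotation
    start: int  # index of the first token covered
    end: int    # index one past the last token covered


def text_of(tokens, span):
    """Recover the text an annotation covers from the shared token list."""
    return " ".join(tokens[span.start:span.end])


def crosses(a, b):
    """True when two spans overlap but neither one contains the other."""
    overlap = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return overlap and not (a_in_b or b_in_a)


def crossing_pairs(spans):
    """List the label pairs that could not both be ordinary nested XML elements."""
    return [(a.label, b.label) for a, b in combinations(spans, 2) if crosses(a, b)]


# Invented sample sentence: a quotation that starts in one verse and ends in the next.
TOKENS = "He said Go now and do not return So he went".split()

SPANS = [
    Span("verse", "v1", 0, 4),       # He said Go now
    Span("verse", "v2", 4, 11),      # and do not return So he went
    Span("quote", "q1", 2, 8),       # Go now and do not return
    Span("paragraph", "p1", 0, 11),  # the whole passage
]

if __name__ == "__main__":
    print(text_of(TOKENS, SPANS[2]))
    print(crossing_pairs(SPANS))
```

The quote crosses both verses, so in embedded markup one of the two hierarchies would have to be milestoned. With stand-off spans, verses, quotes, and paragraphs are simply independent lists of ranges over the same tokens.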

The Directory and Fostering Collaboration

I have noticed that we in the Open community lack efficient communication and project visibility. Maybe communication is just difficult to begin with, and I bet we do better than purely proprietary, top-secret enterprises. In any case, I have made the mistake in the past of setting out in isolation to work on some exciting, worthwhile project, only to stop, look around at the community, and find others who were actively working on, or had already completed, the very thing I wanted to do! Just think how much more could be done if, instead of working independently on parallel projects, we came alongside each other to consolidate efforts. For this to happen, we have to know what everyone else is doing, which is why we have recently launched the Open Scriptures Directory, which seeks to comprehensively list open projects involving scriptural data.

In addition to the Directory, the Google Group is also serving as a way for people to collaborate. This week there has been an active thread discussing collaboration on Strong’s Dictionary data. David Troidl expressed the general sentiments well:

The idea of collaboration is great. I’ve already mentioned to Darrell that just the moral support is heartening. It’s nice to know, after all this time working alone, that there really are others out there who share the same interest. […] I’ve been on my own all this time

I think it is essential to seek out others who are interested in the same things so that we can come together not only to build upon what each other has done, but to consolidate what we are doing as much as possible so that both duplicate efforts and fragmented data can be avoided. Not only will this lead to more productivity, but it will also serve to build community among those who share a common vision for open scriptural resources.

If you have a project that is not yet listed in the directory, please suggest the link. Please also join the Google Group!