Open Scriptures

BibleTech:2011 Talk: “Distributed model for interlinked text development”

Here is the talk proposal I submitted for BibleTech:2011, which will take place March 25-26 in Seattle:

Open Scriptures API: Distributed model for interlinked text development

As part of the continuing effort at Open Scriptures to produce an open platform for the development of scriptural data and its applications, work continues on the most fundamental layer: the representation of the scriptural text. This talk will look at a normalized way to store text in a relational database as tokens and structures, and at how such a text can be versioned as it undergoes continuous improvement (e.g. making a translation or creating a critical text). Because the text can exist in multiple revisions or editions, it can also be branched to introduce variant or alternate readings, and forked to introduce a new derivative work linked back to the original. (Much of this applies lessons from the Git distributed version control system.) In addition to links between text editions, the representation of links between different texts (interlinearization) will be examined, both at the level of a single scripture server and in the context of a distributed network of scripture servers, all holding texts that are potentially undergoing development. The talk will also consider how the interconnections between these texts can be maintained to enable scriptural linked data.
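
To make the "tokens and structures" idea concrete, here is a minimal sketch of what such a model might look like. The class names, fields, and the `text_of` helper are illustrative assumptions, not the actual Open Scriptures schema, and an in-memory list stands in for the relational database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    # Hypothetical fields: the real schema may differ.
    position: int  # order of the token within its text
    data: str      # the token's characters (a word, whitespace, punctuation)


@dataclass(frozen=True)
class Structure:
    # A structure (verse, paragraph, ...) is a labeled span over tokens.
    kind: str   # e.g. "verse" or "paragraph"
    label: str  # a human-readable name for the span
    start: int  # position of the first token covered
    end: int    # position of the last token covered (inclusive)


def text_of(structure, tokens):
    """Rebuild the text covered by a structure from the token sequence."""
    covered = [t for t in tokens if structure.start <= t.position <= structure.end]
    return "".join(t.data for t in sorted(covered, key=lambda t: t.position))


tokens = [Token(i, s) for i, s in enumerate(["In", " ", "the", " ", "beginning"])]
phrase = Structure("phrase", "example", 0, 4)
print(text_of(phrase, tokens))
```

Because structures only point at token positions, several overlapping structures (verses, paragraphs, quotations) can describe the same tokens without duplicating the text.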

This is a continuation of last year’s talk, Open Scriptures API: Unified Web Service for Scriptural Linked Data.

Cryptographic hashes and RESTful URIs

In a recent post to the Open Scriptures mailing list, it was suggested that we use MD5 (or another cryptographic hash) to generate unique IDs for each token. (A “token” is the fundamental unit of text, most often a word, in our API database models.) Today we discussed the implementation of this on IRC, and it was fairly stimulating.

First of all, MD5 is broken and deprecated, because collisions (two different pieces of data that result in the same hash) can be produced for it in practice. Since we will be dealing with millions of tokens, we decided not to test our luck, unlikely though a problem may be. SHA-256 has no known collisions, so we decided it was best to use that algorithm.
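
For a rough sense of how unlikely an accidental collision is, here is a back-of-the-envelope sketch using the standard birthday approximation. The figure of ten million tokens and the function name are assumptions for illustration. Note that this only estimates chance collisions between random-looking hashes; MD5's real weakness is that collisions can be constructed deliberately.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of num_items random hashes collide.

    Birthday approximation: p ~ 1 - exp(-n * (n - 1) / 2^(bits + 1)).
    """
    exponent = -num_items * (num_items - 1) / 2.0 ** (hash_bits + 1)
    # expm1 keeps precision for the tiny values involved here.
    return -math.expm1(exponent)


# MD5 digests are 128 bits long; SHA-256 digests are 256 bits long.
for bits in (128, 256):
    print(bits, collision_probability(10_000_000, bits))
```

Both probabilities are vanishingly small, so the case for SHA-256 rests on resistance to deliberate collisions rather than on chance.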

SHA-256 is implemented in hashlib, part of Python’s standard library, so that is good. For example:

>>> import hashlib
>>> hashlib.sha256(b"Hello world!").digest()
b'\xc0S^K\xe2\xb7\x9f\xfd\x93)\x13\x05Ck\xf8\x891NJ?\xae\xc0^\xcf\xfc\xbb}\xf3\x1a\xd9\xe5\x1a'

Needless to say, such a digest would not be very good for use in a RESTful URI scheme. So, hashlib also offers a hexadecimal option:

>>> hashlib.sha256(b"Hello world!").hexdigest()
'c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a'

That is still not the best, since that makes for a very long string. So, we have the option of using base64 encoding:

>>> import base64
>>> base64.b64encode(hashlib.sha256(b"Hello world!").digest())
b'wFNeS+K3n/2TKRMFQ2v4iTFOSj+uwF7P/Lt98xrZ5Ro='

That is shorter, but it includes the “/” and “+” characters, which are a no-no for URI design. Luckily, Python’s base64 module includes a function for this exact purpose:

>>> base64.urlsafe_b64encode(hashlib.sha256(b"Hello world!").digest())
b'wFNeS-K3n_2TKRMFQ2v4iTFOSj-uwF7P_Lt98xrZ5Ro='

So that is safe for URIs, but being case-sensitive and containing easily confused characters makes it less than ideal to work with. So, base32 to the rescue:

>>> base64.b32encode(hashlib.sha256(b"Hello world!").digest())
b'YBJV4S7CW6P73EZJCMCUG27YREYU4SR7V3AF5T74XN67GGWZ4UNA===='

There you have it: shorter than hex, and easier to work with than base64. If we end up using cryptographic hashes as unique token IDs, I’m pretty sure this is how we’ll do it.
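
To put numbers on that comparison, here is a small sketch that measures each encoding of the same digest and wraps the base32 approach in a helper. The `token_id` name is hypothetical, and stripping the “=” padding and lowercasing the result are assumptions about what a URI-friendly ID might look like, not decisions the project has made.

```python
import base64
import hashlib


def token_id(data: bytes) -> str:
    """Return base32 of the SHA-256 digest, without padding, in lowercase.

    Hypothetical helper: padding removal and lowercasing are assumptions.
    """
    digest = hashlib.sha256(data).digest()
    return base64.b32encode(digest).decode("ascii").rstrip("=").lower()


digest = hashlib.sha256(b"Hello world!").digest()
lengths = {
    "hex": len(digest.hex()),
    "base64": len(base64.urlsafe_b64encode(digest)),
    "base32": len(base64.b32encode(digest)),
}
print(lengths)
print(token_id(b"Hello world!"))
```

With padding included, base32 comes to 56 characters against 64 for hex and 44 for base64, so it sits between the two in length while staying case-insensitive.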

The discussion around using cryptographic hashes as unique identifiers in our models is ongoing. Essentially we need to decide how best to make a unique hash for each token. Please join us in #openscriptures on irc.freenode.net with any input.

Morphological Hebrew Bible Version 1.0

We are pleased to announce the availability of the Open Scriptures Morphological Hebrew Bible version 1.0 on the CrossWire module servers. This new Hebrew Bible module is based on the Open Scriptures MorphHB project and contains Strong’s tagging to allow for easy dictionary lookup. Our current effort is to prepare a system for collaboration on parsing the morphology of the text. If you have some facility with Hebrew and would like to contribute, follow this blog for further announcements. Note that the module in the CrossWire repository follows King James versification, while a separate module in the experimental repository has the original versification of the Leningrad Codex.

Multimedia from BibleTech:2010

The BibleTech:2010 conference was a great time! On Friday night we had the largest birds-of-a-feather session centered around Web APIs for Scripture.

Open Scriptures API: Unified Web Service for Scriptural Linked Data

By Weston Ruter

Video: Vimeo
Slides: Keynote
Audio: MP3

Using Pinax and Django For Collaborative Corpus Linguistics

By James Tauber

Watch the video on Vimeo, and see also his talk on A New Kind of Graded Reader.

More audio, slides, and potentially video of additional talks will be posted on the BibleTech speakers page.

Linked Data as the Core Product

There’s been some recent buzz surrounding Bob Pritchett’s talk Network Effects Support Premium Pricing at the O’Reilly TOC Conference. The exciting thing about Bob’s message is that it is already at the core of what Open Scriptures is about. As Sean Boisen quotes from the Huffington Post:

When you purchase a book from [Logos], you’re not just getting a static ebook, you’re buying into a dynamic, integrated online application environment that becomes richer with each new publication, and with each new member to their community.

Further, Antoine Wright writes in The Network Effects of Bible Software:

Wish that MMM could take credit for this line of thought, but really, this is where mobile and web are going. The idea is that the effects of mature networks and platforms are going to turn traditional models of software ownership on its head. Those companies who lead or adapt quickly to this trend will find the business side of the connected economy easier to deal with. Those who wish to lock people into the former model will have a harder time growing marketshare, and might find their content – while the same as a network/platform – diminished in value because it cannot be extended by the user or user communities to draw even more relevance and value from it. Get your networks/platforms/apps ready, things are changing.

So the key offering of Logos Bible Software is not simply a collection of electronic resources, but rather it is the network of resources; that is, their key product is Linked Data, which is more than the sum of its parts. This is why Logos has the edge on all of the free resources out there: they have created links between these free resources, and it is these proprietary links that they sell as their product. Each link from a data point becomes a connection, a bridge that opens it up to a network of related information which adds an immeasurable amount of value to the data. So like Logos, the Open Scriptures project seeks to forge links between resources to create a network of data. Our aim diverges from Logos in that we want to make these links openly available for free under the Creative Commons Share-Alike license, and to provide a standardized API with which to develop applications on top of the linked scriptural data. Our goal is not to destroy Logos’ core product of linked data, but to provide a core subset of linked scriptural data that can be used to power applications of scripture, and to do so embracing the Web as the open platform.