At the moment the Open Scriptures project is working on developing an internet API for querying scriptural text and metadata. The basic task is to create “a common API for many datasets.” However, before the API can be implemented, the underlying relational database models must be established. To that end, Weston has been working on implementing database models for the API using Django (the development platform for Open Scriptures).
One of the most challenging aspects has been finding out how to record structural information about the text: verses, chapters, title headings, etc. There’s also been a desire to not rely on any particular structural marker in the database’s organization. So the base unit for storing the text is not a chapter or verse, but what is called a “token”. A token is comprised by one of the three atomic structures of a text – word; punctuation; whitespace. Of course, there may be cases where even the basic token can be split, but you’ve got to start somewhere.
To provide structure, Weston has written a token linkage system, where you can define a certain structure (e.g. “Verse 12″) and, using the features of a relational database, connect it to the tokens which should be included in that structure. There is even a feature for non-linear token linkages, if anyone finds a use for that.
Another piece of the puzzle is deciding how to express various types of metadata about the text in the database. One important type of metadata is the parsing information for Greek or Hebrew tokens. That parsing information could be provided in a simple string (e.g. “verb PAI3S”) which client applications would then have to interpret based on established conventions. This is not ideal, however, since it would seriously hamper the querying power of the API. It is best to instead use a database model for parsings. The challenge here comes in supporting multiple languages. Once again, relational database features will assist with this problem, since we can assign one Greek or Hebrew (or any other language) parsing to each token’s metadata. If there is a difference of opinion on a parsing, we can even store multiple parsings for each token.
I am optimistic about the potential of this project. Once the API is nailed down, there will be a lot of great opportunities for “client” apps, using whatever framework they wish. Until then, the API has to be finalized and garnished with built-in methods, and the models have to be tested with real data (which requires that the data be ported to the models in the first place). This is where we can use help from all sorts of people, from Python programmers to database experts to linguists and biblical scholars. It’s a good time to be interested in the scriptures and open source software.
Our colleague Efraim Feinstein at the Open Siddur Project wrote an excellent blog post on “An Economic Argument for Free Primary Data”. Here’re the introductory paragraphs:
There are two principles on which the success of data on the contemporary web rests: the web makes content available, and it adds value to that content by linking it to other related information.
When considering bringing old content online, both of these aspects are important. A first level of digitization involves simply making data available. Google Books and Hebrewbooks.org work at this level, providing PDFs and/or OCR-ed transcriptions of the material. A second level of digitization involves semantic linkage of the data, both internal to the site and external to the site. The Open Siddur Project, Tagged Tanakh and Open Scriptures digitize at the semantic level. This second-level digitization is required to do all of the cool things we expect to be able to do with online texts: click on a word and find its definition or grammatical form, find the source of a passage in one text in another text, find how the text has evolved historically, etc. Even the simplest form of a link: a reference from another site, requires some kind of internal division.
Digitization that takes advantage of the web therefore requires a number of steps: (1) getting the basic text online, (2) getting it in an addressable form (to make it more like typed text, instead of a picture of a page), (3) assuring the text’s accuracy, and (4) marking it up for semantic linkage. Some of these steps, or parts of them can be done automatically, but, overall, they require some degree of intelligent input. Even step 1, which is primarily mechanical in nature, requires design of the procedures.
I hope that this outline of the required steps to getting a text online suggests that the most expensive part of making content available is human labor — it takes time to do it, and it takes even more time to do it right.
Continue reading the rest of the post!