Scripture and database models
At the moment the Open Scriptures project is working on developing an internet API for querying scriptural text and metadata. The basic task is to create “a common API for many datasets.” However, before the API can be implemented, the underlying relational database models must be established. To that end, Weston has been working on implementing database models for the API using Django (the development platform for Open Scriptures).
One of the most challenging aspects has been figuring out how to record structural information about the text: verses, chapters, title headings, and so on. There has also been a desire not to rely on any particular structural marker in the database’s organization. So the base unit for storing the text is not a chapter or verse, but what is called a “token”. A token comprises one of the three atomic units of a text: a word, a punctuation mark, or whitespace. Of course, there may be cases where even the basic token can be split, but you’ve got to start somewhere.
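To make the idea concrete, here is a minimal sketch of tokenization into those three atomic units. The `Token` class and the regex are illustrative assumptions, not the actual Open Scriptures schema:

```python
import re
from dataclasses import dataclass

# Illustrative token types mirroring the three atomic units described above.
WORD, PUNCTUATION, WHITESPACE = "word", "punctuation", "whitespace"

@dataclass
class Token:
    position: int  # ordinal position in the text
    type: str      # one of WORD, PUNCTUATION, WHITESPACE
    data: str      # the literal text of the token

def tokenize(text):
    """Split a text into word, punctuation, and whitespace tokens."""
    tokens = []
    for i, match in enumerate(re.finditer(r"\w+|\s+|[^\w\s]", text)):
        chunk = match.group()
        if chunk.isspace():
            kind = WHITESPACE
        elif chunk[0].isalnum() or chunk[0] == "_":
            kind = WORD
        else:
            kind = PUNCTUATION
        tokens.append(Token(i, kind, chunk))
    return tokens

tokens = tokenize("In the beginning, God created")
# The comma becomes its own punctuation token between the word tokens.
```

In a real Django implementation each token would be a database row, so that structures and metadata can reference tokens relationally rather than by re-parsing the text.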
To provide structure, Weston has written a token linkage system, where you can define a certain structure (e.g. “Verse 12”) and, using the features of a relational database, connect it to the tokens that should be included in that structure. There is even a feature for non-linear token linkages, if anyone finds a use for that.
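The linkage idea can be sketched as a join table relating a named structure to token positions, much as a relational many-to-many relationship would. The class and field names below are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class Structure:
    name: str  # e.g. "Verse 12" or "Paragraph 3"
    token_positions: list = field(default_factory=list)

class TokenLinkage:
    """Join table relating structures to token positions."""
    def __init__(self):
        self.links = []  # (structure_name, token_position) pairs

    def link(self, structure, positions):
        # Positions need not be contiguous, which is what permits
        # non-linear (discontinuous) token linkages.
        for pos in positions:
            self.links.append((structure.name, pos))
            structure.token_positions.append(pos)

    def tokens_for(self, structure_name):
        return [p for (name, p) in self.links if name == structure_name]

linkage = TokenLinkage()
verse = Structure("Verse 12")
linkage.link(verse, [40, 41, 42, 50])  # a discontinuous span is allowed
```

Because the linkage lives in its own table rather than in the text, any number of overlapping structures can reference the same tokens.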
Another piece of the puzzle is deciding how to express various types of metadata about the text in the database. One important type of metadata is the parsing information for Greek or Hebrew tokens. That parsing information could be provided as a simple string (e.g. “verb PAI3S”) which client applications would then have to interpret based on established conventions. This is not ideal, however, since it would seriously hamper the querying power of the API. It is better to use a database model for parsings. The challenge here comes in supporting multiple languages. Once again, relational database features will assist with this problem, since we can assign one Greek or Hebrew (or any other language) parsing to each token’s metadata. If there is a difference of opinion on a parsing, we can even store multiple parsings for each token.
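The contrast between an opaque string like “verb PAI3S” and a structured parsing record can be sketched as follows. The field names are illustrative assumptions, and the real schema may differ:

```python
from dataclasses import dataclass

@dataclass
class GreekParsing:
    part_of_speech: str
    tense: str = ""
    voice: str = ""
    mood: str = ""
    person: str = ""
    number: str = ""

# Several parsings can be attached to one token, which is how a
# difference of scholarly opinion could be stored side by side.
token_parsings = {
    17: [GreekParsing("verb", tense="present", voice="active",
                      mood="indicative", person="3rd", number="singular")],
}

# Structured fields make queries straightforward, e.g. find every token
# parsed as a third-person singular verb:
matches = [pos for pos, parsings in token_parsings.items()
           if any(p.part_of_speech == "verb" and p.person == "3rd"
                  and p.number == "singular" for p in parsings)]
```

With the string representation, the same query would require every client to parse “PAI3S”-style codes itself; with a model, it is an ordinary database filter.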
I am optimistic about the potential of this project. Once the API is nailed down, there will be a lot of great opportunities for “client” apps, using whatever framework they wish. Until then, the API has to be finalized and garnished with built-in methods, and the models have to be tested with real data (which requires that the data be ported to the models in the first place). This is where we can use help from all sorts of people, from Python programmers to database experts to linguists and biblical scholars. It’s a good time to be interested in the scriptures and open source software.
“An Economic Argument for Free Primary Data”
Our colleague Efraim Feinstein at the Open Siddur Project wrote an excellent blog post on “An Economic Argument for Free Primary Data”. Here are the introductory paragraphs:
There are two principles on which the success of data on the contemporary web rests: the web makes content available, and it adds value to that content by linking it to other related information.
When considering bringing old content online, both of these aspects are important. A first level of digitization involves simply making data available. Google Books and Hebrewbooks.org work at this level, providing PDFs and/or OCR-ed transcriptions of the material. A second level of digitization involves semantic linkage of the data, both internal to the site and external to the site. The Open Siddur Project, Tagged Tanakh and Open Scriptures digitize at the semantic level. This second-level digitization is required to do all of the cool things we expect to be able to do with online texts: click on a word and find its definition or grammatical form, find the source of a passage in one text in another text, find how the text has evolved historically, etc. Even the simplest form of a link, a reference from another site, requires some kind of internal division.
Digitization that takes advantage of the web therefore requires a number of steps: (1) getting the basic text online, (2) getting it in an addressable form (to make it more like typed text, instead of a picture of a page), (3) assuring the text’s accuracy, and (4) marking it up for semantic linkage. Some of these steps, or parts of them, can be done automatically, but, overall, they require some degree of intelligent input. Even step 1, which is primarily mechanical in nature, requires design of the procedures.
I hope that this outline of the required steps to getting a text online suggests that the most expensive part of making content available is human labor — it takes time to do it, and it takes even more time to do it right.
Continue reading the rest of the post!
Stand-off Markup in OSIS
I have been researching stand-off markup, and discussing (and discussing) on osis-users its applicability as a normalized way for OSIS to represent scripture in a single, consistent XML structure. I was initially introduced to stand-off markup by Efraim Feinstein of the Open Siddur project and then further by James Tauber (1, 2, 3, 4, 5). I just came across the following quote in TEI P5: Non-hierarchical Structures:
It has been noted that stand-off markup has several advantages over embedded annotations. In particular, it is possible to produce annotations of a text even when the source document is read-only. Furthermore, annotation files can be distributed without distributing the source text. Further advantages mentioned in the literature are that discontinuous segments of text can be combined in a single annotation, that independent parallel coders can produce independent annotations, and that different annotation files can contain different layers of information. Lastly, it has also been noted that this approach is elegant.
But there are also several drawbacks. First, new stand-off annotated layers require a separate interpretation, and the layers — although separate — depend on each other. Moreover, although all of the information of the multiple hierarchies is included, the information may be difficult to access using generic methods.
In the current OSIS schema, one structure (e.g. verse, quote, or paragraph) has to be chosen as primary, leaving the others to be represented with milestoned elements (actually, only verses can currently be chosen as primary); stand-off markup would allow overlapping hierarchies of verses, quotes, and paragraphs to all be on equal footing without having to choose one as primary. Is this good for OSIS? Please join in on the discussions at osis-users!
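The core idea can be sketched without any XML at all: instead of nesting verse, quote, and paragraph elements inside one tree, each layer is an independent list of ranges over a shared token stream. The sample text and ranges below are invented for illustration:

```python
# A shared, read-only token stream; each stand-off layer references it
# by position rather than wrapping it in elements.
tokens = ["And", " ", "he", " ", "said", ",", " ", "Follow", " ", "me", "."]

# Each layer is a list of (start, end) token ranges, inclusive.
layers = {
    "verse":     [(0, 10)],   # the whole span is one verse
    "quote":     [(7, 10)],   # the quotation "Follow me."
    "paragraph": [(0, 10)],
}

def extract(layer, index):
    """Reassemble the text covered by one range in a stand-off layer."""
    start, end = layers[layer][index]
    return "".join(tokens[start:end + 1])

# Because each layer stands alone, overlapping hierarchies coexist
# without any one of them being chosen as primary.
```

This is only a sketch of the principle; the actual OSIS discussion concerns how such layers would be serialized in the XML schema itself.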
Upcoming Talks at BibleTech:2010
A couple of us are presenting at the BibleTech:2010 conference in San Jose, California. Weston Ruter is presenting:
Open Scriptures API: Unified Web Service for Scriptural Linked Data
The OSIS XML standard provides for a lot of free variation in the way it represents scriptural constructs (such as verse boundaries). Because of this, different OSIS documents encoding the same work may have vastly different DOM trees, which makes automated traversal of arbitrary OSIS documents very difficult. Aside from this fact, the DOM is not a very programmer-friendly way to query for scriptural data to begin with. In this era of web services and mashups, having a standard, unified way to access scriptural data is a prerequisite for scriptural applications to take off in the same way that applications based on other common datasets have (such as maps). Furthermore, these scriptural datasets should all be explicitly interconnected as Linked Data of the Semantic Web, so that any metadata attached to a word in one translation would also be available to any other translation or manuscript by means of their interconnections. So while OSIS XML is “a common format for many visions”, this talk will explore “a common API for many datasets”; this will be a continuation of BibleTech:2009’s talk: “Open Scriptures: Picking Up the Mantle of the Re:Greek – Open Source Initiative”.
James Tauber is presenting, in addition to “A New Kind of Graded Reader”:
Using Pinax and Django For Collaborative Corpus Linguistics
Django is a popular, Python-based Web framework. Pinax is a platform for rapidly building sites on top of Django, particularly sites with a strong collaborative focus.
After introducing Django and Pinax, we will discuss Pinax-based tools the speaker is developing to help with web-based collaboration on corpus annotation with applications from lexicography to morphology to syntax to discourse analysis.
And there are many other fascinating talks lined up. Hope to see you there!