Open Scriptures

Platform for the development of open scriptural linked data and its applications.

“An Economic Argument for Free Primary Data”

Our colleague Efraim Feinstein at the Open Siddur Project wrote an excellent blog post, “An Economic Argument for Free Primary Data”. Here are the introductory paragraphs:

There are two principles on which the success of data on the contemporary web rests: the web makes content available, and it adds value to that content by linking it to other related information.

When considering bringing old content online, both of these aspects are important. A first level of digitization involves simply making data available. Google Books and Hebrewbooks.org work at this level, providing PDFs and/or OCR-ed transcriptions of the material. A second level of digitization involves semantic linkage of the data, both internal and external to the site. The Open Siddur Project, Tagged Tanakh and Open Scriptures digitize at the semantic level. This second-level digitization is required to do all of the cool things we expect to be able to do with online texts: click on a word and find its definition or grammatical form, find the source of a passage in one text in another text, find how the text has evolved historically, etc. Even the simplest form of a link, a reference from another site, requires some kind of internal division.

Digitization that takes advantage of the web therefore requires a number of steps: (1) getting the basic text online, (2) getting it in an addressable form (to make it more like typed text, instead of a picture of a page), (3) assuring the text’s accuracy, and (4) marking it up for semantic linkage. Some of these steps, or parts of them, can be done automatically, but, overall, they require some degree of intelligent input. Even step 1, which is primarily mechanical in nature, requires designing the procedures.

I hope that this outline of the required steps to getting a text online suggests that the most expensive part of making content available is human labor — it takes time to do it, and it takes even more time to do it right.
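To make step 2 concrete, here is a minimal sketch (my own illustration, not from Efraim’s post) of turning a flat transcription into addressable text. The osisID convention follows real OSIS practice, but the input data is invented:

```python
# A minimal sketch of "second-level" addressability: give each verse a
# stable identifier so other sites can link to it. The transcription
# dictionary is invented; the osisID convention follows OSIS practice.
import xml.etree.ElementTree as ET

transcription = {
    "Gen.1.1": "In the beginning God created the heaven and the earth.",
    "Gen.1.2": "And the earth was without form, and void...",
}

root = ET.Element("div", {"type": "book", "osisID": "Gen"})
for osis_id, text in transcription.items():
    verse = ET.SubElement(root, "verse", {"osisID": osis_id})
    verse.text = text

# Each verse is now individually addressable: a link can point at
# "Gen.1.1" rather than at a picture of a page.
print(ET.tostring(root, encoding="unicode"))
```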

Continue reading the rest of the post!

Stand-off Markup in OSIS

I have been researching stand-off markup and discussing (and discussing) on osis-users its applicability as a normalized way for OSIS to represent scripture in a single, consistent XML structure. I was initially introduced to stand-off markup by Efraim Feinstein of the Open Siddur project, and then further by James Tauber (1, 2, 3, 4, 5). I just came across the following quote in TEI P5: Non-hierarchical Structures:

It has been noted that stand-off markup has several advantages over embedded annotations. In particular, it is possible to produce annotations of a text even when the source document is read-only. Furthermore, annotation files can be distributed without distributing the source text. Further advantages mentioned in the literature are that discontinuous segments of text can be combined in a single annotation, that independent parallel coders can produce independent annotations, and that different annotation files can contain different layers of information. Lastly, it has also been noted that this approach is elegant.

But there are also several drawbacks. First, new stand-off annotated layers require a separate interpretation, and the layers — although separate — depend on each other. Moreover, although all of the information of the multiple hierarchies is included, the information may be difficult to access using generic methods.

In the current OSIS schema, one structure (e.g. verse, quote, or paragraph) has to be chosen as primary, leaving the others to be represented with milestoned elements (actually, only verses can currently be chosen as primary); stand-off markup would allow overlapping hierarchies of verses, quotes, and paragraphs to all stand on equal footing, without having to choose one as primary. Is this good for OSIS? Please join in the discussions at osis-users!
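As a rough illustration (my own sketch, not anything from the OSIS schema or the TEI Guidelines), stand-off layers can be modeled as annotations that point into a shared token stream, so no hierarchy has to be demoted to milestones:

```python
# Stand-off markup in miniature: verse, quote, and paragraph layers all
# reference (start, end) positions in one base token stream, so they may
# overlap freely. The tokens and annotation IDs are invented examples.
tokens = ["And", "God", "said", ",", "Let", "there", "be", "light", ":",
          "and", "there", "was", "light", "."]

layers = {
    "verse":     [("Gen.1.3", 0, 14)],
    "quote":     [("q1", 4, 9)],     # overlaps the verse without nesting
    "paragraph": [("p1", 0, 14)],
}

def spans(layer):
    """Resolve one layer's annotations back into text."""
    for ann_id, start, end in layers[layer]:
        yield ann_id, " ".join(tokens[start:end])

for ann_id, text in spans("quote"):
    print(ann_id, "->", text)   # q1 -> Let there be light :
```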

Upcoming Talks at BibleTech:2010


A couple of us are presenting at the BibleTech:2010 conference in San Jose, California. Weston Ruter is presenting:

Open Scriptures API: Unified Web Service for Scriptural Linked Data

The OSIS XML standard allows a lot of free variation in the way it represents scriptural constructs (such as verse boundaries). Because of this, different OSIS documents encoding the same work may have vastly different DOM trees, which makes automated traversal of arbitrary OSIS documents very difficult. Aside from this, the DOM is not a very programmer-friendly way to query for scriptural data to begin with. In this era of web services and mashups, having a standard, unified way to access scriptural data is a prerequisite for scriptural applications to take off in the same way that applications based on other common datasets (such as maps) have. Furthermore, these scriptural datasets should all be explicitly interconnected as Linked Data of the Semantic Web, so that any metadata attached to a word in one translation would also be available to any other translation or manuscript by means of their interconnections. So while OSIS XML is “a common format for many visions”, this talk will explore “a common API for many datasets”; it will be a continuation of BibleTech:2009’s talk, “Open Scriptures: Picking Up the Mantle of the Re:Greek – Open Source Initiative”.
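As a purely hypothetical taste of what querying such a unified web service might look like (the endpoint, URL scheme, and response fields below are all invented, not the actual Open Scriptures API):

```python
# A hypothetical client sketch for a unified scriptural-data API. The
# base URL, paths, and JSON fields are invented for illustration only.
import json
import urllib.request

BASE = "http://api.openscriptures.example/v1"  # invented endpoint

def get_passage(work, osis_ref):
    """Fetch one passage from one work by its OSIS reference."""
    url = "%s/works/%s/passages/%s" % (BASE, work, osis_ref)
    with urllib.request.urlopen(url) as response:
        return json.load(response)

# Because the tokens are linked data, metadata attached to a word in one
# work could be followed across to its parallels in another:
passage = get_passage("Tischendorf", "John.1.1")
for token in passage.get("tokens", []):
    print(token["text"], token.get("links", []))
```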

In addition to “A New Kind of Graded Reader”, James Tauber is presenting:

Using Pinax and Django For Collaborative Corpus Linguistics

Django is a popular, Python-based web framework. Pinax is a platform for rapidly building sites on top of Django, particularly sites with a strong collaborative focus.

After introducing Django and Pinax, we will discuss Pinax-based tools the speaker is developing to help with web-based collaboration on corpus annotation, with applications ranging from lexicography and morphology to syntax and discourse analysis.
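For readers unfamiliar with Django, here is a hypothetical sketch (not James Tauber’s actual code) of the kind of model a collaborative annotation site might define:

```python
# A hypothetical Django model sketch for collaborative corpus annotation;
# the model names and fields are invented for illustration.
from django.conf import settings
from django.db import models

class Token(models.Model):
    """One word of the corpus, in document order."""
    work = models.CharField(max_length=50)
    position = models.PositiveIntegerField()
    text = models.CharField(max_length=100)

class Annotation(models.Model):
    """A contributor's judgment about a token (lemma, parse, gloss...)."""
    token = models.ForeignKey(Token, on_delete=models.CASCADE)
    layer = models.CharField(max_length=20)
    value = models.CharField(max_length=100)
    annotator = models.ForeignKey(settings.AUTH_USER_MODEL,
                                  on_delete=models.CASCADE)
```

Because every annotation records its annotator, independent coders can layer competing analyses over the same tokens, which fits the strong collaborative focus that Pinax targets.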

And there are many other fascinating talks lined up. Hope to see you there!

What Good Is Linked Data?

Note: this is a conceptual overview; for a technical look, see here or here.

To follow up my previous post concerning raw data, I thought it would be good to discuss linked data. First of all, it must be emphasized that linked data cannot exist without access to raw data. So “raw data now,” then linked data.

This whole notion of linked data is really the idea of making data useful, truly useful. At a high level, a good example of linked data is Wikipedia. In particular, take a look at this article about BSD. As I type that sentence, I realize that many do not know that BSD stands for Berkeley Software Distribution. Nor would many guess that BSD happens to be the precursor to many operating system flavors, among them FreeBSD, NetBSD, Mac OS X, and DragonFlyBSD. The point I want to extract from the Wikipedia article is that it contains a wealth of information, but also a wealth of links that one can follow. Thus, if one wanted to learn about FreeBSD, the Wikipedia BSD article already has a link to it. Further, one could read the FreeBSD page and find a nice graphical derivative, PC-BSD. Without the basic implementation of links, these correlations would be much more difficult to come by.

On the internet, linking is the way to go. If we zoom in a little, we notice some interesting features of linking. Let’s stick with Wikipedia: its main page boasts 27 different languages, which is impressive. Now return to our BSD article and, at the bottom left, select another language. You now have the same information in a completely different language. The data on the German page and the data on the English page should conceptually be the same information, yet because it is presented in two different languages the article is useful to many more people. Multiply that by 27 and it is easy to see why Wikipedia has gained incredible worldwide appreciation. How many languages can you get Encyclopedia Britannica in?

Alright, so those examples deal mainly with information in the form we are used to seeing online: web pages. What happens when we take a look at the data itself? Tim Berners-Lee uses census data as an example in his TED talk, but I thought it would be more interesting to look at Scriptural data. In the field of Biblical Studies we have a lot of manuscripts. What we don’t have is easy access to those manuscripts, nor easy methods to compare them. However, that is changing! As more of these manuscripts become available online (see these projects), we gain the ability to link them together. The Manuscript Comparator is a prototype of this linkage. What the prototype accomplishes is systematically linking the data found in the manuscripts for simplified and complete comparison. Sure, someone could get hard copies of each manuscript and compare them manually. But anyone who has done ancient language study will appreciate the beauty and simplicity of this application. To simply type in the passage one is studying and then be able to easily view discrepancies is a huge resource! Not only that, but it demonstrates the power of linked data.
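To see why linking matters here, consider a toy sketch of the idea (not the Manuscript Comparator’s actual implementation): once manuscripts share verse identifiers, comparison reduces to a lookup plus a diff. The witnesses and readings below are invented:

```python
# Toy manuscript comparison over shared verse identifiers. The witnesses
# and readings are invented; only the linking idea is the point.
import difflib

manuscripts = {
    "MS-A": {"John.1.1": "εν αρχη ην ο λογος".split()},
    "MS-B": {"John.1.1": "εν αρχη ην ο λογος και".split()},
}

def compare(ref, ms_a, ms_b):
    """Show word-level discrepancies between two witnesses of one verse."""
    a, b = manuscripts[ms_a][ref], manuscripts[ms_b][ref]
    for line in difflib.unified_diff(a, b, fromfile=ms_a, tofile=ms_b,
                                     lineterm=""):
        print(line)

compare("John.1.1", "MS-A", "MS-B")
```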

This is only the beginning for Biblical Studies. If you want to see what the collective mind of Open Scriptures dreams about when we consider linked data, check out the Potential Applications page.

Who Cares About Raw Data?

One of the core ideas upon which Open Scriptures is based is open access to raw data. This concept was brought to wide attention (though not originated) by Tim Berners-Lee in his TED talk, “Tim Berners-Lee on the next Web.” The recurring phrase throughout that talk is “raw data now.” Coupled with this idea is the notion of linked data, sometimes called “Web 3.0.” Here I set out to explain why these concepts matter to Open Scriptures.

So what is open access to raw data, and who really cares? To the average user of the internet, raw data is both trivial and essential. It is trivial mainly because raw data by itself is not terribly interesting or useful. However, raw data is absolutely essential because it is what drives the most popular websites in the world. The key is how the raw data is linked together.

A very good analogy is that of a research paper. When one sets out to write a detailed research paper, a first step is to collect information. Often this is a very lengthy process, involving many hours online and in the library reading articles, books, and anything else that pertains to the paper’s topic. A common technique for keeping track of all this information used to be 3×5 index cards, but it is safe to say that there are computer programs that do a much better job today, e.g. Zotero. Once this information-gathering phase is finished, the writer has a formidable amount of raw data. Yet, as mentioned above, this raw data is not particularly useful on its own. If the writer were to simply submit all of these separate pieces of information to the publisher, teacher, or newspaper, the paper would clearly be rejected. The reason: raw data needs to be linked in meaningful ways.

This is where the second part of the writing process comes into play: actually writing. The author takes all of the raw data that was collected and sets out to tie it together into a meaningful piece of literature. Ideally, the finished product will contain most of the raw data, but the paper will clearly demonstrate how each piece of information is related to the others and, perhaps most importantly, how each piece supports the writer’s thesis statement.

To the point: raw data is the essential first step in the process of presenting information in meaningful and helpful ways. Thus, even though most web users do not seem to care about raw data, in reality they care a great deal. Content providers need to put their raw data online in a way that is accessible to developers, so that developers can do their job of creating applications that make the data useful for the rest of the world.
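What “accessible to developers” can mean in practice is simply serving the same content in a machine-readable form. A minimal sketch (the data and URL scheme are invented):

```python
# Raw data made developer-accessible: the same verse served as JSON
# instead of only as a rendered page. Data and URL scheme are invented.
import json

raw_verses = {
    "Gen.1.1": "In the beginning God created the heaven and the earth.",
}

def handle_request(path):
    """e.g. GET /data/Gen.1.1 returns JSON a developer can build on."""
    ref = path.rsplit("/", 1)[-1]
    return json.dumps({"osisID": ref, "text": raw_verses[ref]})

print(handle_request("/data/Gen.1.1"))
```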

Open Scriptures is committed to fostering the development of raw data on the internet so that developers will have access to the data that they need to create great web applications!  For an example of how raw data (manuscripts) may be linked together to create helpful web applications, see our Manuscript Comparator.

This only scratches the surface. There is much more to raw data, and especially to linked data, than what is presented here. For more information, see http://www.w3.org/DesignIssues/LinkedData.html, and look forward to another post detailing linked data.