Open Scriptures

“The next Web of open, linked data” for Scripture

I just ran across a relevant and extremely important TED talk by Tim Berners-Lee, inventor of the World Wide Web, regarding “The next Web of open, linked data,” in which he presents the case for Linked Data. Open Scriptures is a Linked Data initiative, seeking to integrate and semantically interlink all of the scriptural data available. Berners-Lee notes in his talk:

The really important thing about data is the more things you have to connect together, the more powerful it is.

This is so key. If you have raw data for two resources, there is only a limited number of ways that this data can be combined together. However, each time you add in another raw data set, the number of combinations grows exponentially. Check out his talk…

Again, Open Scriptures is foundationally about Linked Data for Scripture, granularly interconnecting scriptural texts at the semantic level, and making this data openly accessible. The more links that are added between texts, the more views and applications of scriptural data become possible that were never possible before!

Raw Data Now!

Initial Project Writeup

(I wrote the following project writeup back in October of last year, but it has not been published until now.)

At the BibleTech 2008 conference, James Tauber of MorphGNT identified the need for the wide array of scriptural data to share common references so that they could be integrated and mashed up. However, even if the data we have today shared common references, the ability to integrate this data would be out of reach for the general public. The Composite Gospel Index (CGI) at Semantic Bible, for example, identifies parallel pericopae in the gospels; it makes available XML data consisting of OSIS identifiers which identify and group parallel pericopae. The project includes a view of the data with the text of the RSV. If, however, an Arabic speaker wanted to view the parallel pericopae in an Arabic translation, they would be unable to do so unless they did it by hand or had experience writing applications which parse XML and query a Bible web service (if one even exists for Arabic translations); moreover, if someone else desired to view the CGI in another translation, then the same work done with the Arabic translation would have to be done all over again. Open Scriptures seeks to provide the unified data repository for serving scripture (such as a NT manuscript or an Arabic translation), the internationalized or language-independent derived scriptural data (such as the CGI), the API to query the data repository, and a hosted application framework which allows work done once for one translation to be immediately available for any other translation.

Open Scriptures is a repository for Biblical manuscripts and their translations, and a system for storing the differences between manuscripts and their relationships to versions expressed by semantic links: it seeks to represent the textual transmission of the Bible and, on top of this foundation:

supply an Open interface for querying interlinked scriptural data,
store derived data in an internationalized (i18n) and translation-neutral manner, and
provide an application platform for mashing this data up into scriptural applications with a framework for discussion and collaboration.

The most fundamental application of Open Scriptures is the comparison between one manuscript and another, between manuscripts (MSS) and their translations, and between the translations themselves. This base application provides new ways to do textual criticism, so that in addition to seeing the differences that exist, users can also discover and discuss why there are differences between MSS and translations. This is possible because the MSS are merged into a kind of “unified diff” and because the semantic units in each translation are linked to the semantic units in this unified MS diff from which they were translated. (These semantic-unit links are contributed by users who desire to have a translation added to the system.) At BibleTech 2008, Karl Hofmann set forth (MP3) a vision for just such a need in Bible software, to provide:

Not only what are the differences between texts, but why are there differences? We have to go back and find out the decisions that were made along the way. […] Software must allow us to discuss the text, not just concepts.
[…]
How is it that we are going to be able to get these tools to come together and provide the ability to make distinctions and show where the disagreements lie at a textual level as opposed to at a conceptual level. So instead of talking from my presupposition to your presupposition, we’re actually going to be able to talk about the text.
[…]
A tool that allows us to recognize the distinctions that are already being made, the judgments that are already there, and instead of trying to change your mind from the judgment you’ve made to the judgment I’ve made, to go back and find out from what point, at what point was that judgment made. Where was the distinction made, on what basis was it made? Was it made because I have a presupposition about how the text was read at a particular time? […] or is it made because of some semantic misunderstanding I have? All of the different ways that we can be judging can be brought to light and examined and at least I can make an informed decision…

Thus Open Scriptures seeks to present the texts of the Bible as social media, with the hope that the functionality provided will inspire collaborative interaction at the textual level.

A second fundamental benefit of having a unified manuscript and having each translation independently link to it is that any scriptural project undertaken using the text from one translation will automatically be available to any other translation that also has its semantic units source-linked. The plethora of Bible translations is causing a fragmentation and isolation of scriptural projects and resources. Countless word studies on any particular word have been independently undertaken, each time perhaps using a different translation or language. If such a word study were to be done on the Open Scriptures framework, multiple users may collaborate on a word study project using their own preferred translation and the results could be viewed in any translation; projects such as cross-references would also especially benefit from this. Scriptural projects become social media developed on a social web.

The applications that are possible on the Open Scriptures framework are exciting; it makes trivial the creation of bilingual editions, interlinear views, full-text searching and exhaustive concordances of words from multiple translations in a language, version-independent word studies and cross references with internationalized expositions.

Open Scriptures is part of the open source movement and will operate not-for-profit; it seeks to make the Scriptures available to the greatest number of people and provide them with the tools they need to study them and to freely share their findings. Open Scriptures will be built utilizing open web standards and best practices in web development to create an accessible RESTful web service. The state of the Web is such that all of the pieces are coming into place:

Unicode is now ubiquitous and so the text of the Biblical languages can be easily processed and delivered.
The Open Scriptural Information Standard (OSIS) provides the basic XML vocabulary necessary for encoding scriptural texts.
New efforts to photograph and transcribe Biblical manuscripts in electronic format are providing the necessary foundational textual data (see Biblical Manuscripts Project, CSNTM, and The Codex Sinaiticus Project), aside from the numerous manuscripts and resources that are already freely available, such as MorphGNT.
Web browsers are powerful applications which can handle complex web applications employing state-of-the-art technologies like Ajax, SVG, Canvas, and HTML 5.
Cloud computing (as in Google AppEngine, Aptana Cloud, or Amazon EC2) is maturing and provides an architecture to power high-demand processing of complex data structures.
Social media and the Social Web are instilling certain expectations with regard to how we interact on Web 2.0 (as with blogs, Facebook, and Wikipedia).
Google has raised expectations on the capabilities of web applications, their openness, and how data on the web can be mashed up.
Google and Wikipedia have accustomed us to having a central resource for finding information by bringing together diverse fragmented data into an integrated whole.
The Internet is becoming increasingly global and people from every tribe, tongue, and nation are coming onto the Web; the nations need an internationalized service for studying the Scriptures.
Mobile web devices are becoming increasingly popular due to the iPhone, so a web service accessible to mobile devices can satisfy the needs of a rapidly increasing number of mobile users.

Open Scriptures seeks to integrate all of these pieces. For thousands of years scribes copiously copied manuscripts by hand onto vellum or papyrus with a pen and ink. They took great care to ensure that the text was accurately and reverently transferred and made their manuscripts beautifully ornate works of art; they glorified God with the work of their hands. Now today, instead of pen and paper, we have Unicode and HTML; instead of scribes, we have software developers; instead of codices, we have websites. Open Scriptures seeks to apply the same level of skilled craftsmanship in web development as the scribes’ own skilled craftsmanship in the presentation of the Scriptures, all to the glory of God and the edification of His people.

As Zack Hubert said at the conference last year, “It’s a community effort. Any time anything good happens, is because a real cool team of people have come together around an idea.” Open Scriptures seeks to be such a community effort.

Redeeming the Ill-fated Re:Greek Project: a Call for Participation

Update 3: Looking for a website that provides the same core functionality of Zhubert.com? Check out John Dyer’s excellent Reader’s Version of Greek and Hebrew Bible. It’s free!

Update 2: See post on the group regarding how to respond to copyright issues.

Update 1: zhubert.com has been shut down due to the discovery that the MorphGNT has infringed on the German Bible Society‘s copyright of the UBS4 edition of the Greek New Testament. Please do not flame zhubert.com or the German Bible Society or anyone else in the comments here. Each is doing what they feel is right given the situation. We are working on a solution to the copyright issue.

Zack Hubert (former Pastor of Technology at Mars Hill Church, now Vice President of Church Community Networks at Zondervan after they acquired The City) spoke (MP3) last year at BibleTech about the history of his free NT Greek web app zhubert.com. After years of development he came to a point in life where he had no time to maintain the project, let alone take it to another level. He shared his desire to further the project and his first plan to do so:

(22:32) What can I do to keep this project continuing since I only had 15 minutes a week to go in and either approve a lexicon entry or make a little bug fix. So I came up with a plan. The plan was the Biblios foundation. (I say past-tense for good reason.) So I started off down the road of creating a non-profit organization. And the idea was that it could accept donations and that I could have this self-supporting organization where I’d draw no income but I could use it to contract out developers… you know find other guys that were right out of college or in seminary or what not and give them $15 an hour or some ridiculously small amount to add on the features and functionality that I think really would have taken zhubert to another level… but I’m not a business man, and yeah… it failed. Quite simply, looking back on it I think people just wanted the software for free. […] (24:25) So development of that just took a break. I was kinda disappointed honestly because I really wanted to see this project go, but there seemed like very few people who were getting behind to help me, so I took a mental break from the project.

He then shared his next plan, to transition his privately-developed closed-source project into an open-source community-driven one called the “Resurgence Greek Project” or simply “Re:Greek”. In an interview (via Wayback Machine since original post inaccessible) he shared:

My vision for Re:Greek is that it would be an Open Source project that could capture the imagination of software developers all around the world. I want to see dozens of developers and designers submitting improvements, new features, localizing Re:Greek into their local language, and innovating on this shared platform. I’ll release more in the near future about how this will work, but it will be a model similar to Linux or Ruby on Rails if you are familiar with those projects.

In another post (also via Wayback Machine since original inaccessible) he was “happy to announce that we have made Re:Greek an Open Source Initiative!” Unfortunately, as you’ll learn if you listen to his BibleTech talk (see above), the initial excitement was extinguished and the project transition failed. People who were used to the old zhubert.com were upset with the new Re:Greek project and just wanted the old site back. There weren’t enough developers and supporters to sustain the Re:Greek open-source project, and so it ceased. I don’t know all of the details, but the project was so thoroughly discontinued that Zack’s announcement blog posts have been removed and the ReGreek Google Group has been deleted. Zack seems to have washed his hands of the Re:Greek project and has thereafter focused all his efforts on The City (which is turning out to be a great success). Such a sad turn of events for such a brilliant idea!

At BibleTech this year, I am presenting the Open Scriptures project as “Picking Up the Mantle of the Re:Greek – Open Source Initiative”. We need to give Zack’s idea a second chance, but we need to learn from his experience. Open Scriptures will also fail if it does not have the collaboration and co-ownership of committed developers and the sponsorship of supporting organizations. However, if the Mozilla Foundation can make Firefox such a success, if the Linux community can persevere to make a solid open-source OS, and if Wikipedia is able to become the global first source for encyclopedia information, then surely there are people of faith—who believe their scriptures are the very words of God—who can make Open Scriptures a reality.

Will you join me?

Join the conversation at the Open Scriptures Google Group and spread the word.

The Tagged Tanakh and Semantic Linking

Update: Please also read James Tauber’s parallel Thoughts on GNT-NET Parallel Glossing Project.

A few days ago I posted to the group regarding the Tagged Tanakh project which, like Open Scriptures, is being presented at BibleTech at the end of this month. I had read their abstract a few months ago and was really interested, but only last week was their project unveiled. A month previously they teased:

Obviously, once the term “Web 2.0” was coined, Web 3.0 couldn’t be far behind. If Web 2.0 is about bringing individuals together via the Internet, then Web 3.0 is about bringing various sources of information together. At this point in time, no one has figured out a popular application of web 3.0 tools, but that doesn’t mean organizations aren’t trying. […]

The Tagged Tanakh is our effort to bring Torah into the Linked Data ecosystem that is emerging around us. Ensuring that our content is accessible and conforms to standardized and open structures is of paramount importance.

Reading this made me realize that Open Scriptures is a “Web 3.0” project and is a Semantic Web initiative. Furthermore, YAVNET pointed to the article “Torah 2.0: Old-Line Publisher Brings Biblical Commentary Into Online World“:

The Jewish Publication Society, a 120-year-old organization devoted to publishing ancient and modern texts on Jewish subjects, has begun work on a project to publish the Jewish Bible, or Tanakh, as an electronic, online text, integrating the original Hebrew with JPS’s English translation and selected commentaries. But the most radical part of the project is an ambitious plan to make the text of the Tanakh into an open platform for users of all stripes to collectively erect their own structure of commentary, debate and interpretation, all linked to the text itself. The publishers hope that the project will radically democratize the ancient process of Talmudic disputation by bringing it into cyberspace.

This is a really exciting project! Notice how similar it sounds to Open Scriptures. The fundamental concept in common is Linked Data. It sounds like they have a plan for linking commentaries and user tags, but that they don’t yet know how they’re going to link their English translation with the Hebrew text, as the article continues (emphasis added):

Though JPS has a fairly clear concept of how it wants the project to turn out, there are major obstacles to navigate. JPS must figure out how to link across the original Hebrew and the English translation, how to screen users to ensure that their comments are appropriate, how to integrate existing software tools, and how to create an architecture flexible enough to incorporate new and unexpected functions that may not yet even exist.

I’m convinced that the best way to link translations with their source texts is via the granular interlinking of the smallest corresponding semantic units. If the individual semantic units (e.g. words) in two texts are all individually addressable (by being stored individually in a database), then a data structure can be constructed wherein a cluster of semantic units in one text can be linked to the smallest equivalent semantic unit cluster in another text. This data structure has been fleshed out in the Open Scriptures database schema, and the Manuscript Comparator application is powered by manuscripts that have gone through this semantic linking process (though it is a simple example, since all the manuscripts are in one language, Greek).

The concept underlying this semantic linking is the dynamic equivalence method in translation, to quote Wikipedia: “The dynamic (also known as functional equivalence) attempts to convey the thought expressed in a source text (if necessary, at the expense of literalness, original word order, the source text’s grammatical voice, etc.)”. When finding equivalences between two texts in different languages, there are many cases when a single word in one language does not correspond to one single word in the other. For example:

English: I like to study the Bible.
Spanish: Me gusta estudiar la Biblia.

The Spanish way of expressing “to like” is the idiom “to be pleased by”, so the equivalent of “I like” is “Me gusta” (a two-for-two correspondence). Likewise, to a lesser degree, “to study” and “estudiar” are both infinitives, but Spanish infinitives are formed with a suffix instead of with a particle as in English (a two-for-one correspondence). These sentences serve as a simple example, but consider also very free translations like The Message, which may be so paraphrased that individual words and phrases cannot be linked at all but rather only whole sentences and paragraphs: the more free a translation, the longer the smallest common units of meaning; the more literal (word-for-word) a translation is, the shorter the smallest common units of meaning.

Therefore, for there to be a data structure that represents the semantic equivalences between these two sentences, there must be a many-to-many relationship that links individual semantic units for each smallest-unit of corresponding meaning in two texts. In the Open Scriptures database schema, the following entities are employed (see SQL and SchemaGraph):

token is an individual semantic unit (a word),
token_group represents a correspondence of meaning across texts,
token_group_cluster groups tokens from one text and associates them with a token_group,
token_group_cluster_token associates a token with a token_group_cluster (a bit verbose I know)
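As a rough illustration of how these entities fit together, here is a minimal Python sketch that models them in memory rather than in SQL, using the English and Spanish sentences from above. The field names (text_id, position, form) and the helper functions tokenize, link, and render are assumptions made for illustration; they are not the actual columns or code of the Open Scriptures schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Token:
    """An individual semantic unit (a word) in one text."""
    id: int
    text_id: str    # hypothetical identifier of the text the token belongs to
    position: int   # word order within that text
    form: str


@dataclass
class TokenGroupCluster:
    """The tokens from ONE text that take part in a correspondence.

    The `tokens` list plays the role of the token_group_cluster_token rows.
    """
    text_id: str
    tokens: List[Token] = field(default_factory=list)


@dataclass
class TokenGroup:
    """A correspondence of meaning across texts: one cluster per text."""
    clusters: List[TokenGroupCluster] = field(default_factory=list)


def tokenize(text_id, sentence, start_id):
    """Split a sentence into Token objects (naive whitespace tokenizer)."""
    words = sentence.rstrip(".").split()
    return [Token(start_id + i, text_id, i, w) for i, w in enumerate(words)]


english = tokenize("en", "I like to study the Bible.", 1)
spanish = tokenize("es", "Me gusta estudiar la Biblia.", 101)


def link(en_positions, es_positions):
    """Build one token_group joining an English and a Spanish cluster."""
    return TokenGroup([
        TokenGroupCluster("en", [english[i] for i in en_positions]),
        TokenGroupCluster("es", [spanish[i] for i in es_positions]),
    ])


groups = [
    link([0, 1], [0, 1]),  # "I like"   <-> "Me gusta"  (two-for-two)
    link([2, 3], [2]),     # "to study" <-> "estudiar"  (two-for-one)
    link([4], [3]),        # "the"      <-> "la"
    link([5], [4]),        # "Bible"    <-> "Biblia"
]


def render(group):
    """Show a token_group as 'cluster <-> cluster'."""
    return " <-> ".join(
        " ".join(t.form for t in cluster.tokens) for cluster in group.clusters
    )


for g in groups:
    print(render(g))
```

Because a token_group can hold clusters of any size from any number of texts, the same structure would cover both word-for-word links and the sentence-sized links a paraphrase would need.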

Now, with regard to the Manuscript Comparator, a shortcut is being taken since mono-lingual manuscripts are being linked together, and because of this, the semantic links are always one-to-one (and so the above schema entities aren’t required) and furthermore the links may be constructed simply with an algorithm which compares normalized strings of Greek text. However, to link texts from different languages, as Open Scriptures and the Tagged Tanakh projects are seeking to do, much more sophisticated techniques will be necessary to create the semantic links. Instead of creating complex natural language processing algorithms (as Google Translate has done), the most straightforward way of creating the semantic links (in the spirit of Web 2.0) is to utilize collective intelligence. Greek and Hebrew students from around the world could flex their language muscles by making passes over the manuscripts and constructing semantic links with their favorite translations in their native languages. (Imagine if language professors assigned such a task as homework.) Each time someone makes a semantic-linking pass, the semantic links would improve as collective intelligence causes mistakes and bad data to become statistical outliers (and thus be ignored).
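One simple way the collective-intelligence idea could work is a majority vote over the links proposed in each pass. The following Python sketch is only an assumption about how such aggregation might be done; the function name consensus_links, the min_share threshold, and the representation of a link as a pair of position tuples are all hypothetical and not part of any existing Open Scriptures code.

```python
from collections import Counter


def consensus_links(passes, min_share=0.5):
    """Keep only the links proposed in more than `min_share` of the passes.

    Each pass is a set of links; each link pairs a tuple of source-token
    positions with a tuple of translation-token positions (hypothetical
    representation chosen for this sketch).
    """
    counts = Counter(link for one_pass in passes for link in set(one_pass))
    total = len(passes)
    return {link for link, n in counts.items() if n / total > min_share}


# Three contributors link "I like to study" with "Me gusta estudiar".
passes = [
    {((0, 1), (0, 1)), ((2, 3), (2,))},
    {((0, 1), (0, 1)), ((2, 3), (2,))},
    {((0,), (0,)), ((2, 3), (2,))},  # one contributor links "I" to "Me" alone
]

agreed = consensus_links(passes)
print(sorted(agreed))
```

The lone dissenting link appears in only one of three passes, so it falls below the threshold and is dropped, which is the sense in which bad data becomes an outlier that is ignored.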

As I’ve mentioned previously, an old semantic linker prototype is available which serves as a simple proof of concept for how collective intelligence can interlink texts.

Kyle Biersdorff and I have talked about this a bit, and he wisely pointed out that there needs to be a way of storing semantic links that indicate not just equivalence but also variance. As we’ve discussed here regarding the linking of the Masoretic Text with the LXX, there are many places where the LXX translates Hebrew loosely; it conveys some of the same meaning but also includes additional nuances. Take, for example, Isaiah 7:14 (ESV):

Therefore the Lord himself will give you a sign. Behold, the ____ shall conceive and bear a son, and shall call his name Immanuel.

In the blank, the Hebrew word (עלמה, almah) means “young woman” or “maiden” whereas the LXX word (παρθένος, parthenos) means “virgin”. This is a case where the LXX meaning is more specific than the Hebrew. I’m sure there are instances of the reverse too, where the LXX’s meaning is more generic. (It would be great to compile a list of these semantic variance link types.) In such a case, perhaps there could be an equivalence rating from 0.0 to 1.0, where 1.0 means an exact translation and 0.0 means complete variance; for virgin/maiden, perhaps a rating of 0.75 would be appropriate; when multiple people make passes over the data and provide ratings like this, we could then average them out to get a rating that is validated by collective intelligence.
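A minimal Python sketch of the averaging idea follows. The function name consensus_rating and the sample ratings are made up for illustration; only the 0.0 to 1.0 scale and the idea of averaging come from the paragraph above.

```python
from statistics import mean


def consensus_rating(ratings):
    """Average contributor ratings, each between 0.0 (complete variance)
    and 1.0 (exact translation)."""
    ratings = list(ratings)
    if not ratings:
        raise ValueError("at least one rating is required")
    for r in ratings:
        if not 0.0 <= r <= 1.0:
            raise ValueError(f"rating out of range: {r}")
    return mean(ratings)


# Hypothetical ratings from three contributors for the almah/parthenos link.
almah_parthenos = [0.75, 0.5, 1.0]
print(consensus_rating(almah_parthenos))
```

A plain mean is sensitive to a single careless rating; a median or a trimmed mean might serve better if outliers are to be ignored, but that is a design choice left open here.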

I’m going to contact the Tagged Tanakh project and see if we can partner together and collaborate to avoid duplicate efforts.

Manuscript Comparator and the Open Scriptures Platform

(Originally posted on my personal blog.)

For the past several weeks all of my free time has gone into building the first application for Open Scriptures. For many months I had been working on designing the database and in December I finally got it to a point where it could store all of the necessary information so that application development could begin. The first application developed is the Manuscript Comparator. This application demonstrates what is possible when the semantic units of individual texts are linked together—when the interrelationships between semantic units are stored in a database and can be queried.

The database is constructed as follows: various manuscripts available on the Web today are each imported into the database individually, storing each of a manuscript’s words (tokens) separately with a unique identifier for each. After all of the individual manuscripts have been imported, they are then all merged together into a unified manuscript. The merging algorithm normalizes the text for comparison by removing all casing, diacritics, and punctuation; the unified manuscript stored in the database is composed of these normalized words. So the result of the manuscript merge is a unified manuscript which consists of every possible variant attested to by the contributing manuscripts; furthermore, all of the tokens in an individual manuscript are linked back to their corresponding words in the unified manuscript. Thus every manuscript is linked to every other manuscript by means of their links to a common point, the unified manuscript.
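To make the normalize-then-merge idea concrete, here is a small Python sketch. It is not the actual Open Scriptures merging algorithm: the function names normalize and merge_pair are hypothetical, the alignment is delegated to the standard library's difflib.SequenceMatcher as a stand-in, it handles only two manuscripts, and the second sample text is an invented variant (the article is dropped) rather than a reading from any real manuscript.

```python
import unicodedata
from difflib import SequenceMatcher


def normalize(word):
    """Strip diacritics and punctuation, then fold case, for comparison."""
    decomposed = unicodedata.normalize("NFD", word)
    kept = [
        ch for ch in decomposed
        # Unicode categories starting with M are combining marks,
        # those starting with P are punctuation.
        if not unicodedata.category(ch).startswith(("M", "P"))
    ]
    return "".join(kept).casefold()


def merge_pair(ms_a, ms_b):
    """Merge two manuscripts (lists of words) into one unified word list.

    Returns (unified, links_a, links_b), where links_x[i] is the index in
    `unified` of the i-th token of manuscript x.
    """
    a = [normalize(w) for w in ms_a]
    b = [normalize(w) for w in ms_b]
    unified, links_a, links_b = [], [], []
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            # Both manuscripts attest these words: one shared unified token.
            for word in a[i1:i2]:
                links_a.append(len(unified))
                links_b.append(len(unified))
                unified.append(word)
        else:
            # Variant readings: keep every word attested by either side.
            for word in a[i1:i2]:
                links_a.append(len(unified))
                unified.append(word)
            for word in b[j1:j2]:
                links_b.append(len(unified))
                unified.append(word)
    return unified, links_a, links_b


ms_a = "Ἐν ἀρχῇ ἦν ὁ λόγος,".split()
ms_b = "ἐν ἀρχῇ ἦν λόγος".split()  # invented variant for illustration

unified, links_a, links_b = merge_pair(ms_a, ms_b)
print(unified)
print(links_a, links_b)
```

Note that case folding turns a final sigma into a medial one, so the normalized forms are meant for comparison only, not for display.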

With the database of interlinked manuscripts constructed, the Manuscript Comparator is able to obtain the differences among manuscripts by querying the database for the requested manuscripts and joining them to each other and the unified manuscript. The results are presented in either a parallel (side-by-side) or unified view, with words highlighted according to whether they are “inserted” or “deleted”. (Read the introduction for more information regarding the user interface.) The unified view will serve as the foundation for the upcoming tool which will allow contributors to link the semantic units between manuscripts and translations (see an old prototype), and thus the links between translations via their common links to the unified manuscript. With such semantic links between translations in place, a Translation Comparator application will be possible which compares not the forms of the words in the translations (as is easily done today) but rather one which actually compares the translations based on their manuscript sources. For example, comparing the English King James version with the Spanish Reina Valera version would result in very few differences (if any) since they both rely on the Textus Receptus. Additionally, with the semantic links in place, it will also be possible to compute the degree to which any translation relies on one manuscript over another.
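The comparison and the "degree of reliance" computation could be sketched in Python as follows. This is an in-memory stand-in for the database joins, not the Manuscript Comparator's actual query; the function names compare and reliance, the list-of-indexes link format, and the sample data (normalized words with one invented variant) are all assumptions made for illustration.

```python
def compare(unified, links_a, links_b):
    """Label each unified word by which of two manuscripts attest it.

    links_a and links_b hold the unified-manuscript indexes that each
    manuscript's tokens link to (hypothetical format for this sketch).
    """
    in_a, in_b = set(links_a), set(links_b)
    rows = []
    for index, word in enumerate(unified):
        if index in in_a and index in in_b:
            status = "shared"
        elif index in in_a:
            status = "deleted"   # present in A, absent from B
        elif index in in_b:
            status = "inserted"  # absent from A, present in B
        else:
            continue             # attested only by other manuscripts
        rows.append((word, status))
    return rows


def reliance(translation_links, manuscript_links):
    """Share of a translation's linked unified tokens that a given
    manuscript also attests (0.0 to 1.0)."""
    linked = set(translation_links)
    if not linked:
        return 0.0
    return len(linked & set(manuscript_links)) / len(linked)


# Sample data: normalized unified words, with manuscript B lacking the article.
unified = ["εν", "αρχη", "ην", "ο", "λογοσ"]
links_a = [0, 1, 2, 3, 4]
links_b = [0, 1, 2, 4]
translation_links = [0, 1, 2, 3, 4]  # a translation that renders the article

for word, status in compare(unified, links_a, links_b):
    print(word, status)
print(reliance(translation_links, links_a))
print(reliance(translation_links, links_b))
```

Since every text links only to the unified manuscript, comparing any two of them needs no direct pairwise links, and the same reliance function would work unchanged for a translation in any language.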

The applications possible with this data are really exciting. Open Scriptures aims not only to be a “comprehensive open-source Web repository for integrated scriptural data,” but also “a general application framework for building internationalized social applications of scripture” which present data “in a translation-neutral and internationalized manner so as to be accessible to the community no matter what language they speak or version they prefer.” Inspiration for this framework comes from the Facebook Platform which provides an API enabling web developers to create applications powered by Facebook’s social network data. What if we had a similar platform and framework which enabled web developers to easily build applications which are powered by interlinked scriptural data? What if these applications were hosted on the Cloud as with Google App Engine? These ideas about a scriptural web application platform have really been exciting me, but they haven’t started cooking yet. The ingredients are only just now being gathered… please join me!