
The Tagged Tanakh and Semantic Linking

Update: Please also read James Tauber’s parallel post, Thoughts on GNT-NET Parallel Glossing Project.

A few days ago I posted to the group regarding the Tagged Tanakh project, which, like Open Scriptures, is being presented at BibleTech at the end of this month. I had read their abstract a few months ago and was really interested, but their project was only unveiled last week. A month earlier they had teased:

Obviously, once the term “Web 2.0” was coined, Web 3.0 couldn’t be far behind. If Web 2.0 is about bringing individuals together via the Internet, then Web 3.0 is about bringing various sources of information together. At this point in time, no one has figured out a popular application of web 3.0 tools, but that doesn’t mean organizations aren’t trying. […]

The Tagged Tanakh is our effort to bring Torah into the Linked Data ecosystem that is emerging around us. Ensuring that our content is accessible and conforms to standardized and open structures is of paramount importance.

Reading this made me realize that Open Scriptures is a “Web 3.0” project and a Semantic Web initiative. Furthermore, YAVNET pointed to the article “Torah 2.0: Old-Line Publisher Brings Biblical Commentary Into Online World”:

The Jewish Publication Society, a 120-year-old organization devoted to publishing ancient and modern texts on Jewish subjects, has begun work on a project to publish the Jewish Bible, or Tanakh, as an electronic, online text, integrating the original Hebrew with JPS’s English translation and selected commentaries. But the most radical part of the project is an ambitious plan to make the text of the Tanakh into an open platform for users of all stripes to collectively erect their own structure of commentary, debate and interpretation, all linked to the text itself. The publishers hope that the project will radically democratize the ancient process of Talmudic disputation by bringing it into cyberspace.

[Image: Class linkages within the Linking Open Data datasets on Wikipedia]

This is a really exciting project! Notice how similar it sounds to Open Scriptures. The fundamental concept in common is Linked Data. It sounds like they have a plan for linking commentaries and user tags, but that they don’t yet know how they’re going to link their English translation with the Hebrew text, as the article continues (emphasis added):

Though JPS has a fairly clear concept of how it wants the project to turn out, there are major obstacles to navigate. JPS must figure out how to link across the original Hebrew and the English translation, how to screen users to ensure that their comments are appropriate, how to integrate existing software tools, and how to create an architecture flexible enough to incorporate new and unexpected functions that may not yet even exist.

I’m convinced that the best way to link translations with their source texts is via the granular interlinking of the smallest corresponding semantic units. If the individual semantic units (e.g. words) in two texts are all individually addressable (by being stored individually in a database), then a data structure can be constructed wherein a cluster of semantic units in one text can be linked to the smallest equivalent semantic unit cluster in another text. This data structure has been fleshed out in the Open Scriptures database schema, and the Manuscript Comparator application is powered by manuscripts that have gone through this semantic linking process (though it is a simple example, since all the manuscripts are in one language, Greek).
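
To make this concrete, here is a minimal sketch in Python of how individually addressable semantic units and cluster-to-cluster links might be modeled; the class and field names are hypothetical, not the actual Open Scriptures schema:

```python
from dataclasses import dataclass, field

@dataclass
class SemanticUnit:
    """One individually addressable unit of a text (e.g. a word)."""
    id: int
    text_id: int   # which manuscript or translation the unit belongs to
    position: int  # the unit's order within its text
    content: str   # the word itself

@dataclass
class SemanticLink:
    """Links a cluster of units in one text to the smallest equivalent
    cluster of units in another text (a many-to-many relationship)."""
    id: int
    source_unit_ids: list[int] = field(default_factory=list)
    target_unit_ids: list[int] = field(default_factory=list)
```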

The concept underlying this semantic linking is the dynamic equivalence method of translation; to quote Wikipedia: “The dynamic (also known as functional equivalence) attempts to convey the thought expressed in a source text (if necessary, at the expense of literalness, original word order, the source text’s grammatical voice, etc.)”. When finding equivalences between two texts in different languages, there are many cases where a single word in one language does not correspond to a single word in the other. For example:

“I like to study.” ↔ “Me gusta estudiar.”

The Spanish way of expressing “to like” is the idiom “to be pleased by”, so the equivalent of “I like” is “Me gusta” (a two-for-two correspondence). Likewise, to a lesser degree, “to study” and “estudiar” are both infinitives, but Spanish infinitives are formed with a suffix instead of with a particle as in English. These sentences serve as a simple example, but consider also very free translations like The Message, which may be so paraphrastic that individual words and phrases cannot be linked at all, but rather only whole sentences and paragraphs: the freer a translation, the longer the smallest common units of meaning; the more literal (word-for-word) a translation, the shorter the smallest common units of meaning.
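
In terms of data, the sentence pair above decomposes into two cluster-to-cluster links (plain Python, illustrative only):

```python
# Hypothetical links for "Me gusta estudiar." / "I like to study.":
# each pair joins the smallest equivalent word clusters of the two texts.
links = [
    (["Me", "gusta"], ["I", "like"]),  # 2-to-2: idiom "to be pleased by" <-> "to like"
    (["estudiar"], ["to", "study"]),   # 1-to-2: suffix infinitive <-> particle + verb
]
```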

[Image: SchemaGraph of the Open Scriptures database]

Therefore, for there to be a data structure that represents the semantic equivalences between these two sentences, there must be a many-to-many relationship that links individual semantic units for each smallest unit of corresponding meaning in the two texts. In the Open Scriptures database schema, the following entities are employed (see SQL and SchemaGraph):
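
As a rough sketch of how such a many-to-many relationship might be laid out relationally (using Python’s sqlite3; all table and column names here are hypothetical, so see the linked SQL for the actual entities):

```python
import sqlite3

# Illustrative DDL only; not the real Open Scriptures schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE semantic_unit (
    id       INTEGER PRIMARY KEY,
    text_id  INTEGER NOT NULL,  -- the manuscript or translation
    position INTEGER NOT NULL,  -- order within the text
    content  TEXT    NOT NULL   -- the word itself
);

CREATE TABLE semantic_link (
    id INTEGER PRIMARY KEY      -- one row per smallest common unit of meaning
);

-- Join table realizing the many-to-many relationship: each link gathers a
-- cluster of units from each text, so clusters (not just single words) can
-- be placed in correspondence.
CREATE TABLE semantic_link_unit (
    link_id INTEGER NOT NULL REFERENCES semantic_link(id),
    unit_id INTEGER NOT NULL REFERENCES semantic_unit(id),
    PRIMARY KEY (link_id, unit_id)
);
""")
```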

Now, with regard to the Manuscript Comparator, a shortcut is being taken: since monolingual manuscripts are being linked together, the semantic links are always one-to-one (so the above schema entities aren’t required), and furthermore the links may be constructed simply with an algorithm that compares normalized strings of Greek text. However, to link texts in different languages, as the Open Scriptures and Tagged Tanakh projects are seeking to do, much more sophisticated techniques will be necessary to create the semantic links. Instead of creating complex natural language processing algorithms (as Google Translate has done), the most straightforward way of creating the semantic links (in the spirit of Web 2.0) is to utilize collective intelligence. Greek and Hebrew students from around the world could flex their language muscles by making passes over the manuscripts and constructing semantic links with their favorite translations in their native languages. (Imagine if language professors assigned such a task as homework.) Each time someone makes a semantic-linking pass, the semantic links would improve as collective intelligence causes mistakes and bad data to become statistical outliers (and thus be ignored), as sketched below.
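
Here is a minimal sketch of how such passes might be reconciled; the representation of links and the agreement threshold are assumptions, not the Open Scriptures implementation:

```python
from collections import Counter

def consensus_links(passes, min_agreement=0.6):
    """Keep only links proposed by enough passes; one-off mistakes
    become statistical outliers and are ignored."""
    votes = Counter(link for p in passes for link in set(p))
    threshold = min_agreement * len(passes)
    return {link for link, count in votes.items() if count >= threshold}

# Each pass proposes links as pairs of unit-id tuples (source ids, target ids).
passes = [
    {((1, 2), (10,))},
    {((1, 2), (10,))},
    {((1, 2), (10,)), ((3,), (99,))},  # the second link is a one-off error
]
print(consensus_links(passes))  # {((1, 2), (10,))}
```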

As I’ve mentioned previously, an old semantic linker prototype is available which serves as a simple proof of concept for how collective intelligence can interlink texts.

Kyle Biersdorff and I have talked about this a bit, and he wisely pointed out that there needs to be a way of storing semantic links that indicate not just equivalence but also variance. As we’ve discussed here regarding the linking of the Masoretic Text with the LXX, there are many places where the LXX translates the Hebrew loosely; it conveys some of the same meaning but also includes additional nuances. Take, for example, Isaiah 7:14 (ESV):

Therefore the Lord himself will give you a sign. Behold, the ____ shall conceive and bear a son, and shall call his name Immanuel.

In the blank, the Hebrew word (עלמה, almah) means “young woman” or “maiden”, whereas the LXX word (παρθένος, parthenos) means “virgin”. This is a case where the LXX meaning is more specific than the Hebrew; I’m sure there are instances of the reverse too, where the LXX’s meaning is more generic. (It would be great to compile a list of these semantic variance link types.) In such cases, perhaps each link could carry an equivalence rating from 0.0 to 1.0, where 1.0 means an exact translation and 0.0 means complete variance; for virgin/maiden, perhaps a rating of 0.75 would be appropriate. When multiple people make passes over the data and provide ratings like this, we could then average them out to get a rating that is validated by collective intelligence.
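
For instance, a trimmed mean would let the collective rating absorb many passes while ignoring stray values (a sketch; the trimming fraction is an arbitrary assumption):

```python
def consensus_rating(ratings, trim=0.2):
    """Average equivalence ratings (0.0 = complete variance, 1.0 = exact),
    dropping a fraction of the extreme values at each end as outliers."""
    ordered = sorted(ratings)
    k = int(len(ordered) * trim)
    kept = ordered[k:len(ordered) - k] or ordered
    return sum(kept) / len(kept)

# e.g. five passes rating the parthenos/almah link:
print(consensus_rating([0.75, 0.7, 0.8, 0.75, 0.2]))  # ≈ 0.733; the stray 0.2 is trimmed
```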

I’m going to contact the Tagged Tanakh project to see if we can partner and collaborate to avoid duplicating efforts.
