Open Scriptures

Bible Organisational System

Most people I know think of the Bible as 66 books: an Old Testament of 39 books (Genesis to Malachi) and a New Testament of 27 books (Matthew to Revelation). If only life were so simple! In fact, most of us already know that published Bibles don’t always fit that pattern, because the Gideons (here in New Zealand, at least) distribute small pocket-sized Bibles containing the New Testament along with Psalms and Proverbs, so that’s a different combination of books.

But oh, that doesn’t even scratch the surface of how Bibles may be packaged! Luther considered the letter of James to be the “epistle of straw” so he placed it nearer the back of his Bible. So the book order is not consistent. Anyway, the book should really be called Jacob, not James. (It’s Greek ἰάκωβος = Iakobos and German Jakobus.) And by the way, Germans don’t use the names Genesis, Exodus, Leviticus, Numbers, and Deuteronomy (or even the German equivalent of those words); they call them the first to fifth books of Moses. (And the Hebrews have quite different names for them again, derived from the first word(s) in each of the five books.)

Roman Catholic Bibles typically include a Deutero-canon (or secondary canon, called the Apocrypha by most Protestants) which contains several more books beyond the sixty-six mentioned above.

Some traditions (e.g., John Wycliffe’s English Bible) include the Letter to the Laodiceans in their New Testament.

Older Hebrew Scriptures may contain as few as 22 books, even though they effectively have the same content as the 39-book Old Testament, because they combine things like 1 and 2 Samuel, and the twelve minor prophets, into single books.

Some Bibles omit verses that they say are not in the most ancient manuscripts so they might go directly from verse 20 to verse 22 in a certain chapter. (See if your Bible contains Matthew 17:21, for example.) Other Bibles add extra chapters and verses which they get from the ancient Greek translations of the Hebrew books. And some of those chapters may be labelled with letters (like A, B, and C) rather than numbers.

And talking about chapter and verse numbers, they are not found in the original manuscripts but were added at different points in history. And different people in different cultures divided the texts in different ways. So, for example, in the Psalms in a Hebrew Bible, the text “A Psalm of David” might be considered verse 1, and what I’ve always thought of as verse 1 is called verse 2.
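To make the Psalms example concrete, here is a minimal sketch of how an application might convert between the two numbering schemes. The function names and the two-entry offset table are hypothetical illustrations and are not taken from the BibleOrgSys data files. Real versification differences are far less regular than a single offset per psalm.

```python
# Hypothetical fragment of an offset table: psalm number -> number of
# Hebrew verses taken up by the superscription (title). Psalms not listed
# are assumed to be numbered identically in both schemes.
TITLE_OFFSETS = {3: 1, 51: 2}


def english_to_hebrew(psalm, verse):
    """Map an English-style verse number to the Hebrew-style number."""
    return verse + TITLE_OFFSETS.get(psalm, 0)


def hebrew_to_english(psalm, verse):
    """Map a Hebrew-style verse number back to the English-style number.

    Verses that belong to the title itself are reported as verse 0.
    """
    return max(verse - TITLE_OFFSETS.get(psalm, 0), 0)


# Psalm 3: what English readers call verse 1 is verse 2 in a Hebrew Bible.
print(english_to_hebrew(3, 1))
print(hebrew_to_english(3, 2))
```

A lookup table like this only works for whole-verse shifts. Splits, merges, and moved verses would need a richer mapping.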

Modern translations tend to divide the texts into paragraphs, and these breaks (and their corresponding section headings) also might be placed in different locations according to the best judgement of the various translators.

Many modern Bibles contain cross-references, e.g., a quote in John 1 might refer back to Genesis 1. But the code “Gen. 1:1” might look different in another language where they use a different name for the first book of the Hebrew Scriptures, and a different punctuation character for the chapter/verse separator, so it might have “1 Mos. 1.1” in the cross-reference instead. Also, letter suffixes may sometimes be used in Scripture references, like “Mat. 28:3a”.
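As a rough illustration of the problem, the sketch below parses the three reference styles mentioned above into one internal form. The `BOOK_CODES` table, the internal codes, and the function name are all assumptions made up for this example. They are not the codes or the API used in BibleOrgSys, and a real system would load per-language abbreviations and separators from data files.

```python
import re

# Hypothetical lookup from localized abbreviations to an internal book code.
BOOK_CODES = {"Gen": "GEN", "1 Mos": "GEN", "Mat": "MAT"}

REFERENCE = re.compile(
    r"^(?P<book>(?:\d\s)?[^\W\d]+)\.?\s+"  # abbreviation, optional leading digit, optional period
    r"(?P<chapter>\d+)[:.,]"               # chapter number, then a ':', '.' or ',' separator
    r"(?P<verse>\d+)(?P<suffix>[a-z]?)$"   # verse number and optional letter suffix
)


def parse_reference(text):
    """Return (book_code, chapter, verse, suffix) for a simple reference."""
    match = REFERENCE.match(text.strip())
    if match is None:
        raise ValueError("Unrecognised reference: %r" % text)
    book = match.group("book")
    if book not in BOOK_CODES:
        raise ValueError("Unknown book abbreviation: %r" % book)
    return (
        BOOK_CODES[book],
        int(match.group("chapter")),
        int(match.group("verse")),
        match.group("suffix"),
    )


for ref in ("Gen. 1:1", "1 Mos. 1.1", "Mat. 28:3a"):
    print(ref, "->", parse_reference(ref))
```

The English and German-style references resolve to the same internal tuple, which is the point of keeping the book code separate from whatever name or punctuation a given language uses.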

Most of the hundreds of available Bible software applications have had to address these issues, and most of them have developed their own unique in-house systems to handle them. In many cases, especially for those of us in the often monolingual English-speaking world, the cross-cultural ramifications weren’t really understood at the beginning and so extensions were tacked on later to try to handle these international complications, sometimes leading to systems lacking in elegance or universal application.

I am going to slowly feed my work into the Open Scriptures BibleOrgSys repository over the next few weeks. The essence of my work can be found in XML data files, but I’ve also produced Python scripts that test and exercise (and even export) the XML data to demonstrate how an application might use the data files.

So this work is my attempt to develop a system that is multilingual, multinational, and multicultural from the beginning—to pull all of these various Bible organisational systems into one place. And although I’m an English-speaking Protestant Christian, it tries not to assume that every Bible-related publication in the world is based on any of those particular characteristics. And then it is made freely available to be used by others, and even to be extended by others when areas I didn’t cover adequately are discovered.

For those with experience in this area, I look forward to your comments, and even better still, suggested improvements. I will label the submitted files V0.5 to suggest that I might have done half of the research and work in setting this up, but there’s plenty of room for your input and wider experience to take us to V1.0. You can view my work on Bible book codes at https://github.com/openscriptures/BibleOrgSys.

BibleTech:2011 Talk: “Distributed model for interlinked text development”

Here is the talk proposal I submitted for presenting at BibleTech:2011 which is going to be March 25-26 in Seattle:

Open Scriptures API: Distributed model for interlinked text development

In the continuing efforts at Open Scriptures to produce an open platform for the development of scriptural data and its applications, work continues on the most fundamental layer: the representation of the scriptural text. This talk will look at a normalized way to store text in a relational database as tokens and structures, and also how such a text can be versioned as it undergoes continuous improvements (e.g. making a translation or creating a critical text). As the text can exist in multiple revisions/editions, it can also be branched so as to introduce variant or alternate readings; it can also be forked to introduce a new derivative work linked back to the original. (Much of this will be applying lessons from the Git distributed version control system.) In addition to links between text editions, the representation of links between different texts (interlinearization) will be examined at the level of a single scripture server and also in the context of a distributed network of scripture servers, all of which have texts that are potentially undergoing development. The talk will also consider how the interconnections between the texts can be maintained to enable scriptural linked data.

This is a continuation of last year’s talk, Open Scriptures API: Unified Web Service for Scriptural Linked Data.

Cryptographic hashes and RESTful URIs

In a recent post to the Open Scriptures mailing list, it was suggested that we use md5 (or another cryptographic hash) to generate unique IDs for each token (a “token” is the fundamental unit of text (most often a word) in our API database models). Today we discussed the implementation of this on IRC, and it was fairly stimulating.

First of all, MD5 is broken and deprecated: collisions (two different pieces of data that produce the same hash) can be constructed deliberately. An accidental collision among a few million tokens would be extremely unlikely even with MD5, but we decided not to test our luck. SHA-256 has no known collisions, so we decided it was best to use that algorithm.
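For a sense of scale, here is a back-of-the-envelope sketch using the standard birthday-bound approximation. It only covers accidental collisions between random-looking hashes, not the deliberately constructed collisions that make MD5 unsafe. The figure of ten million tokens is an assumed order of magnitude, not a number from the project.

```python
def collision_probability(num_items, hash_bits):
    """Approximate chance that any two of num_items random hashes collide.

    Uses the birthday-bound approximation (number of pairs) / 2**bits,
    which is accurate while the result is much smaller than 1.
    """
    pairs = num_items * (num_items - 1) // 2
    return pairs / 2 ** hash_bits


tokens = 10_000_000  # assumed: "millions of tokens"
print("MD5 (128 bits):    ", collision_probability(tokens, 128))
print("SHA-256 (256 bits):", collision_probability(tokens, 256))
```

Both probabilities are vanishingly small, so the case for SHA-256 rests on MD5's known weaknesses rather than on chance collisions.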

SHA-256 is implemented in Python’s standard library hashlib, so that is good. For example:

>>> import hashlib
>>> hashlib.sha256("Hello world!").digest()
'\xc0S^K\xe2\xb7\x9f\xfd\x93)\x13\x05Ck\xf8\x891NJ?\xae\xc0^\xcf\xfc\xbb}\xf3\x1a\xd9\xe5\x1a'

Needless to say, such a digest would not be very good for use in a RESTful URI scheme. So, hashlib also offers a hexadecimal option:

>>> hashlib.sha256("Hello world!").hexdigest()
'c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a'

That is still not the best, since that makes for a very long string. So, we have the option of using base64 encoding:

>>> import base64
>>> base64.b64encode(hashlib.sha256("Hello world!").digest())
'wFNeS+K3n/2TKRMFQ2v4iTFOSj+uwF7P/Lt98xrZ5Ro='

That is shorter, but it includes the “+” and “/” characters, which are a no-no for URI design. Luckily the base64 module includes a function for this exact purpose:

>>> base64.urlsafe_b64encode(hashlib.sha256("Hello world!").digest())
'wFNeS-K3n_2TKRMFQ2v4iTFOSj-uwF7P_Lt98xrZ5Ro='

So that is safe for URIs, but being case-sensitive and containing easily confused characters makes it awkward to work with. So, base32 to the rescue:

>>> base64.b32encode(hashlib.sha256("Hello world!").digest())
'YBJV4S7CW6P73EZJCMCUG27YREYU4SR7V3AF5T74XN67GGWZ4UNA===='

There you have it: shorter than hex, and easier to work with than base64. If we end up using cryptographic hashes as token unique IDs, I’m pretty sure this is how we’ll do it.
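The interactive examples pass a plain string straight to sha256, which Python 3 rejects because it requires bytes. A minimal Python 3 sketch of the same idea follows. The `token_id` name, the choice of UTF-8, and stripping the trailing “=” padding are my own assumptions for illustration, not decisions the project has made.

```python
import base64
import hashlib


def token_id(text):
    """Return a base32-encoded SHA-256 digest of the UTF-8 bytes of text."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    # b32encode returns bytes; decode to str and drop the '=' padding,
    # which carries no information for a fixed-length digest.
    return base64.b32encode(digest).decode("ascii").rstrip("=")


print(token_id("Hello world!"))
```

Without the padding, the ID is 52 characters long, compared with 64 for the hexadecimal form.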

The discussion around using cryptographic hashes as unique identifiers in our models is ongoing. Essentially we need to decide how best to make a unique hash for each token. Please join us in #openscriptures on irc.freenode.net with any input.

Morphological Hebrew Bible Version 1.0

We are pleased to announce the availability of the Open Scriptures Morphological Hebrew Bible version 1.0 on the CrossWire module servers. This new Hebrew Bible module is based on the Open Scriptures MorphHB project and contains Strong’s tagging to allow for easy dictionary lookup. Our current effort is to prepare a system for collaboration on parsing the morphology of the text. If you have some facility with Hebrew and would like to contribute, follow this blog for further announcements. Note that the module in the CrossWire repository follows King James versification, while a separate module in the experimental repository has the original versification of the Leningrad Codex.