Unified Standard Format Markers are a system of short alphanumeric markers (which follow backslashes in text files) used by UBS and SIL to encode Bible translations. Each book is usually stored in a separate file. (In floppy disk days, it was each chapter.) USFM is mostly used by the Paratext editor for Bible translations in progress.
The XML data file and scripts uploaded today are the simple beginnings of a framework to allow access to folders containing USFM texts. USFMMarkers.py loads and converts the XML file containing USFM codes, and provides some simple look-up functions. USFMFilenames.py accepts a folder name and searches for USFM files in that folder. All very simple, but hopefully another small step forwards.
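To make the backslash-marker idea concrete, here is a minimal sketch of splitting one line of a USFM file into its leading marker and the text that follows. It is not taken from USFMMarkers.py: the function name `split_usfm_line` and the regular expression are my own assumptions, and the pattern only covers simple markers (letters, optional digits, optional closing asterisk).

```python
import re

# Hypothetical helper for illustration only; not the code in USFMMarkers.py.
# Assumes a marker is a backslash, then lowercase letters, optional digits,
# and an optional '*' for closing markers.
MARKER_RE = re.compile(r"\\([a-z]+[0-9]*\*?)")


def split_usfm_line(line):
    """Return (marker, text) if the line starts with a marker, else (None, line)."""
    match = MARKER_RE.match(line)
    if match is None:
        return None, line
    return match.group(1), line[match.end():].strip()


sample = "\\id GEN\n\\c 1\n\\v 1 In the beginning"
for sample_line in sample.splitlines():
    print(split_usfm_line(sample_line))
```

A real loader would then look each marker up in the table read from the XML file to find out what kind of content it introduces.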
Well, now we’re into a little bit of fun stuff: trying to work out how to handle the user input that selects a certain book out of a certain publication. For each language, we have an XML data file which specifies the defaultName and defaultAbbreviation for all the books, e.g., Genesis (Gen), Jude (Jde), etc.
We can use those fields for display and for keyboard input, and we have included a simple routine to create a list of all unambiguous shortcuts, e.g., we could accept Ge and also know that the user meant Genesis because no other (English) Bible book starts with Ge. But we also would like to accept Gn. So we add an extra inputAbbreviation field, where we can add things like Gnss. Again, the program will create a list of unambiguous shortcuts from this information (which would include Gns and Gn).
But why not also automatically include G as a shortcut for Genesis? Well, the problem is that G could equally stand for Galatians. But if we specified that our particular publication only contains Old Testament books, then this additional piece of information would allow the software to automatically include G as an unambiguous shortcut for Genesis. And of course, you could override by entering G as an inputAbbreviation for Genesis anyway if you desire, even for a complete Bible publication.
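The shortcut idea can be sketched in a few lines of Python. This is only an illustration under my own assumptions, not the routine in the repository: the function name `unambiguous_shortcuts`, the dictionary layout, and the three-book sample data are all hypothetical. Every prefix of every accepted name is collected, and a prefix is kept only if exactly one book owns it.

```python
from collections import defaultdict


def unambiguous_shortcuts(names_by_code):
    """Map each lower-cased prefix owned by exactly one book to that book's code.

    names_by_code: dict of book code -> list of accepted names and abbreviations
    (hypothetical layout, chosen for this sketch).
    """
    owners = defaultdict(set)
    for code, names in names_by_code.items():
        for name in names:
            key = name.lower()
            for length in range(1, len(key) + 1):
                owners[key[:length]].add(code)
    return {prefix: next(iter(codes))
            for prefix, codes in owners.items() if len(codes) == 1}


# Illustrative data only: a tiny subset of books, with "Gnss" standing in
# for an extra inputAbbreviation on Genesis.
books = {
    "GEN": ["Genesis", "Gen", "Gnss"],
    "GAL": ["Galatians", "Gal"],
    "JDE": ["Jude", "Jde"],
}
shortcuts = unambiguous_shortcuts(books)

# Restricting the publication to Old Testament books removes Galatians,
# so "g" becomes unambiguous.
old_testament_only = unambiguous_shortcuts({"GEN": books["GEN"]})
```

With the sample data, `ge`, `gn` and `gns` all resolve to Genesis, while `g` is rejected because Galatians also claims it. Once the book list is restricted, `g` is accepted.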
Ok, that’s fairly simple then. But what about 1 Timothy? It would be nice to accept 1Tim (without a space), or I Tim (using Roman numerals), 1st Tim, etc. Well the information to do this is also put in the XML file under the section BibleBooknameLeaders. This specifies everything that you will accept instead of the 1, e.g., I, First, One, etc. The software automatically handles all the various combinations for you — with and without the intervening spaces.
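Here is a rough sketch of how the leader substitution could work. The function name `expand_leaders` and the shape of the `leaders` dictionary are assumptions made for this example, not the structure used by the real code or the XML file.

```python
def expand_leaders(name, leaders):
    """Return every accepted spelling of a numbered book name.

    name: a default name such as "1 Tim", where the first token is the number.
    leaders: dict of number -> alternatives that may replace it
    (hypothetical layout, chosen for this sketch).
    """
    number, _, rest = name.partition(" ")
    if number not in leaders or not rest:
        return {name}  # not a numbered book, so nothing to expand
    variants = set()
    for leader in [number] + leaders[number]:
        variants.add(f"{leader} {rest}")  # with the intervening space
        variants.add(f"{leader}{rest}")   # without it
    return variants


leaders = {"1": ["I", "1st", "First", "One"]}
print(sorted(expand_leaders("1 Tim", leaders)))
```

The resulting set could then be fed into the same unambiguous-shortcut logic as the plain names.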
And then the third and final (but first in the file) section of the XML file is for BibleDivisionNames. You might want to limit a search, for example, to the Old Testament or the Pentateuch. So in this section we can specify our division names and abbreviations, and a list of the books which would be included in each division.
The XML filename includes the ISO 639-3 language code. But it also includes a qualifier, so you could have something like eng_traditional and eng_modern. Why? Well, what if you wanted to call the final book of your publication Vision instead of Revelation? It’s still English but a different system and we would need to know how you want that different bookname displayed. (Although that example is fictitious, in the Philippines where I worked it’s actually common to nickname Bible versions by the different names that the translators gave to the book of Revelation.)
So look through the data files at https://github.com/openscriptures/BibleOrgSys/tree/master/DataFiles/BookNames and tell me what other pieces of information I should still have in there. Note that I didn’t include extended book names (like The Second Epistle of the Apostle Paul to the Church at Corinth) as it seems to me that they only need to be specified in the actual Biblical material itself. I took a stab at creating basic files for French and German also, even though I don’t speak those languages, so I’m sure they’ll need some fixing. Note also that specifying the books included in a division (such as Old Testament or Pauline Letters) here does seem a little out of place (maybe it would seem more logical in the book orders data files), but since that kind of division information often relates to the cultural heritage, I’ve accepted it for now as a reasonable compromise.
There’s also some basic Python3 code to handle these data files at https://github.com/openscriptures/BibleOrgSys/tree/master/BibleBooksNames.py. It runs a brief demo so check it out and send me your suggestions and improvements.
Why do we need to specify the punctuation system for a Bible? Isn’t that sort of information already set on my computer in my locale? Well yes, the common punctuation conventions for your language might be defined in your locale, but there are three good reasons why we need to define our own system here:
- We want to handle not just our own language here, but also the languages of the original texts, perhaps including Hebrew, Aramaic and Greek, plus the languages of other translations, e.g., Latin.
- Even within one language, there might be differences. For example, many of the original texts didn’t have any punctuation (or even word breaks). So we want to be able to specify the punctuation for any publication.
- Our standard computer locales are not likely to include the punctuation standard for Bible references and other Bible-specific uses.
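As an example of the last point, here is a small sketch of parsing a Bible reference where the chapter/verse separator is supplied as data rather than hard-coded. The function `parse_reference` and its `chapter_verse_separator` parameter are hypothetical names for this illustration and do not correspond to fields in the XML files.

```python
import re


def parse_reference(text, chapter_verse_separator=":"):
    """Split a reference like 'Gen 1:1' into (book, chapter, verse), or return None.

    The separator is a parameter because conventions differ between
    publications, e.g. ':' in many English Bibles and ',' in many German ones.
    """
    pattern = re.compile(
        r"^\s*(?P<book>.+?)\s+(?P<chapter>\d+)"
        + re.escape(chapter_verse_separator)
        + r"(?P<verse>\d+)\s*$"
    )
    match = pattern.match(text)
    if match is None:
        return None
    return match.group("book"), int(match.group("chapter")), int(match.group("verse"))
```

Calling `parse_reference("Gen 1:1")` and `parse_reference("Joh 3,16", ",")` both succeed, while a reference written with the wrong separator for the chosen system is rejected.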
Today I’ve committed the first attempt at a simple XML file for specifying the punctuation for a Bible publication. I’ve labelled my attempts as V0.40 because I’m sure that there are going to be many more fields that we’ll need to add. And for that reason also, I haven’t yet tried to make data files for a large range of languages.
As we develop the system further, we’ll be needing ways to tell a computer program how to parse a digital text, i.e., how to determine all the different parts of the text. Another of my specific interests is checking Bible files to make sure they are consistent and ready for online or print publishing. So that requires specifying even more punctuation and formatting details.
So as already mentioned, the current files at https://github.com/openscriptures/BibleOrgSys/tree/master/DataFiles/PunctuationSystems are really only just a starting point to work from. Let me know what else we’ll need to add.
At some point, a Bible organisational system has to define both the actual books which are included in the publication, and the order in which they are presented. (As I pointed out in my introductory blog, both of these factors can vary considerably across cultures.)
I have found it better to separate this information from the versification information (about where chapter and verse breaks come). So the book order files are very simple — basically just an ordered list of book codes, e.g., GEN, EXO, … REV or whatever. Of course, many publications will have the same list of book contents and in the same order, so it makes reasonable sense to separate this information out so it can be easily reused.
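Because a book order is just an ordered list of codes, one small structure can answer both questions: is a book in the publication, and where does it come? The sketch below uses a heavily truncated, purely illustrative list, and the helper names are my own.

```python
# Illustrative, truncated order list; a real data file would list every book.
book_order = ["GEN", "EXO", "LEV", "NUM", "DEU"]

# Position of each book code within the order.
position = {code: index for index, code in enumerate(book_order)}


def contains_book(code):
    """Is this book part of the publication?"""
    return code in position


def sort_books(codes):
    """Sort book codes into publication order (raises KeyError for unknown codes)."""
    return sorted(codes, key=position.__getitem__)
```

Two publications with the same contents and order can simply share the same list.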
Today I committed some book order data (XML) files which I’m aware of, although they haven’t been tested out against real publications as well as I would like. Also, there might be reason to name some of the systems better before we take it to V1.0. And, of course, it is easy to add systems which I wasn’t aware of with this initial commit. So there’s plenty of room for other people to help with their expertise here.
This is the end of the easy stuff — everything gets a lot more complex from here on in, so my blogs and code additions are likely to become much more sparse.
Any Bible organisational system has to have a way to define languages. We’ll be dealing with Biblical languages like Hebrew, Aramaic, and Greek, as well as the many languages of translations.
Many systems have used 2-letter codes like ‘en’ or ‘fr’ for naming languages. But 2-letter codes have a maximum of 26 squared possibilities, which equals 676, a whole order of magnitude short of being able to represent the roughly 7,000 languages of the world. And since we want the Bible to reach all peoples of the world, we want to include their languages from the beginning.
So in a truly international system, we’ll need to use codes of at least 3 letters. (Yes, 17,576 codes should be enough!) And the ISO 639-3 standard, which came originally from the Ethnologue and is currently still administered by SIL, gives us that.
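The arithmetic is easy to check. The figure of 7,000 languages below is the round number used above, not an exact count.

```python
letters = 26
approx_languages = 7000  # round figure from the text, not an exact count

two_letter_codes = letters ** 2    # 676
three_letter_codes = letters ** 3  # 17576

print(two_letter_codes >= approx_languages)    # False: too few codes
print(three_letter_codes >= approx_languages)  # True: plenty of room
```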
Today I committed Python code to access the ISO 639-3 information. It’s just a little foundational step that’ll be needed later on.
[I'm not sure yet that it handles everything we need. For example, Americans and New Zealanders both speak English (code 'eng'), but we choose, pronounce, and spell many words differently. So we might need to extend this language code system sometime when we get into spell-checking and such.]