Cryptographic hashes and RESTful URIs

In a recent post to the Open Scriptures mailing list, it was suggested that we use md5 (or another cryptographic hash) to generate unique IDs for each token (a “token” is the fundamental unit of text (most often a word) in our API database models). Today we discussed the implementation of this on IRC, and it was fairly stimulating.

First of all, md5 is broken and deprecated, due to possible collisions (two different pieces of data can result in the same hash). Since we will be dealing with millions of tokens, we decided not to test our luck, unlikely though a problem may be. SHA-256 has no known collisions, so we decided it was best to use that algorithm.

SHA-256 is implemented in Python’s standard library hashlib, so that is good. For exapmle:

>>> import hashlib >>> hashlib.sha256("Hello world!").digest() '\xc0S^K\xe2\xb7\x9f\xfd\x93)\x13\x05Ck\xf8\x891NJ?\xae\xc0^\xcf\xfc\xbb}\xf3\x1a\xd9\xe5\x1a'

Needless to say, such a digest would not be very good for use in a RESTful URI scheme. So, hashlib also offers a hexadecimal option:

>>> hashlib.sha256("Hello world!").hexdigest() 'c0535e4be2b79ffd93291305436bf889314e4a3faec05ecffcbb7df31ad9e51a'

That is still not the best, since that makes for a very long string. So, we have the option of using base64 encoding:

>>> import base64 >>> base64.b64encode(hashlib.sha256("Hello world!").digest()) 'wFNeS+K3n/2TKRMFQ2v4iTFOSj+uwF7P/Lt98xrZ5Ro='

That is shorter, but it includes the “/” character, which is a no-no for URI design. Luckily base64 includes a function for this exact purpose:

>>> base64.urlsafe_b64encode(hashlib.sha256("Hello world!").digest()) 'wFNeS-K3n_2TKRMFQ2v4iTFOSj-uwF7P_Lt98xrZ5Ro='

So that is safe for URIs, but being case-sensitive and having ambiguous characters makes it not the best for working with. So, base32 to the rescue:

>>> base64.b32encode(hashlib.sha256("Hello world!").digest()) 'YBJV4S7CW6P73EZJCMCUG27YREYU4SR7V3AF5T74XN67GGWZ4UNA===='

There you have it: shorter than hex, and easier to work with than base64. If we end up using cryptographic hashes an token unique IDs, I’m pretty sure this is how we’ll do it.

The discussion around using cryptographic hashes as unique identifiers in our models is ongoing. Essentially we need to decide how best to make a unique hash for each token. Please join us in #openscriptures on irc.freenode.net with any input.

Comments

david

October 6th, 2010 at 6:28 pm

In the BibleTech 2010 video, the presenter stated that each token’s position is immutable for a given source text. In this case, you just need a tuple { source-text-id, token offset } to universally identify every position in a work. This tuple can be made RESTful with a URI scheme like http://openscriptures.org/tokens/the-name-of-the-work#TokenOffset. You only need *cryptographic* hashes if you need to assert facts about the provenance of a token. git, a popular open-source version control tool uses cryptographic hashes for universally identifying a changeset. There is a lot of information about how and why git made this design choice on Google.
david

October 6th, 2010 at 6:33 pm

The last thing that the Linked Data effort needs is one more unreadable URI scheme that doesn’t allow for machines or humans to reason about the contents behind it.

Subscribe to the comments feed.