12 Sep 2012

HathiTrust UnCamp

"Correction Rules!"

Back from a two-day workshop at Indiana University, run by the HathiTrust Research Center folks. HathiTrust is, loosely, a consortium of universities and research libraries in the US that contributed volumes for digitization in the Google Books project. Though these volumes are all available on books.google.com (at least the out-of-copyright ones), HathiTrust exists to ensure that duplicate copies are held by the member libraries in perpetuity, in case Google isn't around in a few decades.

The attraction to literary folks — or at least those of us with an interest in data mining — is obvious: tens of millions of books, all digitized in the space of a few years. The tricky question has always been: how do we get access to them, and what kind of algorithms can we run on a corpus of this scale?

Corpus stats

For small-scale projects in the past, many of us were content to build up infrastructure at our local institutions: a big server here, a metadata database there… I set up such systems when I was at UCLA to work on the 19th Century Nordic-language corpus. This works fine for a few hundred or even a few thousand books, but makes no sense for projects at the million-book scale.

So instead, the future of large-scale text mining may look something like this:

Running against the HathiTrust corpus

The screenshot above shows me doing a word count over a batch of Norwegian-language texts. The first item, a stray hyphen, is an artifact of the beta-quality tokenizing code, but the rest are the Norwegian words for it, I, and, so, etc. This is a hybrid model: a Python script goes out and fetches the zipped objects, decompresses them, and then does the word counting on a local machine. Most users will likely combine such local analytics with large-scale (and somewhat less frequently run) examinations of big chunks of the collection. That's what the picture below shows off: visualizations from the Mellon-funded SEASR/MEANDRE toolkit.

Epic Datawall
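
To make the local half of that hybrid workflow concrete, here is a minimal sketch of what such a word-counting script might look like. The batch URL, the archive layout, and the regex tokenizer are all placeholders of my own, not the actual HathiTrust data API or the HTRC's tokenizing code:

    import io
    import re
    import urllib.request
    import zipfile
    from collections import Counter

    # Hypothetical endpoint serving a zip of plain-text volumes; the real
    # HathiTrust data API and volume packaging differ.
    BATCH_URL = "https://example.org/corpus/norwegian-batch-001.zip"

    def count_words(batch_url):
        # Fetch the zipped batch and hold it in memory.
        with urllib.request.urlopen(batch_url) as resp:
            archive = zipfile.ZipFile(io.BytesIO(resp.read()))
        counts = Counter()
        for name in archive.namelist():
            text = archive.read(name).decode("utf-8", errors="replace")
            # Naive tokenizer: lowercase runs of Norwegian letters. Real
            # tokenizing code has to contend with OCR hyphenation, which
            # is where stray hyphens like the one in the screenshot come from.
            counts.update(re.findall(r"[a-zæøå]+", text.lower()))
        return counts

    if __name__ == "__main__":
        for word, n in count_words(BATCH_URL).most_common(10):
            print(word, n)

Pointed at a real batch endpoint, this would print the ten most frequent tokens, hyphen artifacts and all.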

But regardless of the implementation details, we're on the cusp of being able to do truly interesting things with out-of-copyright works from the 19th and early 20th Centuries. All we need is for groups like the HathiTrust to navigate a very treacherous landscape of eager literary folks and suspicious publishing industry lawyers. If they succeed, we could derive real insight into all the cultural output that's been preserved for centuries, and digitized quite suddenly in the span of my own grad school career.

Occupy the Cyberinfrastructure Building
