Back from a great trip to Tuscaloosa and the University of Alabama’s Digital Humanities Center. My first time on campus; the library is host to a great DH space and some interesting projects that touch on both local history and broader topics.
Just a sneak preview of a website I’m building. The idea is to have an algorithm read through a bunch of magazine articles, find place names, and map those places onto a city or region of a country. Then, when the user hovers over a place, a list of sentences appears on the right, showing the context for each occurrence. Each small red dot is a place mentioned in the text; larger blobs of color show concentrations of places.
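A minimal sketch of the context lookup, assuming a tiny hand-made gazetteer and a naive sentence splitter. The place list, the sample text, and the function name are all made up for illustration; this is not the site's actual code.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer; a real one would hold far more place names.
GAZETTEER = ["Hyde Park", "Loop", "Pilsen"]

def sentences_by_place(text, places=GAZETTEER):
    """Map each place name to the sentences that mention it."""
    # Naive splitter: break after ., ! or ? when followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    index = defaultdict(list)
    for sentence in sentences:
        for place in places:
            # Word boundaries keep "Loop" from matching inside "Loophole".
            if re.search(r"\b" + re.escape(place) + r"\b", sentence):
                index[place].append(sentence)
    return dict(index)

# Invented sample text, standing in for a magazine article.
sample = ("The band played in Hyde Park last week. Then it moved to the Loop. "
          "Nobody went to Evanston.")
contexts = sentences_by_place(sample)
for place, hits in contexts.items():
    print(place, "->", hits)
```

A place that is missing from the gazetteer (Evanston, here) is simply never found, which is the main weakness of a list-based approach.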

I was curious what it would look like to take this Rand McNally map of Chicago public transportation networks circa 1926 and overlay it on Google’s modern aerial photography of the city.

When I first came to Hyde Park in 1993, there was an El track along 63rd street. That part of the Green Line was torn down by the time I left in 1997.

63rd is certainly lighter and more open, but the demolition of the overhead tracks has hardly spurred economic development. The colored lines of the 1926 map hover over a mostly-vacant streetscape of today.

I did this experiment with ArcGIS (for rectification) and GeoServer (for serving the map tiles into Google Earth).
We just wrapped up the 2012 Chicago Colloquium on Digital Humanities & Computer Science (DHCS). Below is a picture from one of my favorite poster sessions, a team working on Classical Greek & Latin textual applications for both iOS and Android:
More pictures from the conference are online here.
I’ve been doing some work recently on the 15th-century diary of Luca Landucci, a Florentine apothecary who kept a detailed diary for over half a century.
The text we’re working with has surely been normalized to a degree, but still represents a departure from contemporary Italian. This has an impact on text preparation — the process of stoplisting, or stripping out less-meaningful words to leave behind really interesting terms and phrases for analysis. Normally I just run some utilities that list the most-frequently-used words in a corpus (such as “the” in English) and use the results as a starting point for a stoplist of words to ignore. But I noticed in the case of Landucci’s diary that he begins almost every entry with a sentence that contains the name of a month. Because these months are really more metadata than data — a kind of in-line datestamp — I figured it would make more sense to strip them out and so set about adding the Italian names for all twelve months into the stoplist. But after re-running my analysis, I was confused to see the months still popping up. The answer, of course, was that Luca wrote in a kind of Tuscan dialect, which differs in small ways from modern Italian. I stoplisted “settembre”, but Luca’s “settenbre” made it through the filter intact, along with “giennaio” (modern “gennaio”). So one by-product of this project will be a custom stoplist for Luca’s orthography — perhaps broad enough to function on other Tuscan text of the 1400s, perhaps not.
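A minimal sketch of that stoplist step, assuming simple lowercase word tokens. The function names are hypothetical, and the variant list holds only spellings that appear in the diary excerpts quoted in this post; a usable list would need many more.

```python
import re
from collections import Counter

# Modern Italian month names.
MODERN_MONTHS = {
    "gennaio", "febbraio", "marzo", "aprile", "maggio", "giugno",
    "luglio", "agosto", "settembre", "ottobre", "novembre", "dicembre",
}
# Luca's spellings seen so far; deliberately incomplete.
LANDUCCI_VARIANTS = {"settenbre", "giennaio", "febraio"}

STOPLIST = MODERN_MONTHS | LANDUCCI_VARIANTS

def tokenize(text):
    """Lowercase the text and return runs of letters and digits."""
    return re.findall(r"\w+", text.lower())

def content_words(text, stoplist=STOPLIST):
    """Return the tokens that are not on the stoplist."""
    return [word for word in tokenize(text) if word not in stoplist]

def top_words(text, n=10, stoplist=frozenset()):
    """List the n most frequent tokens, as a starting point for a stoplist."""
    counts = Counter(word for word in tokenize(text) if word not in stoplist)
    return counts.most_common(n)

# Invented sample line, not a quotation from the diary.
sample = "e a dì 3 di settenbre e a dì 9 di giennaio"
print(content_words(sample, MODERN_MONTHS))  # modern list only: variants slip through
print(content_words(sample))                 # with Luca's variants added
```

Running `top_words` again after each addition to the stoplist is a quick way to spot the next spelling that slipped through.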
Another part of the preparation process was to split up the diary into individual days. Since the text I received was not marked up in any kind of TEI or XML, I had to figure out how to do this myself. Luckily, Luca was remarkably consistent, starting every entry with “E a dì 8 aprile 1498…”, “E a dì 16 di febraio 1495”, “E a dì 7 detto”, and the like. I sliced apart the diary into chunks based on this pattern, using the csplit utility from GNU coreutils:
csplit -k -f landucci_chunks/$i -z -b _%05d landucci.txt '/E a dì /' '{*}'
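For anyone without csplit, roughly the same split can be sketched in Python. This is an approximation under one stated assumption: it only treats “E a dì ” as an entry start when it begins a line, whereas the csplit pattern matches the phrase anywhere in a line. The function name and sample text are made up.

```python
import re

# Assumption: an entry starts where a line begins with "E a dì ".
ENTRY_START = re.compile(r"^E a dì ", flags=re.MULTILINE)

def split_entries(text):
    """Cut the text into chunks, each starting at an entry heading."""
    starts = [match.start() for match in ENTRY_START.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any front matter before the first entry
    bounds = starts + [len(text)]
    chunks = [text[a:b] for a, b in zip(bounds, bounds[1:])]
    # Like csplit's -z flag, drop chunks that are empty.
    return [chunk for chunk in chunks if chunk.strip()]

# Invented sample; only the headings follow the diary's pattern.
sample = "Proemio\nE a dì 8 aprile 1498 primo\nseguito\nE a dì 7 detto secondo\n"
chunks = split_entries(sample)
sizes = [len(chunk.encode("utf-8")) for chunk in chunks]
print(len(chunks), sizes)
```

The `sizes` list gives the byte length of each chunk, which is enough to find very long or very short entries.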
This left me with about 1,600 individual entries of relatively uniform size. There were, of course, outliers: one epic entry approached 12k, and several dozen entries were only a few words long, a kind of Renaissance Twitter stream. I might consider going back and removing the very large diary entry, or splitting it into a few chunks, at a later stage. But in general, I was happy with the size distribution of the individual entries. As with so much else in human culture, the varying length of Luca’s writing follows a power-law curve:

Running Mallet’s topic modeling code on the resulting files was my next step. I chose twenty topics to begin with, since it seemed like a reasonable first guess.

I was pleasantly surprised with the cohesion of the results — some interesting patterns included a topic about that most famous Florentine family, the Medici:

Also nice to see was a topic I’ve (provisionally) labeled “Economics”, which goes into the intricacies of taxation, inflation and commodity pricing:

This is just the first cut of the data. I want to refer to a printed edition and figure out how many entries there “should” be, to see if it’s anywhere near the 1,580 that my chunking algorithm produced. And it would be nice to assign an ISO-format date (like 1459-12-21) to each entry, so that we could graph topic saturation over time. This might let us see how certain topics, such as economics, waxed and waned as a matter of concern for Luca. But even at this early stage, I think this project reinforces the appropriateness of diaries as raw material for topic modeling (cf., of course, Cameron Blevins’ fantastic Martha Ballard diary project). Unlike novels and other forms of print culture, diaries are relatively easy to cut into logical pieces and, at least in the case of Martha Ballard and Luca Landucci, offer a fascinating glimpse of one writer chronicling events over a long period of time.
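The date assignment could be sketched along these lines. It assumes each entry opens with one of the heading patterns quoted earlier, that “detto” means “same month as the previous entry”, and that a missing year carries over from the last one stated. The function name is hypothetical, and the month table holds only the modern names plus the variant spellings seen in this post.

```python
import re

MONTHS = {
    "gennaio": 1, "giennaio": 1, "febbraio": 2, "febraio": 2, "marzo": 3,
    "aprile": 4, "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "settenbre": 9, "ottobre": 10, "novembre": 11,
    "dicembre": 12,
}

# Day, optional "di", a month name or "detto", then an optional four-digit year.
HEADING = re.compile(r"^E a dì (\d{1,2})(?: di)? (\w+)(?: (\d{4}))?")

def iso_dates(entries):
    """Return one ISO date string (or None) per entry, carrying month and year forward."""
    year = month = None
    dates = []
    for entry in entries:
        match = HEADING.match(entry)
        if not match:
            dates.append(None)
            continue
        day, word, stated_year = int(match.group(1)), match.group(2).lower(), match.group(3)
        if word in MONTHS:
            month = MONTHS[word]
        elif word != "detto":
            dates.append(None)  # unrecognized month spelling: flag for review
            continue
        if stated_year:
            year = int(stated_year)
        if year is None or month is None:
            dates.append(None)
        else:
            dates.append(f"{year:04d}-{month:02d}-{day:02d}")
    return dates

headings = ["E a dì 8 aprile 1498", "E a dì 7 detto", "E a dì 16 di febraio 1495", "senza data"]
dates = iso_dates(headings)
print(dates)
```

Entries that come back as `None` are the ones to check against the printed edition. The sketch also takes the written year at face value, and ignores calendar questions such as the Florentine practice of starting the year on 25 March.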
Back from a two-day workshop at Indiana University, run by the HathiTrust Research Center folks. HathiTrust is, loosely, a consortium of universities and research libraries in the US which gave volumes to be digitized in the Google Books project. Though these volumes are all available on books.google.com (at least those out-of-copyright), the HathiTrust exists to ensure that duplicate copies are held by a consortium of all the libraries in perpetuity, in case Google isn’t around in a few decades.
The attraction to literary folks — or at least those of us with an interest in data mining — is obvious: tens of millions of books, all digitized in the space of a few years. The tricky question has always been: how do we get access to them, and what kind of algorithms can we run on a corpus of this scale?
For small-scale projects in the past, many of us were content to build up infrastructure at our local institutions: a big server here, a metadata database there… I set up such systems when I was at UCLA, to work on the 19th Century Nordic-language corpus. This works fine for several hundred or thousand books, but doesn’t make any sense for projects at the million scale.
So instead, the future of large-scale text mining may look something like this:
The screenshot above shows me doing a word count of a bunch of Norwegian-language texts. The first item, a hyphen, is actually an artifact of the beta-quality tokenizing code, but the rest are the words for it, I, and, so, etc. This is actually a hybrid model, where a Python script goes out and gets zipped objects, decompresses them, and then does the word counting on a local machine. Most users are likely to use a combination of such local analytics, coupled with large-scale (and somewhat less-frequently-run) examinations of large chunks of the collection. That’s what the picture below shows off: visualizations from the Mellon-funded SEASR/MEANDRE toolkit.
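The local half of that hybrid model might look something like the sketch below. It is not the workshop's script: the fetching step is left out entirely, the archive is built in memory as a stand-in for a downloaded volume, and the member file names are invented.

```python
import io
import re
import zipfile
from collections import Counter

def count_words_in_zip(zip_bytes, encoding="utf-8"):
    """Count word tokens across every .txt member of a zip archive held in memory."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.endswith(".txt"):
                continue
            text = archive.read(name).decode(encoding)
            # \w+ matches runs of letters and digits, so a bare hyphen is never a token.
            counts.update(re.findall(r"\w+", text.lower()))
    return counts

# Stand-in archive with two tiny "pages" of invented text.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("00000001.txt", "det og jeg - og det")
    archive.writestr("00000002.txt", "så og det")

counts = count_words_in_zip(buffer.getvalue())
print(counts.most_common(3))
```

Tokenizing on word characters is one simple way to keep a stray hyphen out of the top of the list.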
But regardless of the implementation details, we’re on the cusp of being able to do truly interesting things with out-of-copyright works from the 19th and early 20th Centuries. All we need is for groups like the HathiTrust to navigate a very treacherous landscape of eager literary folks and suspicious publishing industry lawyers. If they succeed, we could derive real insight into all the cultural output that’s been preserved for centuries, and digitized quite suddenly in the span of my own grad school career.

One of the software features that Amazon rolled out in today’s launch of the new Kindle Fire and Kindle Paperwhite was “X-Ray”, an umbrella term for reference and lookup on both texts and movies. The X-Ray name seems to encompass a number of different user interface affordances, some of which rely upon explicit metadata, and others which work on implicit — or latent — patterns. I’m interested in the latter of these: how Amazon is exposing some hidden structures of text, and what that might mean for folks who are interested in text mining.
When I saw the demo of X-Ray for Books in The Verge’s liveblog of the Amazon event, I was captivated by an image that showed a kind of “heatmap” of character saturation over the course of a book:

This screen, shown on the new Paperwhite touch-enabled Kindle, shows something really neat: words which are the names of characters are recognized as such, and the frequency of each is mapped to a thin linear rectangle, presumably stretching from the start to the end of the book. (In fact, the scope of the visualization is selectable via the buttons: Page, Chapter and Book.) More black bars towards the end means a character shows up in the last part of the book, and vice-versa.
Of course, we don’t know very much about how Amazon is implementing this feature. Does the visualization of, say, Ramsay Bolton imply that he’s absent from the last fifth of the book? Or just that he doesn’t say very much? We can imagine a couple ways of going about preparing the underlying data:
1. A real human being goes through and marks each page for the presence of a character.
2. A robot tries to guesstimate, based on such tricks as linking “Ramsay Bolton” with “Ramsay” and “he” in the following paragraphs.
3. Some hybrid of these — robotic guesses followed by human spot-checking.
There are lots of ways in which such a simple visualization is problematic (is the whale mostly absent, or undeniably and constantly present, throughout Moby Dick?), but as a first-order approximation this kind of visualization works well, especially when main characters are lined up all on one screen as shown above. Patterns immediately become clear: you can imagine how the Ghost and Fortinbras would bracket Hamlet.
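A rough sketch of how such a strip could be computed, assuming the crudest possible approach: count literal mentions of a name in equal-width slices of the text. Nothing here reflects how Amazon actually does it, the function names are made up, and the toy “book” is just repeated names.

```python
import re

def presence_bins(text, name, bins=20):
    """Count mentions of `name` in each of `bins` equal-width slices of the text."""
    counts = [0] * bins
    length = max(len(text), 1)
    for match in re.finditer(r"\b" + re.escape(name) + r"\b", text):
        # Scale the character offset of the mention to a bin index.
        counts[min(match.start() * bins // length, bins - 1)] += 1
    return counts

def render(counts):
    """Draw the bins as a one-line strip: '#' where the name appears, '.' elsewhere."""
    return "".join("#" if count else "." for count in counts)

# Toy text: the Ghost early, Hamlet in the middle, Fortinbras at the end.
book = "Ghost " * 3 + "Hamlet " * 10 + "Fortinbras " * 2
for character in ("Ghost", "Hamlet", "Fortinbras"):
    print(f"{character:<12}{render(presence_bins(book, character, bins=10))}")
```

Because it only counts literal names, this sketch would show a character as absent wherever the text refers to him only as “he”, which is exactly the ambiguity raised above.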
How does Amazon choose what terms to show for this? You don’t want to show the distribution of the word “the” in most books, and lots of other common words would result in banal visualizations. This problem in machine learning is called Named-Entity Recognition, and Amazon’s marketing material provides some hints at their approach:
For Kindle Touch, Amazon invented X-Ray - a new feature that lets customers explore the “bones of the book.” […] Amazon built X-Ray using its expertise in language processing and machine learning, access to significant storage and computing resources with Amazon S3 and EC2, and a deep library of book and character information. The vision is to have every important phrase in every book.
One simple Named Entity Recognition technique (in English) is to take capitalized words and see if they match against common lists of proper names, places and other kinds of things. If so, make them into links and see if reference sources such as Wikipedia have content for them. The addition of Shelfari (an Amazon acquisition which aggregates info specific to books, including characters and physical locations) to the stable of lookup sources is a great move, because it’s more likely to have fancruft data that Wikipedia deems non-notable (the names of Bilbo Baggins’s distant cousins, etc.).
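The capitalized-word technique can be sketched in a few lines, assuming a small hypothetical lookup set in place of Wikipedia or Shelfari data. The names, the sample sentence, and the function are all illustrative.

```python
import re
from collections import Counter

# Hypothetical lookup list standing in for a real reference source.
KNOWN_NAMES = {"Bilbo", "Gandalf", "Shire"}

def find_entities(text, known=KNOWN_NAMES):
    """Count capitalized words that also appear in the lookup list."""
    # Candidate: one uppercase ASCII letter followed by lowercase letters.
    candidates = re.findall(r"\b[A-Z][a-z]+\b", text)
    return Counter(word for word in candidates if word in known)

sample = "The wizard Gandalf visited Bilbo. Bilbo was not pleased. The Shire was quiet."
entities = find_entities(sample)
print(entities)
```

The lookup list is what filters out sentence-initial words such as “The”. The pattern is ASCII-only and handles single words only, so multi-word names and accented capitals would need a more careful matcher.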

What’s really interesting about this feature, though, is that it’s not really new at all: X-Ray for Books was introduced along with the first touch-screen Kindle way back in September 2011. Check out this video excerpt from that event, which shows the heatmap visualizations, as well as how X-Ray loads in Wikipedia and Shelfari content:
In a way, it’s unsurprising that this X-Ray feature debuted with the Kindle Touch: no other hardware Kindle device has the kind of user interaction model that freeform querying requires. Although even first-generation Kindles let you highlight passages and save simple excerpts as clippings, these were line-based, not at the level of individual words. Software-only Kindle implementations, such as the iOS app, do not presently offer X-Ray — only dictionary lookups, together with auxiliary buttons for Google and Wikipedia. The Kindle App for Mac OS X offers “Book Extras by Shelfari,” but no frequency visualizations of any kind. (Kindle for Android offers only a dictionary, Kindle for WebOS offers not even that.)
And the restriction of Book X-Ray to the Kindle Touch — a kind of “stealth rollout” — helps make my surprise at a feature that had been around over a year a little less embarrassing. The Touch, with its awkward infrared-based touch sensor, deep bezel, and slow page-turning performance, can hardly have been the most popular device sold. Marco Arment’s review of several Kindles and clones rated the Kindle Touch pretty poorly — and, interestingly, made no mention of the X-Ray feature. The technology was hiding in plain sight, in one of the more awkward of the Kindle cousins.
Given the importance of touchscreens to this kind of textual exploration, however, I am surprised that Amazon’s other touch-enabled Kindle, the original Kindle Fire, didn’t have this Book X-Ray feature. I’ve never used a first-generation Fire myself (reviews have been pretty bad), but support forum posts confirm that it does not have the X-Ray feature for books. The comments in this Verge article about the Fire software changes seem to suggest nobody knows if first-gen Fire hardware will see the revamped software, and with it the X-Ray features.
Even Kindle Touch owners themselves could be excused for overlooking the feature. The hardware Kindles are known for excellent screen readability, amazing battery life, and a great book catalog — using them as data-mining tools may be too much of a leap. In fact, a recent thread about X-Ray on the Kindle Touch on Amazon’s support forum seems to show a mixture of confusion and disinterest about the X-Ray.
But with the large number of new hardware Kindles that are now available from Amazon with touch interfaces (only one legacy $69 device is left without touch capabilities, if I understand things correctly), we can expect the X-Ray feature to — perhaps — gain more visibility. If Amazon ever consistently deploys Book X-Ray across all of its hardware and software platforms with arbitrary term selection (via mouse or finger) — desktop and mobile apps inclusive — then there’s the possibility that students reading novels on the Kindle platform will have access to an easy and engaging first glimpse into text mining. And not just on “classic” texts that we teach in literature seminars — from Amazon’s examples, Book X-Ray may well first be deployed on popular contemporary fiction. Although term frequency by itself may seem like a simplistic thing to focus on, in Ted Underwood’s words:
…you can build complex arguments on a very simple foundation. Yes, at bottom, text mining is often about counting words. But a) words matter and b) they hang together in interesting ways, like individual dabs of paint that together start to form a picture.
Kindle Book X-Ray may be the first chance many people get to hold a digital humanities paintbrush.
I wanted to post this picture I took last month at the Henry Ford Museum:
The Violano Virtuoso was a kind of automated music cabinet created by Swedish immigrant Henry Konrad Sandell around the turn of the century. The four constantly-rotating discs, which perform the same function as the bow in a traditional violin, allow it to play all strings at once — something very few people have ever heard.
I was at Pumping Station One, a Chicago hackerspace, last night to build a small Arduino-based device. Though the Arduino chipset can be reprogrammed to do a variety of things (blink lights, spin motors, make beeps), this particular device will eventually turn into a home-built TV-B-Gone, a gadget invented by Mitch Altman to emit “OFF” codes in as many infrared control languages as possible. Here, Mitch holds up a version of the device midway through its construction:
Though of course the remotes that come with televisions don’t contain anything like sophisticated Arduino-based circuitry, all they have to do is work with one particular model. TV-B-Gone devices need to be updated to reflect an ever-changing mix of IR codes, plus possibly do other things like lower volume. Thus the re-programmable microcomputer inside:
My job this summer is to turn these parts:
… into this:
The parts on the table above are only some of what I’ll need: cameras, assorted technical bits and bobs, glass, and paint are still to come. The kit, and the Open Hardware plans on which it’s based [.zip file], are from Dan Reetz and the community at DIYBookScanner.org. Dan’s been at this for many years: I built a version of the (much smaller) earlier design back when I lived in Seattle. This new iteration is branded the “Hackerspace Scanner”, in the sense that it has the scale and complexity to serve well in that context.
The key innovation of this project, in my opinion, is that it solves one big problem while purposefully not solving another. The problem it leaves unsolved is automatic page turning: you can find scanners that do this all robotically, with fancy vacuums, but they’re extremely expensive. Google itself, as we can readily see, prefers to keep humans in the middle of this process.
So if there are no fancy robots turning the pages, how do you process a whole book in a reasonable amount of time? The answer is a system of pulleys and bungee cords that counterbalances the V-shaped glass bracket, which keeps both sides of the volume flat and ready to be photographed cleanly. Attached to the handle that lets you raise and lower the bracket easily is a bicycle handbrake, which can be repurposed to either 1) trigger physical remote shutters, if you use DSLRs, or 2) actuate some kind of software-based USB signal to custom CHDK firmware on Canon point-and-shoots. Either way, you can integrate the raising, page-turning, lowering, and photographing steps into one (relatively) simple motion.
I’ll try to document my progress here on the blog as the project goes forward.
I’ve survived the first week of an NEH Digital Humanities Summer Institute, held on the UCLA campus. This workshop is focused on “digital cultural mapping,” a specialty here at UCLA, and we’re availing ourselves of both the great facilities such as the new Digital Humanities center in Young Library…
…as well as local talent such as Yoh, an expert in web-based geospatial data:
The UCLA facilities are seriously great. Check out the high contrast of the rear-projection system, surrounded by nouveau Mid-Century Modern detailing that responds to the 1964 architecture of Young Library:
I’m working on a project on the Florentine Renaissance — here’s Niall showing his work on the acoustic landscape of Florence: the bells citizens would have heard at each hour, as legislative, executive, private, and ecclesiastical towers rang throughout the day:


























