Recently in tech Category

Charleston Conference 2014

In early November I had a chance to travel to South Carolina to attend the Charleston Conference.

Together with colleagues from Yale and ProQuest, I presented a panel on Data Mining on Vendor-Digitized Collections. We focused on our analysis of ProQuest’s Vogue Digital Archive — a collection of every issue since Vogue’s inception in 1892 — as a case study of what libraries and scholars can do with vendor data. Our examples were mostly drawn from our public website that showcases various visual and textual experiments with the Vogue data:

Robots Reading Vogue

Here’s how we framed the larger issue:

This session delves into the rapidly emerging topic of text and data mining (TDM), from the perspectives of a digital humanist, a librarian, a collection development officer and a product manager for a major vendor of digitized content. We will show concrete examples of TDM on a large vendor-digitized in-copyright collection: the Vogue Archive from ProQuest, with over 400,000 pages of text and images dating from 1892 to the present. Several projects in progress at Yale have illuminated the appeal of TDM applications on Vogue to researchers across disciplines ranging from gender studies to art history to computer science. We will address issues of copyright and licensing, file formats and research platforms, new forms of research enabled by TDM, and how vendors and librarians can work to support digital humanities projects. Session attendees who are new to this topic will learn what TDM is and how they might engage with it in their own work. Audience members who have familiarity with TDM will be encouraged to share their experiences and insights.

After the conference was over I had a chance to enjoy a day in the city free of presentation responsibilities. The weather was very pleasant and the sky cooperated to show off the architecture in its best light:


Vogue project coverage

Some nice coverage — both online and in print — of the data-mining project I’ve been working on at Yale.

We have a page on our project site that talks about these “cover averages” in more detail.


DH 2014 Lausanne


Just got back from the 2014 Digital Humanities conference, held this year in Lausanne, Switzerland.


As you can see from that interior shot of the auditorium, the facilities were more than adequate. Here’s a shot of the SwissTech Convention Center from outside:

SwissTech Convention Center

The natural setting was dramatic as well, although the mountains were shrouded by clouds when we were there:


Together with a grad student, I presented on a project to map the 170,000 Farm Security Administration photographs taken during the 1930s-40s:

Photogrammar presentation
Thanks to Miriam Posner for this picture

I also co-presented a poster with Mats Malm from Gothenburg, about ways of surfacing related content in large digital literary collections:

Poster presentation
Thanks to Jenny Bergenmar for this picture

We were pretty busy during the week, but there was some time to view the sights in town, such as the Escaliers du marché:

Escaliers du marché

One of the things I am trying to do is take spur-of-the-moment street photographs, alongside more traditional tourist shots of city halls and churches. This couple was sitting in front of the (very impressive) entryway to the cathedral:

Outside the cathedral

Around the corner, a young girl was amusing herself by jumping off of a low wall:



UNLV Library

Had a great tour of Lied Library at the University of Nevada, Las Vegas (UNLV), as part of the ALA’s annual convention. The highlight for me was seeing their work in building interactive exhibit walls out of multiple multi-touch displays:

UNLV Digital Collections

Although the hardware comes from a vendor, the software layer is all written in-house, and supports great features like multiple users zooming different photographs all at the same time:

UNLV Digital Collections

They’re using this system to highlight their special collections, which include fantastic information about local history:

UNLV Digital Collections

The Digital Collections team also has an iPad app for use in their exhibits:

UNLV Digital Collections

Overall this tour was a real wake-up call to think about what libraries can accomplish when they focus on their unique collections and think about presenting their material in new and more accessible ways. Special Collections reading rooms at many institutions can be rather intimidating places, with rules on how to handle delicate material. These rules are there for a reason, but they tend to discourage shoving artifacts around a table to juxtapose or compare. The mass digitization of Special Collections material gave new life to these items on our computer screens, but didn’t do much to let us physically manipulate these images: we struggle with resizing browser windows and spawning new tabs to get all our material situated on our 11” laptop screens.

The multi-touch exhibit panels I saw at Lied Library, when coupled with UNLV’s own software layer, point towards a future where multiple users can grab, drag, resize and otherwise physically manipulate artifacts on a large surface. Because the images are linked into the metadata in the digital library (ContentDM), there’s good contextual information about what you’re seeing — but it never gets in the way of the visual materials themselves.

Speaking of the visual, I was also struck by some great design work in a newly remodeled multipurpose room:

UNLV Library

UNLV describes its goals for this space as follows:

This space will serve as a state-of-the-art venue to showcase UNLV Libraries’ special collections and comprehensive records that document our region’s history—making them accessible to everyone to experience our past by touching and feeling these artifacts. This event space will serve as a center for academic and cultural dialog, panel discussions, readings and lectures by gaming fellows, authors and visiting scholars.


Jef Raskin's Canon Cat

The ‘newest’ computer I’ve added to my collection is a 1987 machine designed by Jef Raskin, the Canon Cat. Based on the ideas of modeless text editing that Raskin had developed while at Apple, including the Swyft hardware and software enhancements to the Apple //, the Cat arguably represents the original vision for the Macintosh project.

Raskin is actually depicted in the Ashton Kutcher film Jobs, in a brief scene where Steve takes over the Macintosh team, unceremoniously ejecting the bearded and professorial Raskin from the team that Jef had led since 1979. The machine that emerged from the new, Steve Jobs-managed Mac team was very different from the minimalist appliance that Raskin envisioned: a high-resolution bitmapped display, a mouse, and sophisticated software that required larger amounts of RAM all pushed up the price to $2,500 at launch.


The Canon Cat, which came to market three years later, is the closest vision of what Raskin’s original idea for the Mac might have been. Raskin was able to extend the ideas of the “Leap” keys that he had pioneered on the Apple //-based Swyft systems, giving users two new meta-keys (LEAP FORWARD and LEAP BACKWARD) that, when held down during typing, zapped the user to the exact place in the text where the typed characters occurred. With such a radical system of navigation, there was no need for a visible file system, or discrete documents in different windows: the Cat provided a scrolling window containing everything you had ever written (or at least as much as could fit on a 3.5” disk). The closest parallel today would be navigating a webpage by using the browser’s “Find on page” command.

Although such a system is a big cognitive leap from how most software (before and since) worked, Raskin claimed that keeping all your writing in a big scrolling list would avoid several levels of cognitive abstraction that traditional GUIs required the user to master. Reading through the Canon Cat manual is interesting because, whether due to Raskin’s focus on simplicity and appliance computing or to the sponsoring corporation Canon’s historical focus on office equipment, one encounters a machine much more limited and simple than the Apple Macintosh, despite its shipping three years later. This was a machine for office workers, writers, and others who only needed to manipulate text; Desktop Publishers need not apply. But that doesn’t mean there wasn’t room for innovation: after mastering the distinctive LEAP key system, users could select text and “compute” it using the built-in math functions, or select a phone number underneath a friend’s name and have the Cat’s built-in modem dial them directly. Restricting the functional domain of the computer down to the realm of A-Z meant that the user experience could be tightly honed, the computer booting in mere seconds and the screen instantly responsive with an image of the exact place you had left off typing.

There are some scans of the Cat’s manuals and marketing materials available online, and a great photo of Raskin himself using a Cat on the website of the Computer History Museum.



Alabama Digital Humanities Center

Back from a great trip to Tuscaloosa and the University of Alabama’s Digital Humanities Center. My first time on campus; the library is host to a great DH space and some interesting projects that touch on both local history and broader topics.

University of Alabama library


Visualizing Texts on Maps

Just a sneak preview of a website I’m building: the idea is to have an algorithm read through a bunch of magazine articles, find place names, and map those places onto a city or region of a country. Then, when the user hovers over the place, a list of sentences appears on the right, showing the context for each occurrence. Each small red dot is a place mentioned in the text; larger blobs of color show concentrations of places.
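As a rough illustration of the idea (not the site's actual code), here is a minimal Python sketch that looks up place names from a small list and collects the sentences that mention each one. The `GAZETTEER` dictionary, its approximate coordinates, the sample text, and the function name are all made up for the example; a real version would need a much larger place list and a better sentence splitter.

```python
import re
from collections import defaultdict

# Hypothetical gazetteer: place name -> approximate (latitude, longitude).
# A real project would load a far larger list from a geographic database.
GAZETTEER = {
    "Hyde Park": (41.79, -87.59),
    "Tuscaloosa": (33.21, -87.57),
}

def sentences_by_place(text, gazetteer):
    """Map each gazetteer place found in `text` to the sentences mentioning it."""
    # Naive sentence splitter: break after ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    hits = defaultdict(list)
    for sentence in sentences:
        for place in gazetteer:
            # Word boundaries keep a place name from matching inside a longer word.
            if re.search(r"\b" + re.escape(place) + r"\b", sentence):
                hits[place].append(sentence)
    return dict(hits)

# Invented sample text, only for demonstration.
sample = (
    "I first came to Hyde Park in 1993. "
    "The library in Tuscaloosa was new to me. "
    "Hyde Park has changed a great deal since then."
)
contexts = sentences_by_place(sample, GAZETTEER)
for place, found in contexts.items():
    # The coordinates would place the dot; the sentences would fill the hover list.
    print(place, GAZETTEER[place], found)
```

The dictionary of sentences per place is the piece that would feed the hover list; the coordinates are what a mapping layer would plot.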



Hyde Park Transport, 1926

I was curious what it would look like to take this Rand McNally map of Chicago public transportation networks circa 1926 and overlay it on Google’s modern aerial photography of the city.


When I first came to Hyde Park in 1993, there was an El track along 63rd street. That part of the Green Line was torn down by the time I left in 1997.


63rd is certainly lighter and more open, but the demolition of the overhead tracks has hardly spurred economic development. The colored lines of the 1926 map hover over a mostly-vacant streetscape of today.


I did this experiment with ArcGIS (for rectification) and GeoServer (for serving the map tiles into Google Earth).


DHCS 2012

We just wrapped up the 2012 Chicago Colloquium on Digital Humanities & Computer Science (DHCS). Below is a picture from one of my favorite poster sessions, a team working on Classical Greek & Latin textual applications for both iOS and Android:

Parsing and Presentation of Digital Classical Texts on Mobile Platforms

More pictures from the conference are online here.


I’ve been doing some work recently on the 15th-century diary of Luca Landucci, a Florentine apothecary who kept a detailed diary for over half a century.

The text we’re working with has surely been normalized to a degree, but still represents a departure from contemporary Italian. This has an impact on text preparation — the process of stoplisting, or stripping out less-meaningful words to leave behind really interesting terms and phrases for analysis. Normally I just run some utilities that list the most-frequently-used words in a corpus (such as “the” in English) and use the results as a starting point for a stoplist of words to ignore. But I noticed in the case of Landucci’s diary that he begins almost every entry with a sentence that contains the name of a month. Because these months are really more metadata than data — a kind of in-line datestamp — I figured it would make more sense to strip them out and so set about adding the Italian names for all twelve months into the stoplist. But after re-running my analysis, I was confused to see the months still popping up. The answer, of course, was that Luca wrote in a kind of Tuscan dialect, which differs in small ways from modern Italian. I stoplisted “settembre”, but Luca’s “settenbre” made it through the filter intact, along with “giennaio” (modern “gennaio”). So one by-product of this project will be a custom stoplist for Luca’s orthography — perhaps broad enough to function on other Tuscan text of the 1400s, perhaps not.
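To make the stoplisting step concrete, here is a small Python sketch: take the most frequent words as a starting stoplist, then add month names, including the variant spellings mentioned above. The function names and the toy entry lines are my own inventions for illustration (written in the style of the diary's date headers, not quoted from it), and the month set is deliberately incomplete.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase, then keep runs of letters (accented letters included).
    return re.findall(r"[^\W\d_]+", text.lower())

def build_stoplist(texts, top_n, extra_words=()):
    """Start from the top_n most frequent words, then add hand-picked extras."""
    counts = Counter(word for text in texts for word in tokenize(text))
    stoplist = {word for word, _ in counts.most_common(top_n)}
    stoplist.update(extra_words)
    return stoplist

# Modern month names plus the two variant spellings noted above (not a full list).
MONTHS = {"gennaio", "giennaio", "settembre", "settenbre"}

# Toy lines imitating the diary's date headers; not actual diary text.
entries = ["E a dì 3 di settenbre", "E a dì 9 di giennaio", "E a dì 7 detto"]

stoplist = build_stoplist(entries, top_n=3, extra_words=MONTHS)
filtered = [word for word in tokenize(entries[0]) if word not in stoplist]
print(sorted(stoplist))
print(filtered)
```

The point of keeping the variants in a separate set is that they can be reviewed and extended by hand as new spellings turn up in the frequency lists.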

Another part of the preparation process was to split up the diary into individual days. Since the text I received was not marked up in any kind of TEI or XML, I had to figure out how to do this myself. Luckily, Luca was remarkably consistent, starting every entry with “E a dì 8 aprile 1498…”, “E a dì 16 di febraio 1495”, “E a dì 7 detto”, and the like. I sliced apart the diary into chunks based on this pattern, using the GNU project’s csplit utility:

csplit -k -f landucci_chunks/$i -z -b _%05d landucci.txt '/E a dì /' '{*}'
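For anyone without csplit at hand, roughly the same slicing can be sketched in Python: start a new chunk at every line that contains the marker phrase, and skip empty chunks (which is what the -z flag does above). The function name and the toy text are assumptions for the example, not part of the project.

```python
import re

def split_entries(text, pattern=r"E a dì "):
    """Split `text` into chunks, starting a new chunk at each line matching `pattern`."""
    chunks, current = [], []
    for line in text.splitlines(keepends=True):
        # A matching line closes the chunk in progress, unless nothing has accumulated yet.
        if re.search(pattern, line) and current:
            chunks.append("".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("".join(current))
    return chunks

# Toy stand-in for the diary file; the placeholder lines are invented.
toy_text = (
    "(front matter)\n"
    "E a dì 8 aprile 1498 ...\n"
    "(continuation of the entry)\n"
    "E a dì 7 detto ...\n"
)
chunks = split_entries(toy_text)
for number, chunk in enumerate(chunks):
    print(number, repr(chunk))
```

As with the csplit command, anything before the first match ends up in its own leading chunk.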

This left me with about 1,600 individual entries, of relatively uniform size. There were, of course, outliers: one epic entry approached 12k, and several dozen entries were only a few words long, a kind of Renaissance Twitter stream. I might consider going back and removing the very large diary entry, or splitting it into a few chunks, at a later stage. But in general, I was happy with the distribution of size of the individual entries. As with so much else in human culture, the varying length of Luca’s writing follows a power-law curve:


Running Mallet’s topic modeling code on the resulting files was my next step. I chose twenty topics to begin with, since it seemed like a reasonable first guess.


I was pleasantly surprised with the cohesion of the results — some interesting patterns included a topic about that most famous Florentine family, the Medici:


Also nice to see was a topic I’ve (provisionally) labeled “Economics”, which goes into the intricacies of taxation, inflation and commodity pricing:


This is just the first cut of the data. I want to refer to a printed edition and figure out how many entries there “should” be, to see if it’s anywhere near the 1,580 that my chunking algorithm produced. And it would be nice to assign an ISO-format date (like 1459-12-21) to each entry, so that we could graph topic saturation over time. This might let us see how certain topics, such as economics, waxed and waned as a matter of concern for Luca. But even at this early stage, I think this project reinforces the appropriateness of diaries as raw material for topic modeling (cf., of course, Cameron Blevins’ fantastic Martha Ballard diary project). Unlike novels and other forms of print culture, diaries are relatively easy to cut into logical pieces and, at least in the case of Martha Ballard and Luca Landucci, offer a fascinating glimpse of one writer chronicling events over a long period of time.
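The ISO-date idea could start from something like the following Python sketch, which reads the entry headers quoted earlier and carries the month and year forward when an entry just says “detto”. This is an untested-on-the-real-text assumption about how the headers behave: the month table only includes the variant spellings mentioned in this post, the regular expression and function name are hypothetical, and no calendar conversion of any kind is attempted.

```python
import re
from datetime import date

# Modern month names plus the variant spellings seen in this post.
MONTHS = {
    "gennaio": 1, "giennaio": 1, "febbraio": 2, "febraio": 2, "marzo": 3,
    "aprile": 4, "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "settenbre": 9, "ottobre": 10, "novembre": 11, "dicembre": 12,
}

# Day, optional "di", a month word (or "detto"), and an optional four-digit year.
HEADER = re.compile(r"E a dì (\d{1,2}) (?:di )?(\w+)(?: (\d{4}))?")

def iso_dates(headers):
    """Return an ISO date string per header, or None where the date can't be resolved."""
    month = year = None
    result = []
    for header in headers:
        match = HEADER.match(header)
        if match is None:
            result.append(None)
            continue
        day, word, year_text = match.groups()
        if word.lower() != "detto":
            # "detto" means "the same month as before"; anything else is looked up.
            month = MONTHS.get(word.lower())
        if year_text:
            year = int(year_text)
        if month is None or year is None:
            result.append(None)
        else:
            result.append(date(year, month, int(day)).isoformat())
    return result

# The three header styles quoted above (not consecutive entries in the diary).
dates = iso_dates(["E a dì 8 aprile 1498…", "E a dì 7 detto", "E a dì 16 di febraio 1495"])
print(dates)
```

Entries that come back as None would be the ones to check by hand against a printed edition.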


HathiTrust UnCamp

"Correction Rules!"

Back from a two-day workshop at Indiana University, run by the HathiTrust Research Center folks. HathiTrust is, loosely, a consortium of universities and research libraries in the US which gave volumes to be digitized in the Google Books project. Though these volumes are all available through Google (at least those that are out of copyright), the HathiTrust exists to ensure that duplicate copies are held by a consortium of all the libraries in perpetuity, in case Google isn’t around in a few decades.

The attraction to literary folks — or at least those of us with an interest in data mining — is obvious: tens of millions of books, all digitized in the space of a few years. The tricky question has always been: how do we get access to them, and what kind of algorithms can we run on a corpus of this scale?

Corpus stats

For small-scale projects in the past, many of us were content to build up infrastructure at our local institutions: a big server here, a metadata database there… I set up such systems when I was at UCLA, to work on the 19th Century Nordic-language corpus. This works fine for several hundred or thousand books, but doesn’t make any sense for projects at the million scale.

So instead, the future of large-scale text mining may look something like this:

Running against the HathiTrust corpus

The screenshot above shows me doing a word count of a bunch of Norwegian-language texts. The first item, a hyphen, is actually an artifact of the beta-quality tokenizing code, but the rest are the words for it, I, and, so, etc. This is actually a hybrid model, where a Python script goes out and gets zipped objects, decompresses them, and then does the word counting on a local machine. Most users are likely to use a combination of such local analytics, coupled with large-scale (and somewhat less-frequently-run) examinations of large chunks of the collection. That’s what the picture below shows off: visualizations from the Mellon-funded SEASR/MEANDRE toolkit.

Epic Datawall
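The local half of that hybrid model, unzipping a volume and counting its words, might look something like this Python sketch. It is not the HathiTrust Research Center's script: the fetching step is left out entirely, the function name is hypothetical, and the zip is built in memory from a few invented Norwegian words so that the example runs on its own.

```python
import io
import re
import zipfile
from collections import Counter

def count_words_in_zip(zip_bytes):
    """Count words across every .txt member of a zip archive held in memory."""
    counts = Counter()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if name.endswith(".txt"):
                text = archive.read(name).decode("utf-8")
                # Keep runs of letters only, so stray hyphens never become tokens.
                counts.update(re.findall(r"[^\W\d_]+", text.lower()))
    return counts

# Stand-in for a downloaded volume: a tiny zip with two invented "pages".
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("page1.txt", "det og jeg og det")
    archive.writestr("page2.txt", "jeg så det")

counts = count_words_in_zip(buffer.getvalue())
print(counts.most_common(3))
```

Tokenizing on letters only is one way to avoid the stray-hyphen problem mentioned above, at the cost of splitting hyphenated compounds.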

But regardless of the implementation details, we’re at the cusp of being able to do truly interesting things with out-of-copyright works from the 19th and early 20th Centuries. All we need is for groups like the HathiTrust to navigate a very treacherous landscape of eager literary folks and suspicious publishing industry lawyers. If they succeed, we could derive real insight into all the cultural output that’s been preserved for centuries, and digitized quite suddenly in the span of my own grad school career.

Occupy the Cyberinfrastructure Building


Amazon Kindle and Text Mining


One of the software features that Amazon rolled out in today’s launch of the new Kindle Fire and Kindle Paperwhite was “X-Ray”, an umbrella term for reference and lookup on both texts and movies. The X-Ray name seems to encompass a number of different user interface affordances, some of which rely upon explicit metadata, and others which work on implicit — or latent — patterns. I’m interested in the latter of these: how Amazon is exposing some hidden structures of text, and what that might mean for folks who are interested in text mining.

When I saw the demo of X-Ray for Books in The Verge’s liveblog of the Amazon event, I was captivated by an image that showed a kind of “heatmap” of character saturation over the course of a book:


This screen, shown on the new Paperwhite touch-enabled Kindle, shows something really neat: words which are the names of characters are recognized as such, and the frequency of each is mapped to a thin linear rectangle, presumably stretching from the start to the end of the book. (In fact, the scope of the visualization is selectable via the buttons: Page, Chapter and Book.) More black bars towards the end means a character shows up in the last part of the book, and vice-versa.
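To get a feel for what such a bar involves, here is a minimal Python sketch that counts mentions of a name in equal slices of a text and draws one shaded character per slice. It is only my guess at the general shape of the computation, not Amazon's implementation; the function names, the bin count, and the toy text are all assumptions.

```python
def mention_bins(text, name, bins=10):
    """Count occurrences of `name` in each of `bins` equal slices of `text`."""
    counts = [0] * bins
    length = max(len(text), 1)
    start = text.find(name)
    while start != -1:
        # Map the character offset of the mention to a slice of the text.
        counts[min(start * bins // length, bins - 1)] += 1
        start = text.find(name, start + len(name))
    return counts

def render(counts):
    """Draw one character per bin; darker blocks mean more mentions."""
    shades = " ░▒▓█"
    peak = max(counts) or 1
    # Round up so that any non-zero count gets a visible shade.
    return "".join(shades[(c * (len(shades) - 1) + peak - 1) // peak] for c in counts)

# Toy "book": one early mention, two in the middle, none at the end.
toy_book = "Ahab " + "sea " * 10 + "Ahab Ahab " + "sea " * 10
bars = mention_bins(toy_book, "Ahab", bins=5)
print(bars)
print("[" + render(bars) + "]")
```

Changing the text passed in (a page, a chapter, the whole book) is all it would take to mimic the scope buttons.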

Of course, we don’t know very much about how Amazon is implementing this feature. Does the visualization of, say, Ramsay Bolton imply that he’s absent from the last fifth of the book? Or just that he doesn’t say very much? We can imagine a couple ways of going about preparing the underlying data:

1. A real human being goes through and marks each page for the presence of a character.
2. A robot tries to guesstimate, based on such tricks as linking “Ramsay Bolton” with “Ramsay” and “he” in the following paragraphs.
3. Some hybrid of these — robotic guesses followed by human spot-checking.

There are lots of ways in which such a simple visualization is problematic (is the whale mostly absent, or undeniably and constantly present, throughout Moby Dick?), but as a first-order approximation this kind of visualization works well, especially when main characters are lined up all on one screen as shown above. Patterns immediately become clear: you can imagine how the Ghost and Fortinbras would bracket Hamlet.

How does Amazon choose what terms to show for this? You don’t want to show the distribution of the word “the” in most books, and lots of other common words would result in banal visualizations. This problem in machine learning is called Named-Entity Recognition, and Amazon’s marketing material provides some hints at their approach:

For Kindle Touch, Amazon invented X-Ray - a new feature that lets customers explore the “bones of the book.” […] Amazon built X-Ray using its expertise in language processing and machine learning, access to significant storage and computing resources with Amazon S3 and EC2, and a deep library of book and character information. The vision is to have every important phrase in every book.

One simple Named Entity Recognition technique (in English) is to take capitalized words and see if they match against common lists of proper names, places and other kinds of things; if so, make them into links and see if reference sources such as Wikipedia have content for them. The addition of Shelfari (an Amazon acquisition which aggregates info specific to books, including characters and physical locations) to the stable of lookup sources is a great move, because it’s more likely to have fancruft data that Wikipedia deems non-notable (the names of Bilbo Baggins’s distant cousins, etc.).
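The capitalized-word technique is simple enough to sketch in a few lines of Python. This is the naive lookup approach described above, not whatever Amazon actually runs; the `KNOWN_NAMES` set, the function name, and the sample sentence are invented for illustration.

```python
import re

# Hypothetical lookup list; a real system would draw on far larger sources.
KNOWN_NAMES = {"Hamlet", "Fortinbras", "Ramsay Bolton", "Bilbo Baggins"}

def find_entities(text, known_names):
    """Return known names found within runs of capitalized words, in order of appearance."""
    found = []
    # A run is one or more consecutive capitalized words, e.g. "Later Fortinbras".
    for run in re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text):
        words = run.split()
        i = 0
        while i < len(words):
            # Try the longest candidate first so multi-word names win over their parts.
            for j in range(len(words), i, -1):
                candidate = " ".join(words[i:j])
                if candidate in known_names:
                    found.append(candidate)
                    i = j
                    break
            else:
                i += 1
    return found

# Invented sample; sentence-initial words like "Later" show why the lookup is needed.
sample = "Later Fortinbras arrives. The prince Hamlet is dead."
entities = find_entities(sample, KNOWN_NAMES)
print(entities)
```

Capitalization alone would flag “Later” and “The” as candidates; it is the lookup list that throws them out, which is why the quality of the lookup sources matters so much.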


What’s really interesting about this feature, though, is that it’s not really new at all: X-Ray for Books was introduced along with the first touch-screen Kindle way back in September 2011. Check out this video excerpt from that event, which shows the heatmap visualizations, as well as how X-Ray loads in Wikipedia and Shelfari content:

In a way, it’s unsurprising that this X-Ray feature debuted with the Kindle Touch: no other hardware Kindle device has the kind of user interaction model that freeform querying requires. Although even first-generation Kindles let you highlight passages and save simple excerpts as clippings, these were line-based, not at the level of individual words. Software-only Kindle implementations, such as the iOS app, do not presently offer X-Ray — only dictionary lookups, together with auxiliary buttons for Google and Wikipedia. The Kindle App for Mac OS X offers “Book Extras by Shelfari,” but no frequency visualizations of any kind. (Kindle for Android offers only a dictionary, Kindle for WebOS offers not even that.)

And the restriction of Book X-Ray to the Kindle Touch (a kind of “stealth rollout”) helps make my surprise at a feature that had been around for over a year a little less embarrassing. The Touch, with its awkward infrared-based touch sensor, deep bezel, and slow page-turning performance, can hardly have been the most popular device sold. Marco Arment’s review of several Kindles and clones rated the Kindle Touch pretty poorly and, interestingly, made no mention of the X-Ray feature. The technology was hiding in plain sight, in one of the more awkward of the Kindle cousins.

Given the importance of touchscreens to this kind of textual exploration, however, I am surprised that Amazon’s other touch-enabled Kindle, the original Kindle Fire, didn’t have this Book X-Ray feature. I’ve never used a first-generation Fire myself (reviews have been pretty bad) but support forum posts confirm that it does not have the X-Ray feature for books. The comments in this Verge article about the Fire software changes seem to suggest nobody knows if first-gen Fire hardware will see the revamped software, and with it the X-Ray features.

Even Kindle Touch owners themselves could be excused for overlooking the feature. The hardware Kindles are known for excellent screen readability, amazing battery life, and a great book catalog — using them as data-mining tools may be too much of a leap. In fact, a recent thread about X-Ray on the Kindle Touch on Amazon’s support forum seems to show a mixture of confusion and disinterest about the X-Ray.

But with the large number of new hardware Kindles that are now available from Amazon with touch interfaces (only one legacy $69 device is left without touch capabilities, if I understand things correctly), we can expect the X-Ray feature to — perhaps — gain more visibility. If Amazon ever consistently deploys Book X-Ray across all of its hardware and software platforms with arbitrary term selection (via mouse or finger) — desktop and mobile apps inclusive — then there’s the possibility that students reading novels on the Kindle platform will have access to an easy and engaging first glimpse into text mining. And not just on “classic” texts that we teach in literature seminars — from Amazon’s examples, Book X-Ray may well first be deployed on popular contemporary fiction. Although term frequency by itself may seem like a simplistic thing to focus on, in Ted Underwood’s words:

you can build complex arguments on a very simple foundation. Yes, at bottom, text mining is often about counting words. But a) words matter and b) they hang together in interesting ways, like individual dabs of paint that together start to form a picture.

Kindle Book X-Ray may be the first chance many people get to hold a digital humanities paintbrush.

