Recently in tech Category

Digital Preservation

While home over the holidays, I went looking for the earliest digital document I could find. I think the best contender is this circa-1985 5.25” floppy disk, which probably holds WordStar files:

I have a few machines with disk controllers that can drive such a floppy — the drives themselves go for about $10-$30 on eBay. The problems I’m likely to encounter are media failure due to physical degradation and bits flipped over the decades by stray electromagnetic radiation; either could turn part or all of the files into gibberish. In that case, there’s a modern floppy controller called KryoFlux that hooks up to a modern PC and uses more advanced (even heroic) techniques, re-reading the bad parts of the disk repeatedly, hundreds of thousands of times. With luck, even badly damaged disks can give up some of their secrets.
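
The KryoFlux works down at the level of magnetic flux transitions, which I won’t pretend to model here. But as a toy illustration of the underlying “read it many times and vote” idea, here is a minimal Python sketch that combines repeated passes over the same disk image with a bitwise majority vote; the filenames and the number of passes are made up.

    # Toy illustration of multi-read recovery (not KryoFlux's actual pipeline):
    # re-read a flaky disk many times and take a bitwise majority vote.

    def majority_vote(reads: list[bytes]) -> bytes:
        """Combine several reads of the same data, bit by bit."""
        result = bytearray(len(reads[0]))
        for i in range(len(result)):
            for bit in range(8):
                ones = sum((r[i] >> bit) & 1 for r in reads)
                if ones * 2 > len(reads):      # a majority of reads saw a 1
                    result[i] |= 1 << bit
        return bytes(result)

    # e.g. combine ten passes over the same 360 KB disk (hypothetical files)
    reads = [open(f"pass_{n:02d}.img", "rb").read() for n in range(10)]
    with open("recovered.img", "wb") as out:
        out.write(majority_vote(reads))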

Charleston Conference 2014

In early November I had a chance to travel to South Carolina to attend the Charleston Conference.

Together with colleagues from Yale and ProQuest, I presented a panel on Data Mining on Vendor-Digitized Collections. We focused on our analysis of ProQuest’s Vogue Digital Archive — a collection of every issue since Vogue’s inception in 1892 — as a case study of what libraries and scholars can do with vendor data. Our examples were mostly drawn from our public website that showcases various visual and textual experiments with the Vogue data:

Robots Reading Vogue

Here’s how we framed the larger issue:

This session delves into the rapidly emerging topic of text and data mining (TDM), from the perspectives of a digital humanist, a librarian, a collection development officer and a product manager for a major vendor of digitized content. We will show concrete examples of TDM on a large vendor-digitized in-copyright collection: the Vogue Archive from ProQuest, with over 400,000 pages of text and images dating from 1892 to the present. Several projects in progress at Yale have illuminated the appeal of TDM applications on Vogue to researchers across disciplines ranging from gender studies to art history to computer science. We will address issues of copyright and licensing, file formats and research platforms, new forms of research enabled by TDM, and how vendors and librarians can work to support digital humanities projects. Session attendees who are new to this topic will learn what TDM is and how they might engage with it in their own work. Audience members who have familiarity with TDM will be encouraged to share their experiences and insights.

After the conference was over I had a chance to enjoy a day in the city free of presentation responsibilities. The weather was very pleasant and the sky cooperated to show off the architecture in its best light:

Vogue project coverage

Some nice coverage — both online and in print — of the data-mining project I’ve been working on at Yale.

We have a page on our project site that talks about these “cover averages” in more detail.

DH 2014 Lausanne

Lausanne

Just got back from the 2014 Digital Humanities conference, held this year in Lausanne, Switzerland.

DH2014

As you can see from that interior shot of the auditorium, the facilities were more than adequate. Here’s a shot of the SwissTech Convention Center from outside:

SwissTech Convention Center

The natural setting was dramatic as well, although the mountains were shrouded by clouds when we were there:

Lausanne

Together with a grad student, I presented on a project to map the 170,000 Farm Security Administration photographs taken during the 1930s-40s:

Photogrammar presentation
Thanks to Miriam Posner for this picture

I also co-presented a poster with Mats Malm from Gothenburg, about ways of surfacing related content in large digital literary collections:

Poster presentation
Thanks to Jenny Bergenmar for this picture

We were pretty busy during the week, but there was some time to view the sights in town, such as the Escaliers du marché:

Escaliers du marché

One of the things I am trying to do is take spur-of-the-moment street photographs, alongside more traditional tourist shots of city halls and churches. This couple was sitting in front of the (very impressive) entryway to the cathedral:

Outside the cathedral

Around the corner, a young girl was amusing herself by jumping off of a low wall:

Jumper

UNLV Library

Had a great tour of the University of Nevada, Las Vegas’s Lied Library as part of the ALA’s annual conference. The highlight for me was seeing their work in building interactive exhibit walls out of multiple multi-touch displays:

UNLV Digital Collections

Although the hardware comes from a vendor, the software layer is all written in-house, and supports great features like multiple users zooming different photographs all at the same time:

UNLV Digital Collections

They’re using this system to highlight their special collections, which include fantastic information about local history:

UNLV Digital Collections

The Digital Collections team also has an iPad app for use in their exhibits:

UNLV Digital Collections

Overall this tour was a real wake-up call to think about what libraries can accomplish when they focus on their unique collections and present their material in new and more accessible ways. Special Collections reading rooms at many institutions can be rather intimidating places, with rules on how to handle delicate material. These rules are there for a reason, but they tend to discourage shoving artifacts around a table to juxtapose or compare. The mass digitization of Special Collections material gave new life to these items on our computer screens, but didn’t do much to let us physically manipulate the images: we struggle with resizing browser windows and spawning new tabs to get all our material situated on our 11” laptop screens.

The multi-touch exhibit panels I saw at Lied Library, when coupled with UNLV’s own software layer, point towards a future where multiple users can grab, drag, resize and otherwise physically manipulate artifacts on a large surface. Because the images are linked to the metadata in the digital library (CONTENTdm), there’s good contextual information about what you’re seeing — but it never gets in the way of the visual materials themselves.

Speaking of the visual, I was also struck by some great design work in a newly remodeled multipurpose room:

UNLV Library

UNLV describes its goals for this space as follows:

This space will serve as a state-of-the-art venue to showcase UNLV Libraries’ special collections and comprehensive records that document our region’s history—making them accessible to everyone to experience our past by touching and feeling these artifacts. This event space will serve as a center for academic and cultural dialog, panel discussions, readings and lectures by gaming fellows, authors and visiting scholars.

Jef Raskin's Canon Cat


The ‘newest’ computer I’ve added to my collection is a 1987 machine designed by Jef Raskin, the Canon Cat. Built on the modeless text-editing ideas Raskin began developing at Apple and later refined in the Swyft hardware and software enhancements to the Apple //, the Cat arguably represents the original vision for the Macintosh project.

Raskin is actually depicted in the Ashton Kutcher film Jobs, in a brief scene where Steve takes over the Macintosh team, unceremoniously ejecting the bearded and professorial Raskin from the group Jef had led since 1979. The machine that emerged from the new, Jobs-managed Mac team was very different from the minimalist appliance Raskin envisioned: a high-resolution bitmapped display, a mouse, and sophisticated software that demanded ever more RAM all pushed the price up to $2,500 at launch.

Canon Cat advertisement

The Canon Cat, which came to market three years later, is probably the closest realization of what Raskin’s original idea for the Mac might have been. Raskin was able to extend the “Leap” keys he had pioneered on the Apple //-based Swyft systems, giving users two new meta-keys (LEAP FORWARD and LEAP BACKWARD) that, when held down while typing a word or phrase, zapped the user to the exact place in the text where it occurred. With such a radical system of navigation, there was no need for a visible file system or discrete documents in different windows — the Cat provided a scrolling window containing everything you had ever written (or at least as much as could fit on a 3.5” disk). The closest parallel today would be navigating a webpage by using the browser’s “Find on page” command.
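
A toy model of that mechanism in Python, nothing like the Cat’s actual firmware: an incremental search forward or backward from the cursor through one continuous buffer.

    # A sketch of LEAP-style navigation: jump the cursor to the next or
    # previous occurrence of whatever the user types while holding a LEAP key.

    def leap(text: str, cursor: int, pattern: str, forward: bool = True) -> int:
        """Return the new cursor position, or the old one if nothing matches."""
        if forward:
            hit = text.find(pattern, cursor + 1)
        else:
            hit = text.rfind(pattern, 0, cursor)
        return hit if hit != -1 else cursor

    buffer = "everything you have ever written lives in one long scroll"
    cursor = leap(buffer, 0, "ever")              # LEAP FORWARD to "ever written"
    cursor = leap(buffer, cursor, "you", False)   # LEAP BACKWARD to "you"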

Although such a system is a big cognitive leap from how most software (before and since) has worked, Raskin claimed that keeping all your writing in one big scrolling list would spare the user several levels of cognitive abstraction that traditional GUIs require them to master. Reading through the Canon Cat manual is interesting because — whether due to Raskin’s focus on simplicity and appliance computing, or the sponsoring corporation Canon’s historical focus on office equipment — one encounters a machine much more limited and simple than the Apple Macintosh, despite shipping three years later.

This was a machine for office workers, writers, and others who only needed to manipulate text; desktop publishers need not apply. But that doesn’t mean there wasn’t room for innovation: after mastering the distinctive LEAP-key system, users could select text and “compute” it using the built-in math functions, or select a phone number underneath a friend’s name and have the Cat’s built-in modem dial it directly. Restricting the functional domain of the computer to the realm of A-Z meant that the user experience could be tightly honed: the computer booted in mere seconds, the screen instantly showing the exact place you had left off typing.

There are some scans of the Cat’s manuals and marketing materials available from canoncat.net, and a great photo of Raskin himself using a Cat on the Computer History Museum’s website.

Jef Raskin using a Canon Cat

Alabama Digital Humanities Center

Back from a great trip to Tuscaloosa and the University of Alabama’s Digital Humanities Center. My first time on campus; the library hosts a terrific DH space and some interesting projects that touch on both local history and broader topics.

University of Alabama library

Visualizing Texts on Maps

Just a sneak preview of a website I’m building — the idea is to have an algorithm read through a bunch of magazine articles, find place names, and map those places onto a city or region of a country. Then, when the user hovers over a place, a list of sentences appears on the right, showing the context for each occurrence. Each small red dot is a place mentioned in the text; larger blobs of color show concentrations of places.

Preview of the text-mapping interface
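
The site’s actual plumbing isn’t shown here, but the extraction step can be sketched with an off-the-shelf named-entity recognizer like spaCy, keeping the surrounding sentence for the hover display. Geocoding each name to map coordinates would be a separate lookup against a gazetteer; the filename below is a placeholder.

    # Sketch: pull place names out of an article with spaCy's NER,
    # recording the sentence around each mention for display.
    from collections import defaultdict

    import spacy

    nlp = spacy.load("en_core_web_sm")   # assumes the small English model is installed

    def places_with_context(article_text: str) -> dict[str, list[str]]:
        """Map each place name to the sentences in which it occurs."""
        contexts = defaultdict(list)
        for ent in nlp(article_text).ents:
            if ent.label_ in ("GPE", "LOC", "FAC"):   # cities/countries, locations, landmarks
                contexts[ent.text].append(ent.sent.text.strip())
        return dict(contexts)

    for place, sentences in places_with_context(open("article.txt").read()).items():
        print(place, "->", len(sentences), "mention(s)")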

Hyde Park Transport, 1926

I was curious what it would look like to take this Rand McNally map of Chicago public transportation networks circa 1926 and overlay it on Google’s modern aerial photography of the city.

The 1926 map overlaid on modern aerial imagery

When I first came to Hyde Park in 1993, there was an El track along 63rd Street. That part of the Green Line was torn down by the time I left in 1997.

The 1926 map along 63rd Street

63rd is certainly lighter and more open, but the demolition of the overhead tracks has hardly spurred economic development. The colored lines of the 1926 map hover over a mostly-vacant streetscape of today.

Close-up of 63rd Street

I did this experiment with ArcGIS (for rectification) and GeoServer (for serving the map tiles into Google Earth).
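
For anyone without an ArcGIS license, the rectification step can be approximated with the open-source GDAL library. This is a sketch of that substitute workflow, not what I actually ran; the ground control points are invented, and a real map needs many more of them.

    # Georeference a scanned map with GDAL: attach ground control points
    # (pixel/line -> lon/lat), then warp into WGS84 for any tile server.
    from osgeo import gdal

    gcps = [
        # gdal.GCP(lon, lat, z, pixel, line) -- values here are invented
        gdal.GCP(-87.6000, 41.7800, 0, 512.0, 1024.0),
        gdal.GCP(-87.5800, 41.7800, 0, 3800.0, 1030.0),
        gdal.GCP(-87.6000, 41.8000, 0, 520.0, 180.0),
        gdal.GCP(-87.5800, 41.8000, 0, 3810.0, 175.0),
    ]

    gdal.Translate("map_gcp.tif", "map_scan.tif", GCPs=gcps, outputSRS="EPSG:4326")
    gdal.Warp("map_warped.tif", "map_gcp.tif", dstSRS="EPSG:4326", tps=True)

From there, a tool like gdal2tiles.py (or GeoServer, as I used) can publish the warped GeoTIFF as tiles for Google Earth.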

DHCS 2012

We just wrapped up the 2012 Chicago Colloquium on Digital Humanities & Computer Science (DHCS). Below is a picture from one of my favorite poster sessions, a team working on Classical Greek & Latin textual applications for both iOS and Android:

Parsing and Presentation of Digital Classical Texts on Mobile Platforms

More pictures from the conference are online here.

Topic Modeling a Florentine Diary

I’ve been doing some work recently on the 15th-century diary of Luca Landucci, a Florentine apothecary who kept it faithfully for over half a century.

The text we’re working with has surely been normalized to a degree, but it still represents a departure from contemporary Italian. This has an impact on text preparation — the process of stoplisting, or stripping out less-meaningful words to leave behind the really interesting terms and phrases for analysis. Normally I just run some utilities that list the most frequently used words in a corpus (such as “the” in English) and use the results as a starting point for a stoplist of words to ignore. But I noticed that Landucci begins almost every entry with a sentence containing the name of a month. Because these months are really more metadata than data — a kind of in-line datestamp — I figured it would make more sense to strip them out, and so set about adding the Italian names for all twelve months to the stoplist.

But after re-running my analysis, I was confused to see the months still popping up. The answer, of course, was that Luca wrote in a kind of Tuscan dialect, which differs in small ways from modern Italian. I stoplisted “settembre”, but Luca’s “settenbre” made it through the filter intact, along with “giennaio” (modern “gennaio”). So one by-product of this project will be a custom stoplist for Luca’s orthography — perhaps broad enough to work on other Tuscan texts of the 1400s, perhaps not.
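
In sketch form, the revised workflow looks something like this; the frequency cutoff of 150 words is an arbitrary placeholder, and the variant spellings are just the ones I’ve caught so far.

    # Build a stoplist from the corpus's own most frequent words, then fold
    # in the month names in both modern Italian and Luca's Tuscan spellings.
    import re
    from collections import Counter

    text = open("landucci.txt", encoding="utf-8").read().lower()
    words = re.findall(r"[a-zà-ù]+", text)

    stoplist = {w for w, _ in Counter(words).most_common(150)}  # cutoff is a guess

    months_modern = ["gennaio", "febbraio", "marzo", "aprile", "maggio", "giugno",
                     "luglio", "agosto", "settembre", "ottobre", "novembre", "dicembre"]
    months_tuscan = ["giennaio", "febraio", "settenbre"]   # variants seen in the diary
    stoplist.update(months_modern + months_tuscan)

    with open("tuscan_stoplist.txt", "w", encoding="utf-8") as f:
        f.write("\n".join(sorted(stoplist)))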

Another part of the preparation process was to split the diary into individual days. Since the text I received was not marked up in TEI or any other kind of XML, I had to figure out how to do this myself. Luckily, Luca was remarkably consistent, starting every entry with “E a dì 8 aprile 1498…”, “E a dì 16 di febraio 1495”, “E a dì 7 detto”, and the like. I sliced the diary apart into chunks based on this pattern, using the GNU csplit utility:

    # -k: keep output files on error; -z: suppress empty output files
    # -f: output filename prefix; -b: suffix format for the chunk number
    # split at each line matching "E a dì ", repeating to the end of the file
    csplit -k -f landucci_chunks/$i -z -b _%05d landucci.txt '/E a dì /' '{*}'

This left me with about 1,600 individual entries of relatively uniform size. There were, of course, outliers — one epic entry approached 12k, and several dozen entries were only a few words long — a kind of Renaissance Twitter stream. I might consider going back and removing the very large diary entry, or splitting it into a few chunks, at a later stage. But in general I was happy with the size distribution of the individual entries. As with so much else in human culture, the varying length of Luca’s writing follows a power-law curve:

Distribution of diary entry lengths
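
A curve like the one above takes only a few lines to recreate; this sketch assumes the chunk files from the csplit step above are sitting in landucci_chunks/.

    # Plot sorted entry lengths on log-log axes, where a power law
    # shows up as a roughly straight line.
    import glob

    import matplotlib.pyplot as plt

    sizes = sorted((len(open(p, encoding="utf-8").read().split())
                    for p in glob.glob("landucci_chunks/*")), reverse=True)

    plt.loglog(range(1, len(sizes) + 1), sizes)
    plt.xlabel("entry rank")
    plt.ylabel("entry length (words)")
    plt.show()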

Running Mallet’s topic modeling code on the resulting files was my next step. I chose twenty topics to begin with, since it seemed like a reasonable first guess.
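
The run itself follows MALLET’s standard two-step command-line workflow (import, then train), driven here from Python for consistency with the other sketches; the file names are mine.

    # Import the directory of diary chunks, then train a 20-topic model.
    import subprocess

    subprocess.run(["mallet", "import-dir",
                    "--input", "landucci_chunks",
                    "--output", "diary.mallet",
                    "--keep-sequence",
                    "--stoplist-file", "tuscan_stoplist.txt"], check=True)

    subprocess.run(["mallet", "train-topics",
                    "--input", "diary.mallet",
                    "--num-topics", "20",
                    "--output-topic-keys", "diary_keys.txt",
                    "--output-doc-topics", "diary_doctopics.txt"], check=True)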

The twenty topics

I was pleasantly surprised by the cohesion of the results — some interesting patterns included a topic about that most famous Florentine family, the Medici:

The Medici topic

Also nice to see was a topic I’ve (provisionally) labeled “Economics”, which goes into the intricacies of taxation, inflation and commodity pricing:

The economics topic

This is just the first cut of the data — I want to refer to a printed edition and figure out how many entries there “should” be, to see if it’s anywhere near the 1,580 that my chunking algorithm produced. And it would be nice to assign an ISO-format date (like 1459-12-21) to each entry, so that we could graph topic saturation over time. This might let us see how certain topics, such as economics, waxed and waned as matters of concern for Luca. But even at this early stage, I think this project reinforces the appropriateness of diaries as raw material for topic modeling (cf., of course, Cameron Blevins’s fantastic Martha Ballard diary project). Unlike novels and other forms of print culture, diaries are relatively easy to cut into logical pieces and — at least in the case of Martha Ballard and Luca Landucci — offer a fascinating glimpse of one writer chronicling events over a long period of time.
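
The date assignment looks tractable, since the entry headers are so regular. A first sketch, which would still need to handle “E a dì 7 detto” entries by inheriting the month and year from the previous entry:

    # Parse "E a dì 8 aprile 1498"-style headers into ISO dates,
    # tolerating Luca's Tuscan spellings of the months.
    import re

    MONTHS = {"gennaio": 1, "giennaio": 1, "febbraio": 2, "febraio": 2,
              "marzo": 3, "aprile": 4, "maggio": 5, "giugno": 6,
              "luglio": 7, "agosto": 8, "settembre": 9, "settenbre": 9,
              "ottobre": 10, "novembre": 11, "dicembre": 12}

    HEADER = re.compile(r"E a dì (\d+) (?:di )?(\w+) (\d{4})")

    def iso_date(entry: str) -> str | None:
        m = HEADER.search(entry)
        if not m or m.group(2) not in MONTHS:
            return None                      # e.g. "E a dì 7 detto"
        day, month, year = m.groups()
        return f"{year}-{MONTHS[month]:02d}-{int(day):02d}"

    print(iso_date("E a dì 8 aprile 1498…"))   # -> 1498-04-08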

HathiTrust UnCamp

"Correction Rules!"

Back from a two-day workshop at Indiana University, run by the HathiTrust Research Center folks. HathiTrust is, loosely, a consortium of universities and research libraries in the US that contributed volumes to be digitized in the Google Books project. Though these volumes are all available on books.google.com (at least those out of copyright), HathiTrust exists to ensure that duplicate copies are held by the member libraries in perpetuity, in case Google isn’t around in a few decades.

The attraction for literary folks — or at least those of us with an interest in data mining — is obvious: tens of millions of books, all digitized in the space of a few years. The tricky question has always been: how do we get access to them, and what kinds of algorithms can we run on a corpus of this scale?

Corpus stats

For small-scale projects in the past, many of us were content to build up infrastructure at our local institutions: a big server here, a metadata database there… I set up such systems when I was at UCLA, to work on a 19th-century Nordic-language corpus. This works fine for several hundred or even a few thousand books, but doesn’t make any sense for projects at the million scale.

So instead, the future of large-scale text mining may look something like this:

Running against the HathiTrust corpus

The screenshot above shows me doing a word count on a bunch of Norwegian-language texts — the first item is an artifact of the beta-quality tokenizing code (a stray hyphen), but the rest are the Norwegian words for “it,” “I,” “and,” “so,” and the like. This is actually a hybrid model, where a Python script goes out and fetches zipped objects, decompresses them, and then does the word counting on a local machine. Most users are likely to use a combination of such local analytics, coupled with large-scale (and somewhat less frequently run) examinations of large chunks of the collection. That’s what the picture below shows off — visualizations from the Mellon-funded SEASR/MEANDRE toolkit.

Epic Datawall
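
The local half of that hybrid model is straightforward. Here’s the gist of the fetch-unzip-count loop, with a placeholder URL standing in for the real (and access-controlled) HathiTrust endpoints:

    # Fetch a zipped volume, unpack the page files in memory,
    # and tally word frequencies locally.
    import io
    import urllib.request
    import zipfile
    from collections import Counter

    counts = Counter()
    for url in ["https://example.org/volumes/vol001.zip"]:   # hypothetical endpoint
        data = urllib.request.urlopen(url).read()
        with zipfile.ZipFile(io.BytesIO(data)) as zf:
            for name in zf.namelist():
                if name.endswith(".txt"):
                    text = zf.read(name).decode("utf-8", errors="replace")
                    counts.update(text.lower().split())

    for word, n in counts.most_common(10):
        print(n, word)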

But regardless of the implementation details, we’re on the cusp of being able to do truly interesting things with out-of-copyright works from the 19th and early 20th centuries. All we need is for groups like HathiTrust to navigate a very treacherous landscape of eager literary folks and suspicious publishing-industry lawyers. If they succeed, we could derive real insight into all the cultural output that’s been preserved for centuries — and digitized quite suddenly, in the span of my own grad school career.

Occupy the Cyberinfrastructure Building
