11 Nov 2012

Topic-Modeling a Quattrocento Tuscan Diary

topic-diary.png

I’ve been doing some work recently on the 15th-century diary of Luca Landucci, a Florentine apothecary who kept a detailed diary for over half a century.

The text we’re working with has surely been normalized to a degree, but it still departs from contemporary Italian. This affects text preparation, in particular stoplisting: stripping out less-meaningful words to leave behind the really interesting terms and phrases for analysis. Normally I just run some utilities that list the most frequently used words in a corpus (such as “the” in English) and use the results as a starting point for a stoplist of words to ignore. But I noticed in the case of Landucci’s diary that he begins almost every entry with a sentence that contains the name of a month. Because these months are really more metadata than data (a kind of in-line datestamp), I figured it would make more sense to strip them out, and so set about adding the Italian names for all twelve months to the stoplist. But after re-running my analysis, I was confused to see the months still popping up. The answer, of course, was that Luca wrote in a kind of Tuscan dialect, which differs in small ways from modern Italian. I stoplisted “settembre”, but Luca’s “settenbre” made it through the filter intact, along with “giennaio” (modern “gennaio”). So one by-product of this project will be a custom stoplist for Luca’s orthography, perhaps broad enough to function on other Tuscan text of the 1400s, perhaps not.
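One way to hunt for these variant spellings, rather than waiting for them to surface in the results, is to compare every token in the corpus against the modern month names and flag near misses. The sketch below is only an illustration of that idea, not the method used for this project: the function name `find_month_variants` and the similarity cutoff of 0.8 are assumptions chosen for the example, and the matching relies on Python's standard `difflib` module.

```python
from difflib import get_close_matches

# Modern Italian month names, as originally added to the stoplist.
MODERN_MONTHS = [
    "gennaio", "febbraio", "marzo", "aprile", "maggio", "giugno",
    "luglio", "agosto", "settembre", "ottobre", "novembre", "dicembre",
]

def find_month_variants(tokens, cutoff=0.8):
    """Map each distinct token that resembles a modern month name to that name.

    The cutoff is a hypothetical starting value, not a tuned one.
    """
    variants = {}
    for token in set(tokens):
        word = token.lower()
        # get_close_matches returns at most n candidates scoring >= cutoff.
        match = get_close_matches(word, MODERN_MONTHS, n=1, cutoff=cutoff)
        if match:
            variants[word] = match[0]
    return variants

tokens = ["E", "a", "dì", "settenbre", "giennaio", "detto", "fiorini"]
variants = find_month_variants(tokens)
# Candidate stoplist: the modern names plus whatever variants were flagged.
stoplist = set(MODERN_MONTHS) | set(variants)
```

The flagged tokens are candidates only and would still need checking by eye, since an ordinary word such as “maggiore” is close enough to “maggio” to pass the same cutoff.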

Another part of the preparation process was to split up the diary into individual days. Since the text I received was not marked up in TEI or any other kind of XML, I had to figure out how to do this myself. Luckily, Luca was remarkably consistent, starting every entry with “E a dì 8 aprile 1498…”, “E a dì 16 di febraio 1495”, “E a dì 7 detto”, and the like. I sliced the diary apart into chunks based on this pattern, using the GNU project’s csplit utility:

csplit -k -f landucci_chunks/$i -z -b _%05d landucci.txt '/E a dì /' '{*}'

This left me with about 1,600 individual entries, of relatively uniform size. There were, of course, outliers: one epic entry approached 12k, and several dozen entries were only a few words long, a kind of Renaissance Twitter stream. I might consider going back and removing the very large diary entry, or splitting it into a few chunks, at a later stage. But in general, I was happy with the size distribution of the individual entries. As with so much else in human culture, the varying length of Luca’s writing follows a power-law curve:

diary-curve.png
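For anyone who would rather stay in a scripting language, the chunking and the outlier check described above could be sketched roughly as follows. This is an approximation under stated assumptions, not a drop-in replacement for the csplit command: it assumes every entry opens at the start of a line with “E a dì ” (csplit matches the pattern anywhere in a line), and the names `split_entries` and `size_outliers`, along with both thresholds, are hypothetical.

```python
import re

# Assumption: each entry begins at the start of a line with "E a dì ".
# The lookahead makes the split zero-width, so the marker stays in the chunk.
ENTRY_START = re.compile(r"(?=^E a dì )", flags=re.MULTILINE)

def split_entries(text):
    """Split the diary text before each line that opens with 'E a dì '."""
    # Splitting on an empty match needs Python 3.7+. Empty or
    # whitespace-only chunks are dropped, in the spirit of csplit -z.
    return [chunk for chunk in ENTRY_START.split(text) if chunk.strip()]

def size_outliers(entries, tiny_words=5, huge_bytes=10_000):
    """Return (tiny, huge): indices of very short and very long entries.

    Both thresholds are illustrative guesses, not values from the project.
    """
    tiny = [i for i, e in enumerate(entries) if len(e.split()) <= tiny_words]
    huge = [i for i, e in enumerate(entries)
            if len(e.encode("utf-8")) >= huge_bytes]
    return tiny, huge

sample = "Proemio\nE a dì 8 aprile 1498, piovve.\nE a dì 7 detto.\n"
entries = split_entries(sample)
tiny, huge = size_outliers(entries)
```

Anything before the first marker (here the made-up line “Proemio”) comes back as its own chunk, just as csplit writes the front matter to its first output file.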

Running Mallet’s topic modeling code on the resulting files was my next step. I chose twenty topics to begin with, since it seemed like a reasonable first guess.

diary-topics.png

I was pleasantly surprised with the cohesion of the results — some interesting patterns included a topic about that most famous Florentine family, the Medici:

diary-medici.png

Also nice to see was a topic I’ve (provisionally) labeled “Economics”, which goes into the intricacies of taxation, inflation and commodity pricing:

diary-economics.png

This is just the first cut of the data. I want to refer to a printed edition and figure out how many entries there “should” be, to see if it’s anywhere near the 1,580 that my chunking algorithm produced. And it would be nice to assign an ISO-format date (like 1459-12-21) to each entry, so that we could graph topic saturation over time. This might let us see how certain topics, such as economics, waxed and waned as a matter of concern for Luca. But even at this early stage, I think this project reinforces the appropriateness of diaries as raw material for topic modeling (cf., of course, Cameron Blevins’ fantastic Martha Ballard diary project). Unlike novels and other forms of print culture, diaries are relatively easy to cut into logical pieces and, at least in the case of Martha Ballard and Luca Landucci, offer a fascinating glimpse of one writer chronicling events over a long period of time.
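The ISO-date idea mentioned above might be approached along these lines. This is a speculative sketch, not something already built for the project. It assumes that “detto” in an opening such as “E a dì 7 detto” means “of the month already named”, so the month and year carry forward from the previous entry. The month table holds the modern names plus only the variant spellings quoted in this post, and the function name `assign_dates` is made up for the example.

```python
import re
import unicodedata
from datetime import date

# Modern month names plus only the variants quoted in the post; a real
# table would need every spelling that actually occurs in the diary.
MONTHS = {
    "gennaio": 1, "giennaio": 1, "febbraio": 2, "febraio": 2, "marzo": 3,
    "aprile": 4, "maggio": 5, "giugno": 6, "luglio": 7, "agosto": 8,
    "settembre": 9, "settenbre": 9, "ottobre": 10, "novembre": 11,
    "dicembre": 12,
}

# Day, optional "di", a month word or "detto", and an optional 4-digit year.
OPENING = re.compile(r"^E a dì (\d{1,2})(?: di)? (\w+)(?: (\d{4}))?")

def assign_dates(entries):
    """Return one ISO date string (or None) per entry, in order."""
    month = year = None
    result = []
    for entry in entries:
        iso = None
        m = OPENING.match(unicodedata.normalize("NFC", entry))
        if m:
            day, word, yr = int(m.group(1)), m.group(2).lower(), m.group(3)
            if word in MONTHS:
                month = MONTHS[word]
            # Assumption: "detto" reuses the month (and year) seen last.
            if word in MONTHS or word == "detto":
                if yr:
                    year = int(yr)
                if month and year:
                    try:
                        iso = date(year, month, day).isoformat()
                    except ValueError:
                        iso = None  # impossible day, e.g. the 31st of April
        result.append(iso)
    return result

openings = ["E a dì 8 aprile 1498…", "E a dì 16 di febraio 1495",
            "E a dì 7 detto", "Senza data"]
dates = assign_dates(openings)
```

Entries that cannot be resolved come back as None rather than a guess, which makes them easy to count and fix by hand. The sketch takes the year in the text at face value and does not attempt any calendar conversion.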