Recently in tech Category

I started writing this post about making order-of-magnitude estimates, but it turned into more of a question of what counts as a “hit” when we search through scanned books.

Part I: Estimating Numbers

A few months back I had a question: how many volumes does Google Books actually have in the mainland Nordic languages?

It turns out answering this question is a little tricky. While you can constrain a query to a given language in the “Advanced Search” function of Google Books, you can’t just return a list of all books in that language. If you try to leave the “Search For” field blank, in an attempt to return a list of all books, you just get redirected to the home page of google.com, with nary an apology. So you have to search for something — anything — that will trigger at least one hit per book.

What can we search for that will be in every published volume in a language? The words for “I” wouldn’t necessarily work — there might be a history book without any dialogue. Searching for “said” would return nearly every novel, but very few plays.

What I settled on was to search for indefinite articles — the equivalent of “a” and “an” in English. (Definite articles in these languages are suffixed onto the noun, so that wouldn’t work.) Modern Nordic languages (except some kinds of Norwegian) have two grammatical genders, endearingly called neutrum and utrum in Swedish, or neuter and common-gender. (Yes, in the Swedish language at least, gender has indeed collapsed.) With some minor orthographic variations, these work out to “en” and “ett” in most cases. To sum up: if we search on either of these two words, we should get our one required hit in 99.999% of all texts.

(As an aside, you may wonder whether searching for one indefinite article or the other influences the search results. The answer is it does, but at very small levels that I suspect may be noise, or slightly-out-of-date indexes.)

To generate a quick and dirty graph, I chose to chunk the results in decade-long sections. I figured this would give me a sense of change over time, without making me do 100 separate queries times three languages.
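A minimal sketch of that chunking in Python, purely as illustration: the language labels, the 1800-1899 span and the function names are assumptions made up for the example, and nothing here talks to Google Books.

```python
from itertools import product

# Hypothetical labels; the survey covered these three languages.
LANGUAGES = ["Swedish", "Danish", "Norwegian"]


def decade_ranges(start_year, end_year):
    """Yield (first_year, last_year) pairs covering the span in ten-year chunks."""
    for first in range(start_year, end_year + 1, 10):
        yield first, min(first + 9, end_year)


def query_plan(start_year, end_year):
    """List one (language, first_year, last_year) query per language per decade."""
    decades = list(decade_ranges(start_year, end_year))
    return [(lang, first, last) for lang, (first, last) in product(LANGUAGES, decades)]


if __name__ == "__main__":
    plan = query_plan(1800, 1899)
    print(len(plan))  # 30
```

For a single century this plans 30 queries rather than the 300 a year-by-year survey would need.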

So! What are the results of this admittedly unscientific survey?

full-text_scandinavian_books_digitized_per_decade.gif

First off, we should say that all publishing does not cease in Scandinavia around 1920. I restricted my queries to full-text-available only, which essentially means no longer in copyright. Between libraries’ decisions to only scan out-of-copyright texts, and the search engine obeying my command to only display full-text books, the data just appears to go to zero.

Secondly, there is the question of which collections these books came from. In spot checking some of these results, I’ve found books from everywhere from Harvard to the NYPL to the Bavarian State Library. But we also know that the Royal Library in Denmark (which serves as both the national library and the library of the University of Copenhagen) has an agreement with Google Books to digitize parts of its collections. (See this presentation for more details.) So the very noticeable spikes in the Danish material may be an artifact of one library giving Google everything from 1900-1910 as a test, for example. We would need to compare this graph to an actual history of publishing in the three Nordic countries to know how well it maps to what was actually printed.

All told, however, my results seem to suggest that Google Books had roughly 160,000 full-text volumes in Norwegian, Danish and Swedish as of March 2010. Keep in mind, that’s texts of all types, including things like church records, some serials, and no doubt a lot of ephemera.

Part II: Results and Context

In the time since I did this quick and dirty analysis, Google has deployed a refinement in their search system: an attempt to weld together their discrete searches for media such as videos, images, books, status updates, etc. But this change has had a negative effect on our ability to conduct even approximate counting such as that shown in the graph above.

Most commonly called “The New Sidebar” in online discussions, this feature seemed to deploy in a distributed fashion between November 2009 and May 2010. Either as part of this update, or during roughly the same timeframe, Google Books stopped counting the number of volumes with a given word you searched for, and started counting the number of hits. So for example I can now tell you that the word “en” was used 83,000 times in Swedish-language books in the year 1900, but I can’t tell you how many books that represents. (Clearly, Sweden did not publish 227 books every day that year, or I would have never gotten through my General Exam reading list.)
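The back-of-the-envelope division behind that 227 can be checked in a couple of lines of Python. The 83,000 figure is the one quoted above; the variable names are just for illustration.

```python
import calendar

hits_for_en_in_1900 = 83_000  # hit count quoted in the text
days_in_1900 = 366 if calendar.isleap(1900) else 365  # 1900 was not a leap year

hits_per_day = hits_for_en_in_1900 / days_in_1900
print(round(hits_per_day))  # 227
```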

The new system is, undoubtedly, more useful for certain things, like finding the frequency of term occurrence over time. But unless I’m missing some big obvious button somewhere, we have lost the ability to count works that include a word or phrase. The only way we can get the total number of works, it seems, would be to go to the last results page for each query, count the total pages and then multiply by 10, the number of results on every page.
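Counting pages only brackets the answer, since the last page may be partly full. A rough sketch in Python, assuming ten results per page as described above; the function name and the page counts in the usage line are hypothetical.

```python
RESULTS_PER_PAGE = 10  # per the text: ten results on every page


def estimate_total_results(page_count, results_on_last_page=None):
    """Return (low, high) bounds on the number of results across page_count pages."""
    if page_count <= 0:
        return (0, 0)
    full_pages = (page_count - 1) * RESULTS_PER_PAGE
    if results_on_last_page is not None:
        # Counting the final page by hand pins the total down exactly.
        exact = full_pages + results_on_last_page
        return (exact, exact)
    # Otherwise the last page holds at least one result and at most a full page.
    return (full_pages + 1, full_pages + RESULTS_PER_PAGE)


if __name__ == "__main__":
    print(estimate_total_results(42))     # (411, 420)
    print(estimate_total_results(42, 3))  # (413, 413)
```

Multiplying by 10 gives the upper bound; counting the entries on the final page by hand removes the slack.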

Finally, it’s worth noting that the accuracy of Google Books’ metadata has been the subject of some debate. Even when the system reported the number of works with a given search term, as in the chart above, I was trusting both the publication date and the language fields. Geoff Nunberg’s investigation and critique of precisely these fields received a lot of attention last year; what I’ll link to, however, is the interesting post by Google’s Jon Orwant in the comments section below. My feeling is that in my own field, Scandinavian Literature, we may be facing questions of metadata quality that are as yet unexposed, because of the much smaller number of people who read these languages compared to English.

Still, there’s no doubt that eventually Google Books will be a scholarly resource of first measure, irrespective of whether it was originally intended to be or, indeed, is presently run to be. What will help is input and feedback from people working in literature (and corpus linguistics, to name a field with several decades’ more experience with problems like these). Sussing out what constitutes a “hit” (linguistic lemma or published volume) is one of the more interesting questions we’ll all have to think about in the future.

Cisco Cius

Part of CEO John Chambers’ speech here at the annual Cisco Convention was a surprise product announcement: a new business-focused tablet computer based on Google’s Android operating system. You can read more coverage of the intro from Engadget and Gizmodo, but on the show floor itself the new device was imprisoned behind glass:

Cisco Cius

Despite the business focus of the product itself, the device’s unveiling during the keynote used primary education as the context. Actors portraying students, parents and teachers put the tablet to work pitching the video-conferencing and e-textbook capabilities. (The latency of a satellite hookup to a research vessel scotched the dream of seamless telepresence during the demo, unfortunately.)

But whether Cisco chooses to focus on the classroom or the boardroom (or both), several questions remain about its entry into a crowded tablet market. Non-phone devices based on Android have had a rocky road to travel getting the key differentiator of that operating system, the open Android market, to work. Companies that have brought Android-based tablets to market, such as Archos, have found themselves both stuck with older versions of the OS and locked out of the vibrant Marketplace, a software distribution system much more open and less controlled than Apple’s App Store, but paradoxically unavailable for any device Google refuses to authorize.

Put another way: take away the apps that Google requires co-branding for (Gmail) and won’t allow tablets to use (Marketplace), and you end up with a much less compelling story for a competitor to the current market leader, Apple’s iPad. Though Cisco’s expertise in enterprise features such as IP telephony and video telepresence can make the Cius a well-fitting cog in a corporation’s existing IT infrastructure, users may wonder why they’re kept out of the dynamic and ever-growing Android software marketplace for seemingly arbitrary reasons.

How does a Fortune 500 company present its CEO to 10,000 customers during an economic downturn?

John Chambers keynote

With a lot of confidence, apparently. Despite backing the wrong horse in 2008, John Chambers is bullish on the economic recovery, as any CEO whose bottom line depends on expanding businesses would be.

John Chambers keynote

There’s evidence he has good reason to believe in Cisco’s performance. Since heralding a bold new assault on the data center last year — going to battle against systems integrators such as Dell and HP — Cisco has proven itself an unexpectedly strong competitor to traditional hardware companies in selling integrated server systems, incorporating everything from CPU to disks to, naturally, the routers and switches.

Cisco Live 2010

Instead of building all these elements themselves, Cisco has partnered with vendors such as EMC and their subsidiary VMware. Put the server, disk subsystem, and network switch into one box and you end up with a “VBlock”, which Cisco will ship to your door ready to go. It is the Lunchables of the data center:

Cisco Live 2010

More interesting than the back-office equipment, however, is Cisco’s new play for a “business tablet” — the Cius. I’ll cover that in my next post.

Cisco Live 2010

Returning to Cisco’s annual convention this summer (after a one-year absence) finds me in Las Vegas during 109° weather and reminds me that Nevada-based tech conferences are much more enjoyable in March than in the beginning of July. Luckily the hotel (Luxor) connects to the convention center (Mandalay Bay) through indoor passages, along which one can walk past oversized styrofoam logos such as these:

Cisco Live 2010

NeXT Software Release 1.0a

When I was in middle school, around 1987-88, a kid brought in a strange plastic box that contained a shiny round disc, rattling around within it. His father worked at NeXT, down in Silicon Valley, and he had given his son a (very) early example of the infamous Canon optical disc that would define the first NeXT computer.

Steve Jobs’ vision for the future was simple: without any other kind of permanent storage, users would keep their entire universe of files and operating system on a disk like the one seen above. They could move from machine to machine, taking with them hundreds of megabytes of digital files. This was a very neat concept in the late 80s, when hard disks were expensive and small. Manufactured by Canon, the magneto-optical drive would be one of the Cube’s defining aspects. A perusal of the underlying technology is rewarding. As the name implies, the discs combined magnetism and the optical spectrum in a unique way: a laser heated part of the disc to the Curie point, a temperature where the polarity could be flipped by electromagnets. That same bit could then be read by the laser (at much lower intensities) and its value determined by the varying properties of how light reflects from magnetized materials. Unlike traditional Winchester drives, the discs were as immune to dust as an audio CD, which made them perfect for transportation.

The reality didn’t quite live up to the promise. The optical system was four times as slow writing to the drive as it was reading from it, which meant that when virtual memory (always on in the Mach system) paged out to disk it brought the machine’s performance to a halt. NeXT eventually included small, inexpensive 40-megabyte hard drives in the cubes, preconfigured as swap space (and indeed too small to hold the operating system by itself). I suspect the most useful cubes had the optional 330 or 660 MB internal SCSI drives installed. Certainly by the time the Motorola 68040 had replaced the ‘030, the optical disc system had become more of a curiosity than a feature, and the last variant of the cube (NeXTDimension Turbo) actually removed support for the oddball device.

NeXT Software Release 1.0a

The ‘040 cube I had in Chicago in the mid-1990s still had an optical drive in it, but like most other examples, it had long since stopped functioning. There were a plethora of problems with the device, including an unfortunate cooling design of the cube itself which led to sensitive lenses and other components being coated by dust rushing into the unguarded disc slot. (NeXT eventually recommended reversing the flow of the fan at the back of the cube, so that air was pushed out of the slot, not in.) But even ‘new old stock’ optical drives have been known to be defective, possibly due to off-gassing of plastic components dulling actuator positioning templates.

It’s anyone’s guess how many functioning Canon optical drives exist at this point. Though the Cube used the standard SCSI connection for drives both internally and externally, the connection between the optical drive and the motherboard was proprietary, and declaimed as such in the technical information NeXT published about its own hardware.

IMG_0293.jpg

Interoperability standards for magneto-optical drives never really existed during the timeframe that the Cube was sold (even Mac users had to stick to one vendor for both disc and drive), and so it’s doubtful that any other kind of drive could read these orphaned discs. Were such a drive to be discovered, recent progress in emulating NeXT’s UFS file system under Mac OS X (its distant descendant) has at least made the files themselves theoretically recoverable.

Whether browsing for technical books on Amazon, or during the long process of writing a dissertation, I often come across references to volumes online that I’d like to see if I could get from my university’s library. Thanks to the explosion of computerized databases, the ISBN of these books is oftentimes exposed right alongside the title and author. And given the similar titles of many books, having that ISBN number is a great way to go directly to the volume I want, rather than sorting through a list of similar-sounding titles returned by a keyword search. (Ever try to find one specific book on iPhone programming?):

iphonebooks.png

Because the ISBN number works so well as a unique lookup, I thought that I could save some time by automating the lookup for an arbitrary ISBN against WorldCat, the union catalog of many of the world’s research libraries. In Mac OS X Snow Leopard, there’s a super-easy way to do this, by using some 10.6-specific ways to extend the Services menu.

(“Services” is a terrific, if underutilized, underpinning of OS X which actually dates from NeXTSTEP:

nextstep-services.png

The idea on the NeXT was that applications could deliver certain services across the system, such as a spelling checker or an equation solver, by populating a menu and acting on contextual selections. Services have undergone a lot of changes since 1988, and in the latest version of OS X, 10.6, they’ve experienced a renaissance due to the inclusion of Automator-generated AppleScript workflows.)

The robot that will make this all easy is called Automator, who lives in the Applications folder:

automator.png

Automator uses a combination of tools drawn from the entire OS, from the circa-System 7.1 scripting language called AppleScript, to system-level affordances such as the Clipboard, to whole standalone applications such as Safari. Automator lets you put these all together at an extremely high level, taking care of the ‘glue’ that binds each part together. If programming in Assembly is the equivalent of atomic engineering, and Objective-C is working at the level of cells, then Automator lets you go all the way past Lego and into the comforting, easy-to-grab world of Duplo.

What we want Automator to do for us is to act upon any ISBN we highlight with the text cursor, and pass that number off to WorldCat’s ISBN resolver. (Which, conveniently enough, is worldcat.org/isbn/…) Here’s what the entire Automator Workflow will look like when it’s done:

automator-whole.png

To get started, we’ll fire up Automator and choose Service from its initial list of different kinds of templates:

automator-service.png

Before we add anything to the resulting blank box, we need to tell Automator the general terms of the Service we’re building. Luckily, the defaults are spot-on: our ISBN lookup script will accept text in any app (allowing you to use it in Word as well as a web browser or email client) and won’t replace what you’ve selected. (A more advanced Service might replace a highlighted ISBN with a full formatted citation — but that’s a good bit more complicated).

automator-text.png

Now we can begin dragging elements into the large blank area on the right, as if they were big plastic blocks. We’ll start with some code to prefix the WorldCat ISBN resolver’s URL onto the text that Automator receives from the user, so go to the Utilities folder and select Run AppleScript:

automator-runapplescript.png

This gives us a container that can handle any AppleScript code we type in. What we need AppleScript to do for us is refreshingly simple, and the language’s syntax is extremely human-readable:

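on run {input, parameters}
	-- Prefix the WorldCat ISBN resolver onto whatever text was selected
	set QueryURL to "http://www.worldcat.org/isbn/" & (input as text)
	-- Hand the finished URL to the next action in the workflow
	return QueryURL
end run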

All we’re doing here is constructing a variable to pass off to a web browser that consists of two parts: the WorldCat ISBN resolver, and whatever the Service was given when the user selected text and invoked it.

Next step: pass that full QueryURL off to Safari (or whatever the default browser is on the system). This step is embarrassingly easy: drag the Display Webpages block from the Internet folder in the Library:

automator-dispweb.png

In the words of Jeff Goldblum: there’s no step 3. Save your file as “WorldCatISBN” or similar:

automater-save.png

Automator will take care of installing the Service for you (behind the scenes, it’s putting it into your username/Library/Services folder).

Now it’s time to test it out. Navigate to a page on Amazon for a random book. Scroll down to the “Product Details” section, and drag your cursor over either the ISBN-10 or ISBN-13. (WorldCat will handle either form.) Right-click (or hold down Control and click) and select “WorldCatISBN”.

automator-contextmenu.png

(What’s especially nice about Services in 10.6 Snow Leopard is that they are contextual — the WorldCatISBN menu won’t show up if you’re selecting pixels in Photoshop, or an icon in the Finder. Of course, there’s nothing to prevent you from selecting an invalid ISBN and sending it to WorldCat, or any gibberish text, but you won’t cause any problems doing that either.)
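If you did want to screen out gibberish before it ever reaches WorldCat, the standard ISBN check digits make that easy. The Service above does no such checking; this is a separate sketch in Python, with function names invented for the example, using the usual ISBN-10 (weights 10 down to 1, modulo 11) and ISBN-13 (alternating weights 1 and 3, modulo 10) rules.

```python
DIGITS = "0123456789"


def clean_isbn(text):
    """Drop hyphens and whitespace, and upper-case a trailing 'x' check character."""
    return "".join(ch for ch in text if ch not in "- \t\r\n").upper()


def is_valid_isbn10(isbn):
    """Check the ISBN-10 rule: weighted sum (10 down to 1) divisible by 11."""
    if len(isbn) != 10 or any(ch not in DIGITS for ch in isbn[:9]):
        return False
    if isbn[9] not in DIGITS and isbn[9] != "X":
        return False
    values = [int(ch) for ch in isbn[:9]] + [10 if isbn[9] == "X" else int(isbn[9])]
    total = sum(weight * value for weight, value in zip(range(10, 0, -1), values))
    return total % 11 == 0


def is_valid_isbn13(isbn):
    """Check the ISBN-13 rule: alternating weights 1 and 3, sum divisible by 10."""
    if len(isbn) != 13 or any(ch not in DIGITS for ch in isbn):
        return False
    total = sum(int(ch) * (3 if index % 2 else 1) for index, ch in enumerate(isbn))
    return total % 10 == 0


def looks_like_isbn(text):
    """Return True if the selected text passes either check-digit test."""
    isbn = clean_isbn(text)
    return is_valid_isbn10(isbn) or is_valid_isbn13(isbn)
```

A passing check only says the number is well formed, not that any library actually holds the book.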

A new window will open with the results found from WorldCat’s library listings, geo-located to show the library nearest you with the book in their collection.

The end result? That programming book you thought you might have to pay $50 for may well be available for free on campus. And you can check if it is with one right-click.


Optional: Customizing your WorldCat URL

More and more libraries are using a facet of WorldCat called WorldCat Local to replace their own internal catalogs. This is true of my institution, the University of Washington, where we use http://uwashington.worldcat.org/. These URLs have a few advantages over using the normal site, including telling WorldCat that you presently have access to (or could log in to if necessary) databases and other subscription resources requiring authentication, and also setting the default scope of queries to buildings on campus and local consortia. So it may make sense to customize the WorldCat URL in the AppleScript above for your local institution. Everything after the worldcat.org behaves as normal, so I would use “uwashington.worldcat.org/isbn” in my code and get the same results from my ISBN queries.
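The same URL construction, with the host made swappable, might look like this in Python. It is an illustrative sketch rather than part of the Automator workflow: the function name is made up, and stripping hyphens and spaces from the selection is an extra assumption that the AppleScript version does not make.

```python
DEFAULT_HOST = "www.worldcat.org"


def worldcat_isbn_url(selected_text, host=DEFAULT_HOST):
    """Build an ISBN lookup URL on the given WorldCat host from selected text."""
    # Assumption: keep only letters and digits, so "978-0-306-..." becomes "9780306...".
    isbn = "".join(ch for ch in selected_text if ch.isalnum())
    return "http://" + host + "/isbn/" + isbn


if __name__ == "__main__":
    print(worldcat_isbn_url("978-0-306-40615-7"))
    print(worldcat_isbn_url("978-0-306-40615-7", host="uwashington.worldcat.org"))
```

Keeping the host as a parameter means the institutional variant is a one-word change rather than a second copy of the code.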

PowerBook 520


PowerBook 520 plastic parts

Restoring a circa-1994 PowerBook 520. This is the grayscale 25 MHz 68LC040 model, offering sixteen shades of gray on the screen.

PowerBook 520

PowerBook 520

The vain hope that a 16-year-old battery will hold a charge: the most amusing part of the whole process.

PowerBook 520

After western Maryland and DC I set off to Las Vegas, where the big annual convention of Photoshop users is taking place.

photoshopworld 2009

This is sort of hard to explain, but the theme for these Photoshop conferences has always been “NFL Football”, so all of the materials and publicity and the big opening keynote are styled like a professional football game. It’s hard to believe there’s that much overlap between art-school graduates and people who were QBs in high school, but there you go:

photoshopworld 2009

Most of what we were doing during the convention was sitting in darkened rooms looking at color correction techniques, which ironically aren’t very photogenic. So instead here are some pictures from a photo walk around the Strip:

camera

gold shoe

sunset

The highlight of the trip for me was winning a copy of Jeff Schewe’s new edition of Real World Image Sharpening with Camera Raw, which I got by being the first person to shout out Photoshop author Thomas Knoll’s home town (Ann Arbor, in case anybody attends next year.)

The Grasshopper

When I was a kid, I had a remote-controlled car called the “Frog,” made by a Japanese company called Tamiya. The company had its origins in scale modeling, and the Frog was actually a high-precision kit that you assembled out of dozens and dozens of small plastic and metal parts. Painting and decaling the model was just as important, and took just as much time, as putting the gears in the differential together, or assembling the oil-damped shock absorbers.

That Frog is long gone now, but Tamiya has been re-releasing some of their classic R/C kits from the 1980s. Pictured above is the Grasshopper, which was a more basic buggy than the Frog I had as a kid. Still, it was a fun kit to build this month, partly because it provided some context to the design decisions in the more-sophisticated successor model, the Frog I used to own.

One thing that’s changed in 23 years: I now have enough patience (if far less free time) to attempt to paint the optional driver figure. My guess is that most kids left these out in the 1980s, as they required extremely careful masking and painting. If you look carefully you’ll see that the figure’s eyes and pupils have to be painted in. As of yet, he doesn’t have any eyebrows, but a can of dark brown acrylic paint just arrived in the mail.

Front

Back in June, Epson sent me their largest inkjet printer that doesn’t come with its own furniture stand: the Stylus Pro 4880. I had won this monster inkjet device in a freak raffle accident up in Vancouver, during Epson’s Printer Academy training workshop, where a bunch of people interested in digital photography hung out at the new convention center and admired the nearby scenery:

Vancouver Convention Center

It ended up being delivered right before I left for Europe, so I had to wait till now to unpack it. This involved removing a whole wooden pallet from the base and dealing with a box so big it dwarfed my dining-room table:

Epson 4880 Box

There’s quite a bunch packed inside — the ink cartridges it ships with alone would cost $1,000 to replace!

In box, from above

Aside from lucky contest winners, most people who get this printer intend to share it with a larger workgroup on a local network:

100baseT

As the printer works on media up to 17” wide, most users will also use roll paper with the built-in cutter. I’m slightly nervous about a printer with a robot-operated razor blade, but we’ll see. The big thing at the back can handle rolls nearly too heavy for me to lift:

Top

The leather manual is quite a bit thicker than the one that explains the car I drive:

Leather-bound manual

All in all, couldn’t ask for a better toy to print out pictures from this summer on. Thanks, Epson!

The UW Library has announced that sound engineer Jim Anderson’s collection of recordings from the previous incarnation of the Crocodile Cafe will be available on August 11th for perusal in the Odegaard Library Media Center. No online streaming for now, due to copyright restrictions, but a great chance to hear one of the hundreds of tracks from bands both great and obscure (sometimes both) in-person at the library.

UW Press Release

Workshop

What good is an instrument if you have nothing to play it with? Jean-Claude, our savior, did the hard work of expertly carving the pernambuco hardwood into the right shape for a nyckelharpa bow, which is a smaller device than the violin bow:

Pernambuco brazilwood

In order to hold its shape correctly, the bow wood must be heated briefly in a hot (alcohol) flame:

Pernambuco brazilwood

Now both key elements, the bow and the "frog" which holds the bowhair in place, are ready:

Frog

A small wire will hold the frog in place, while allowing for adjustability:

Bow

And what would a bow be without hair? As with the animal bone earlier, we were grateful for the language barrier that allowed us to not ask very many questions about what had happened to the horses that these tails were previously connected to:

Workshop

If you've never stretched hair on a bow before, be glad. All the hairs have to be exactly at the same tension and position laterally. Be sure not to crush the expensive pernambuco wood in the vise, either. There's a reason I was too busy to take very many pictures during this task.

Bowmaking
