According to a fascinating story in this morning's Globe, researchers at Harvard and Google are applying computational tools to the complete text of 4 percent of the world's books, and some noteworthy patterns emerge. References to "influenza" spiked during global epidemics; "pizza" and "pasta" are now mentioned more often than "hamburger" and "steak."
[R]eferences to God have been dropping off since about 1830. People are becoming celebrities earlier in life now than in the past, but their fame is more fleeting as their names drop out of the lexicon. References to past years are dropping off more quickly as cultures shift their focus to the present. And censorship leads to discernible shifts in a person’s or event’s cultural footprint, as evident in tracking Tiananmen in Chinese books...
At their best, these sorts of word counts become a sort of zeitgeist-o-meter, letting scholars get the basic sense of what books are saying without the toil and tedium of actually reading them.
Of course, the general public already has access to similar, if less powerful, tools: websites that count the words in blocks of text, and Facebook apps that purport to summarize your year by counting up the words in your status updates. Enjoyable as it is to imagine a year dominated by, say, "success" or "Maui" or "windfall," the more prosaic aspects of our lives quickly emerge. A friend of mine discovered that his most-used word of 2010 was... an em-dash?! If nothing else, you can work out what's on people's minds. This morning I ran the last week of postings on The Angle through a word counter. Setting aside pronouns and filler words like "and" and "the," the most common words included "book" (24 times), "Obama" (15 times), "crimes" (11 times), and, um, "hate" (18 times).
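For the curious, a counter of this kind takes only a few lines to build. What follows is a rough sketch in Python, not the tool I actually used: the function name `top_words` and the short `STOP_WORDS` list are invented for illustration, and a real counter would filter a much longer list of filler words and handle curly apostrophes.

```python
import re
from collections import Counter

# Hypothetical, deliberately short list of filler words to ignore.
# Real word counters ship with far longer lists.
STOP_WORDS = {
    "a", "an", "and", "the", "of", "to", "in",
    "i", "you", "he", "she", "it", "we", "they",
}

def top_words(text, n=5, stop_words=STOP_WORDS):
    """Return the n most frequent words in text as (word, count) pairs."""
    # Lowercase everything, then pull out runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", text.lower())
    # Tally every word that is not on the stop list.
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "The book and the raft: I hate to leave the book, the river, and the raft."
print(top_words(sample, n=3))
```

Note that the tally knows nothing about context: a word counts the same whether it is praised, quoted, or mocked.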
The obvious objection here is that simple counts don't really capture what writers are saying with the words they use. I ran "The Adventures of Huckleberry Finn" through the same counter. Sure enough, a certain racial slur occurs more often (157 times in the singular alone) than even "river" (141 times) or "raft" (111 times), but those counts do a disservice to Mark Twain. For this and other reasons, the Harvard/Google research project may send a shiver up the spines of humanities scholars, who up to now have managed to resist the relentless tide of quantification that has washed over, say, economics and political science.
Then there's the question of whether the very existence of the Harvard/Google research will make people express themselves differently. Knowing that their words may some day be electronically tabulated, will people who hate, hate, hate, hate, hate, hate, hate, hate President Obama, the Obama administration, Obama's policies, Obama's record, and all the other things Obama says and Obama does keep repeating themselves for the historical record?