In billions of words, digital allies find tale

Google, Harvard join in book study

By Carolyn Y. Johnson
Globe Staff / December 17, 2010

Mining the complete text of 4 percent of the world’s books, Harvard University and Google researchers used a powerful new tool unveiled yesterday to glean surprising insights into language, culture, and history.

Books already tell stories, but when their words are combined and analyzed with computational tools, they tell bigger tales. By studying billions of words that appeared in books published over the last 200 years, the researchers found that references to God have been dropping off since about 1830. People are becoming celebrities earlier in life now than in the past, but their fame is more fleeting as their names drop out of the lexicon. References to past years are dropping off more quickly as cultures shift their focus to the present. And censorship leads to discernible shifts in a person’s or event’s cultural footprint, as evident in tracking Tiananmen in Chinese books, or the Jewish artist Marc Chagall in German books from the Nazi era.

The findings, fruit of the ambitious Google project to digitize every book in existence, were reported yesterday in the journal Science. They are a tantalizing first glimpse at what researchers think may become a transformative new tool for humanities researchers.

Google is publicly launching the tool, Google Books Ngram Viewer, to allow scholars or the simply curious to ask questions, such as when references to “The Great War,” which peaked between 1915 and 1941, were replaced by “World War I.” The tool allows people to look up words or phrases of one to five words and see how often they occur over time, measured as the number of times a word or phrase appears in a given year divided by the total number of words in books from that year.
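The normalization described above can be sketched in a few lines of Python. This is only an illustration of the ratio, not Google's implementation: the function name `relative_frequency` and the counts below are made up for the example and are not real Google Books figures.

```python
from typing import Dict


def relative_frequency(
    phrase_counts: Dict[int, int], total_counts: Dict[int, int]
) -> Dict[int, float]:
    """For each year, divide the phrase's occurrences by all words counted that year."""
    return {
        year: phrase_counts.get(year, 0) / total
        for year, total in total_counts.items()
        if total > 0  # skip years with no text to avoid dividing by zero
    }


# Hypothetical counts for illustration only (not real data).
great_war_counts = {1910: 5, 1918: 400, 1950: 60}
all_word_counts = {1910: 1_000_000, 1918: 2_000_000, 1950: 3_000_000}

freq = relative_frequency(great_war_counts, all_word_counts)
for year in sorted(freq):
    print(year, freq[year])
```

Dividing by the yearly total matters because far more books are published in recent years, so raw counts alone would make almost every word appear to rise over time.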

“This is really the largest data release in the history of the humanities, a fantastic wealth of data,” said Jean-Baptiste Michel, a postdoctoral researcher in the program for evolutionary dynamics at Harvard. “In our paper we present our initial investigation: we explore this new terrain, we dig a little bit. It is a very cool feeling to have, but what people will be able to do will far exceed everything we have done.”

In this analysis, the researchers used the data set to look at changes in English vocabulary and grammar, finding that about half the words that appear in books are “dark matter” that does not appear in dictionaries: words that may be compound constructions or proper nouns, or are simply undocumented, like “aridification” or “slenthem.” English, they found, is growing by about 8,500 words a year.

They have also looked at collective memory, and forgetting. Authors are letting the past go more quickly. The year “1880” had dropped to half its maximum frequency of references 32 years later, in books written in 1912. But it took only a decade for “1973” to decline to half its prominence.
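One way to picture this "half-life" measurement is the short Python sketch below. It is an assumption about how such a figure could be computed, not the researchers' actual method: it finds the year a term peaks and counts the years until its frequency first falls to half that peak. The function name and the toy series are hypothetical, with values chosen only so the result matches the 32-year figure quoted for “1880.”

```python
from typing import Dict, Optional


def years_to_half_peak(series: Dict[int, float]) -> Optional[int]:
    """Years from the peak until frequency first falls to half the peak value."""
    peak_year = max(series, key=series.get)
    half_peak = series[peak_year] / 2
    for year in sorted(y for y in series if y > peak_year):
        if series[year] <= half_peak:
            return year - peak_year
    return None  # the series never decays to half its peak


# Hypothetical frequencies of the string "1880" by publication year (illustrative only).
mentions_of_1880 = {1880: 10.0, 1890: 8.0, 1900: 6.5, 1912: 5.0, 1920: 4.0}

print(years_to_half_peak(mentions_of_1880))
```

A shorter result for a later year, such as the decade reported for “1973,” would indicate that references to that year faded faster.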

Researchers found that use of the word “women” has been rising for 200 years, and began to eclipse mentions of “men” around the mid-1980s. And the frequencies of “pizza,” “pasta,” and “ice cream” have soared since the 1970s: food for thought, given that the childhood obesity epidemic started at about the same time.

The study, led by Michel and senior author Erez Lieberman Aiden, who runs the multidisciplinary Laboratory-at-Large at Harvard’s engineering school, drew on a wide array of collaborators, not only from Harvard and Google, but also from Encyclopaedia Britannica and the American Heritage Dictionary.

Michel and Lieberman Aiden had worked together on a 2007 study in the journal Nature that tracked the evolution of language through a much more painstaking process — hunting down obscure old books and reading them to discover the linguistic heritage of modern verbs. They began to notice the growth of Google Books, the initiative that has now scanned 15 million volumes — more than 10 percent of all published books, according to Jon Orwant, engineering manager of Google Books, which has a large presence in Cambridge.

Seeing that their research techniques would soon be antiquated, Michel and Lieberman Aiden approached Google and began a collaboration.

“As we’ve amassed more and more information that isn’t available elsewhere, I started to realize we’re sitting on these troves of data that are very useful,” Orwant said. The value, he said, is not just for Web users searching for answers to specific questions, but for scholars, too.

The efforts are part of a much broader push to bring the power of analyzing large data sets to the increasingly digitized world of humanities research.

“If you look at what humanities scholars have studied for hundreds of years, they tend to study things like books, music. The difference today is those are digital and you have the potential of searching and ‘reading’ much larger amounts of this information than you ever could before,” said Brett Bobley, director of the office of digital humanities at the National Endowment for the Humanities.

Such tools would not supplant humanities researchers’ current methods, Bobley said. But they could supplement that work and broaden the scope of research questions, which are limited by how much people can read and remember.

Researchers calculated, for example, that just reading the books from the year 2000 in the two-century data set used in the Science paper would take 80 years — without interruption for meals or sleep.

Franco Moretti, co-director at the Stanford Literary Lab, praised the methods and the findings of the study. Going forward, digital humanities researchers have increasingly powerful tools, but the challenge will be interpretation — finding links between quantity and meaning.

“Just as it makes an enormous difference [for paleontologists] whether a bone fragment belongs to a creature’s tail or neck, so it makes a great difference whether the word ‘God’ . . . occurs as a self-explaining given, in a discussion of principle, or as a banal interjection; whether, in a play, it is used more often in soliloquies, love duets, or public scenes; and so on,” Moretti wrote in an e-mail.

Carolyn Y. Johnson can be reached at