News your connection to The Boston Globe

Google to index works at Harvard, other major libraries

Google Inc. is expanding its territory from bytes to books.

The world's largest Internet search engine said yesterday it will soon begin a program to index books and bound periodicals in the libraries of Harvard University.

Google has struck similar deals with Oxford University, the libraries of the University of Michigan and Stanford University, and the New York Public Library. It's all designed to create a new kind of index that will let anybody quickly search the contents of millions of books.

Company spokeswoman Susan Wojcicki said the project is the fulfillment of a dream for founders Sergey Brin and Larry Page. "This is something the founders wanted to do before they even started Google," she said. "The mission of the company, from the day it started, was to organize the world's information and make it easily accessible."

But Google also hopes that its book search service will give it a major edge over rival search services, including an up-and-coming challenge from software titan Microsoft Corp. "Google has constantly over time always been increasing our search index," said Wojcicki. "Having a more comprehensive search engine . . . leads to, we believe, a better product." In turn, that means more visitors to Google's search service, which makes money by selling advertisements.

Wojcicki said the process would take several years. She refused to reveal how much the effort will cost, but said it will be money well spent. "We make the money back by offering a better search, a more comprehensive search," she said. For now, Google won't display ads alongside the search results for a particular book.

Charlene Li, a search engine analyst at Forrester Research in San Francisco, said the academic researchers who would most likely use the new service would usually not be interested in clicking on ads, and therefore Google would dilute the value of its ad impressions by displaying them on these pages.

But Li still figures the library service could be a moneymaker for Google. "If it's seen as a repository for all this great stuff it will attract more and more people," she said. And many of these people will use other Google services and click on the ads there.

Sidney Verba, director of the Harvard University library, doesn't mind if Google profits from the project. After all, Google will pick up the tab for digitizing the school's entire book collection. "It's large amounts of money," said Verba. "We could not afford it."

Verba said Google officials first proposed the idea two years ago. "My reaction, to say the least, was skeptical," he said. But Google came back a year later with a detailed plan that impressed Verba. The company wrote custom software for the job, and built its own scanning machinery, to handle delicate books without damaging them.

"It sounded in principle very good," said Verba, but "we wanted to do a pilot study first, to see if this would work." So Google and Harvard agreed to a test using 40,000 books. The trial should last about six months. If results are satisfactory, Harvard will likely agree to digitize its whole library. Some institutions in the program are less cautious than Harvard. Michigan and Stanford have agreed to have their entire collections digitized. The New York Public Library, however, has agreed to have only a limited number of books digitized. And Oxford will make available only books published before 1901.

Indexing books isn't new to Google, which already offers a service called Google Print, in partnership with various publishers. When a user types certain terms -- "Romeo and Juliet," for example -- Google displays a link to a relevant book -- in this case, a Cambridge University Press edition of the play.

But the Google library project will take this concept much further. The company must use electronic scanners to capture digital images of every page. These images must then be run through "optical character reader" software, which can recognize the letters and convert them into electronic text. Because the software is imperfect, human proofreaders must go over the results. Then these billions of pages must be indexed, word by word, and the results made available for searching on the Internet.
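The final step, indexing pages word by word, can be illustrated with a small inverted index. This is only a sketch of the general technique, not Google's actual software, whose details the company has not described; the function names, the page identifiers, and the sample text are all assumptions made for illustration.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def build_index(pages):
    """Build an inverted index: each word maps to the set of pages containing it.

    `pages` is a hypothetical dict mapping (book_title, page_number) to the
    proofread text of that page.
    """
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in tokenize(text):
            index[word].add(page_id)
    return index


def search(index, query):
    """Return the pages that contain every word in the query."""
    words = tokenize(query)
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


# Illustrative sample data, standing in for scanned and proofread pages.
pages = {
    ("Romeo and Juliet", 1): "But soft, what light through yonder window breaks?",
    ("Hamlet", 1): "To be, or not to be, that is the question.",
}
index = build_index(pages)
matches = search(index, "light window")
```

Looking up a word in such an index is a single dictionary access, which is why the expensive work is done once, at indexing time, rather than on every search.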

Some books, such as the plays of Shakespeare, are no longer protected by copyright law. In such cases, Google can display the entire text of a book. But in cases where copyright still applies, the new search service will present one- or two-sentence snippets of text that include the search terms. This will enable a user to quickly discover all the books that mention a given topic and see the context in which the terms are used. That way, a researcher can quickly decide whether a book is worth reading.
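The snippet idea can be sketched in a few lines: split a page into sentences and keep the first one or two that contain the search term. This is a hypothetical illustration under simple assumptions (sentences end in a period, question mark, or exclamation point), not a description of how Google's service selects its snippets; the function name and sample text are invented.

```python
import re


def snippets(text, term, max_snippets=2):
    """Return up to `max_snippets` sentences from `text` that contain `term`."""
    # Naive sentence split: break after ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # Match the term as a whole word, ignoring case.
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    return [s for s in sentences if pattern.search(s)][:max_snippets]


# Illustrative sample text, not taken from any real book.
page = (
    "The whale surfaced at dawn. The crew slept. "
    "Ahab watched the whale closely. Nothing else moved."
)
found = snippets(page, "whale")
```

Showing only the matching sentences gives a reader the context of each hit without reproducing the rest of a copyrighted page.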

Google will also make it easier to find a copy of the book. The service will feature a link to WorldCat, an online database that allows searches of thousands of university and public libraries worldwide. A user will be able to enter a zip code and locate the nearest libraries with copies of the book.

The Google program may steal some thunder from Microsoft Corp., which yesterday unveiled new search software that creates an index of the data on a user's own computer, as well as providing access to Internet information. Gary Price, news editor of the online industry publication Search Engine Watch, said that Microsoft could well counterattack by launching its own campaign to index academic library data. "There's plenty of other large university libraries out there," said Price, adding, "a company like Microsoft sure has the cash."

Hiawatha Bray can be reached at
