Business your connection to The Boston Globe

Google to index works at Harvard, other major libraries

"It sounded in principle very good," said Verba, but "we wanted to do a pilot study first, to see if this would work." So Google and Harvard agreed to a test using 40,000 books. The trial should last about six months. If results are satisfactory, Harvard will likely agree to digitize its whole library. Some institutions in the program are less cautious than Harvard. Michigan and Stanford have agreed to have their entire collections digitized. The New York Public Library, however, has agreed to have only a limited number of books digitized. And Oxford will make available only books published before 1901.

Indexing books isn't new to Google, which already offers a service called Google Print, in partnership with various publishers. When a user types certain terms -- "Romeo and Juliet," for example -- Google displays a link to a relevant book -- in this case, a Cambridge University Press edition of the play.

But the Google library project will take this concept much further. The company must use electronic scanners to capture digital images of every page. These images must then be run through "optical character reader" software, which can recognize the letters and convert them into electronic text. Because the software is imperfect, human proofreaders must go over the results. Then these billions of pages must be indexed, word by word, and the results made available for searching on the Internet.
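To make the word-by-word indexing step concrete, here is a minimal sketch in Python of how text that has already been scanned and converted can be turned into a searchable index. It is an illustration only, not Google's software: the function names `build_index` and `search`, the sample pages, and the simple rule for splitting text into words are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Hypothetical helper: build a word-by-word index over converted page text.
def build_index(pages):
    """Map each lowercased word to the set of (book, page) places it occurs."""
    index = defaultdict(set)
    for (book, page_no), text in pages.items():
        for word in re.findall(r"[a-z0-9']+", text.lower()):
            index[word].add((book, page_no))
    return index

# Hypothetical helper: find pages that contain every word in the query.
def search(index, query):
    words = re.findall(r"[a-z0-9']+", query.lower())
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

# Sample data standing in for proofread scanner output (assumed).
pages = {
    ("Romeo and Juliet", 1): "Two households, both alike in dignity",
    ("Romeo and Juliet", 2): "In fair Verona, where we lay our scene",
    ("Hamlet", 1): "Who's there? Nay, answer me",
}
index = build_index(pages)
print(search(index, "fair verona"))
```

Each word points back to the pages on which it appears, so a query is answered by looking up its words and keeping only the pages common to all of them, rather than by rereading every book.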

Some books, such as the plays of Shakespeare, are no longer protected by copyright law. In such cases, Google can display the entire text of a book. But in cases where copyright still applies, the new search service will present one- or two-sentence snippets of text that include the search terms. This will enable a user to quickly discover all the books that mention a given topic and see the context in which the terms are used. That way, a researcher can quickly decide whether a book is worth reading.
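The snippet idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Google's service works: the function name `snippets`, the two-sentence limit as a parameter, and the rule that a sentence qualifies if it contains any one of the search terms are all choices made for this example.

```python
import re

# Hypothetical helper: return up to max_sentences sentences mentioning a term.
def snippets(text, terms, max_sentences=2):
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    wanted = [t.lower() for t in terms]
    hits = []
    for sentence in sentences:
        words = set(re.findall(r"[a-z0-9']+", sentence.lower()))
        if any(term in words for term in wanted):
            hits.append(sentence)
        if len(hits) == max_sentences:
            break
    return hits

# Sample passage (invented for the example).
passage = (
    "The whale surfaced at dawn. The crew watched in silence. "
    "Ahab cursed the whale. Night fell."
)
print(snippets(passage, ["whale"]))
```

Only the matching sentences are returned, so a reader sees the context around the search terms without being shown the rest of the page.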

Google will also make it easier to find a copy of the book. The service will feature a link to WorldCat, an online database that allows searches of thousands of university and public libraries worldwide. A user will be able to enter a ZIP code and locate the nearest libraries with copies of the book.

The Google program may steal some thunder from Microsoft Corp., which yesterday unveiled new search software that creates an index of the data on a user's own computer, as well as providing access to Internet information. Gary Price, news editor of the online industry publication Search Engine Watch, said that Microsoft could well counterattack by launching its own campaign to index academic library data. "There's plenty of other large university libraries out there," said Price, adding, "a company like Microsoft sure has the cash."

Hiawatha Bray can be reached at 
