Science, 269: 1354-1356, 8 September 1995

INDEXING THE INTERNET
The World Wide Web makes it simple to retrieve information from cyberspace. Now researchers are devising systems for tracking down the information in the first place.

Harvest: Effective Use of Internet Information: Do-it-yourself indexing.

Two years ago, the riches of cyberspace started to pale on David Filo and Jerry Yang, electrical engineering students at Stanford University. To be sure, says Filo, they were finding "plenty of cool sites and stuff" on the World Wide Web, the web of linked Internet sites that invites users to wander from one to the next with the click of a mouse. But whenever they went looking for a specific topic, they soon got lost. "So we started this thing," Filo says, "that allowed us to quickly categorize sites as we came across them, and we put it out on the [World Wide Web]."

The result was called, for lack of a better name, "David and Jerry's Guide to the Web." As the service grew, however, Filo and Yang opted for an acronym. Filo says they considered "yet another something," and so they looked up words starting with "ya" and found Yahoo, which seemed to fit. "So we ended up calling it 'Yet Another Hierarchical Officious Oracle.'" Yahoo today indexes 60,000 Web sites that Filo and Yang consider noteworthy, sorted into 10,000 categories. That's just a fraction of the Web's tens of millions of documents and resources. Even so, Yahoo is considered one of the more complete guides to the Web.

And that sets the scale of the next challenge confronting the architects of the Internet: indexing its entire contents, so that a user seeking a specific piece of information, from box scores in yesterday's Beijing Times to the first treatise on rural electrical lighting, can quickly hunt it down. As Robert Wilensky, head of a digital library project at the University of California, Berkeley, puts it, the goal is to "invert the Internet," opening the way for users to "browse by content, navigate through concept space, rather than having to find things based on where someone put them up on their home page."

By that standard Yahoo and its closest competitor, an indexing system called Lycos, developed at Carnegie Mellon University, are just the beginning. Already on the horizon are systems for gathering tens of thousands of documents a day and indexing them with the aid of programs that can classify their contents precisely. Other approaches would rely on specific user communities to compile their own indexes, which could then be linked to form a global guide to the Web.

In trying to devise such indexes, Internet architects are coping with one consequence of their own success. The point-and-click ease of access brought by Web browsers like Mosaic has sparked a boom in usage and in the volume of available material, and made tracking down anything specific far more difficult. "The amazing thing about the Web," says Bruce Schatz, a research scientist at the National Center for Supercomputing Applications, where Mosaic was developed, "is why it's so popular if it doesn't do anything useful. All it does is let you surf. It certainly doesn't let you solve problems, and it doesn't search out information for you."

The spider stratagem

One of the first attempts to solve the problem relied on anyone who posted a new home page to list the service in a central index, the "Mother of All Bulletin Boards," as its developers at the University of Colorado, led by Oliver McBryan, called it. Users of the bulletin board could then search it for whatever subjects interested them. For the Mother of All Bulletin Boards to work, however, everyone posting documents on the Web had to know about it and use it correctly. And even then the result would be a list rather than a subject index. "It was a good notion in principle," says Paul Ginsparg, a Los Alamos National Laboratory physicist who created an electronic preprint archive. "But it didn't really solve the problem."

McBryan was also among the first to try another solution, one that didn't require users to take the initiative: send out a search-and-retrieval program, which he called the World Wide Web Worm, or WWWW. "It's the simplest thing you can do," says University of Colorado computer scientist Michael Schwartz: "Write a piece of software that reaches out across the network and retrieves as many Web pages as it can find and follows links in those pages to find other Web pages." At each page, the program records the address, known as the uniform resource locator (URL), and downloads part of the contents for indexing in a searchable database.
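The loop Schwartz describes can be sketched in a few lines of Python. This is only an illustration of the general idea, not the Worm's actual code: the "Web" here is a hypothetical in-memory table (TOY_WEB) standing in for real network fetches, and the function names, the 40-character excerpt, and the breadth-first visiting order are all assumptions.

```python
from collections import deque

# Hypothetical in-memory "Web": URL -> (page text, outgoing links).
# A real crawler would fetch pages over the network; this stand-in
# keeps the sketch runnable offline.
TOY_WEB = {
    "http://a.example/": ("Welcome to site A", ["http://b.example/", "http://c.example/"]),
    "http://b.example/": ("Site B: physics preprints", ["http://a.example/"]),
    "http://c.example/": ("Site C: box scores", ["http://d.example/"]),
    "http://d.example/": ("Site D: rural lighting", []),
}


def crawl(start_url, web, max_pages=100, excerpt_chars=40):
    """Visit pages breadth-first: record each URL with an excerpt, then follow its links."""
    index = {}                     # URL -> excerpt kept for the searchable database
    queue = deque([start_url])
    seen = {start_url}
    while queue and len(index) < max_pages:
        url = queue.popleft()
        if url not in web:         # dead link: nothing to retrieve
            continue
        text, links = web[url]
        index[url] = text[:excerpt_chars]
        for link in links:
            if link not in seen:   # never queue a page twice, so link cycles terminate
                seen.add(link)
                queue.append(link)
    return index


def search(index, word):
    """Return the URLs whose stored excerpt contains the word, ignoring case."""
    word = word.lower()
    return sorted(url for url, excerpt in index.items() if word in excerpt.lower())


if __name__ == "__main__":
    idx = crawl("http://a.example/", TOY_WEB)
    print(search(idx, "physics"))
```

The set of already-seen addresses is what keeps such a program from circling forever between pages that link to each other, and the page limit is one crude way to bound how much of the network a single run touches.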

Since then, these search-and-retrieval programs, now called Web crawlers or spiders, have proliferated (to the distress of some people running Web sites; see box on next page). When Filo and Yang found Yahoo's popularity growing, for instance, they quickly added a Web crawler to their service. And at Carnegie Mellon, Lycos was born as a simple spider program created by John Leavitt, to which his colleague Michael Mauldin added an indexing program in the spring of 1994. "When it would fetch a document," says Leavitt, "it would create an outline or table of contents of the document by stripping out all the headers, and then it took the first 20% or 20 lines, whichever was smaller, as an excerpt or abstract. It also took a group of 100 words, which were statistically the most salient for the document, as key words" for indexing it. Because Mauldin ran the program on his workstation at night, he and Leavitt named it after the Lycosa family of wolf spiders, which are night hunters.
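Leavitt's three steps (outline, excerpt, key words) lend themselves to a short Python sketch. It is a rough approximation under stated assumptions rather than the Lycos code: headers are taken to be HTML h1 through h6 tags, "statistically most salient" is replaced by plain word frequency after dropping a small made-up stop-word list, and the function name summarize is hypothetical.

```python
import re
from collections import Counter

# Assumed stop-word list; the article does not say how common words were handled.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "on", "that", "at"}


def summarize(html, max_keywords=100):
    """Build an outline, an excerpt, and a key-word list for one document."""
    # Outline: the text of every header tag (assumes HTML <h1>..<h6> headers).
    outline = [h.strip() for h in
               re.findall(r"<h[1-6][^>]*>(.*?)</h[1-6]>", html, flags=re.I | re.S)]

    # Strip the remaining tags and keep the non-empty lines of text.
    text = re.sub(r"<[^>]+>", " ", html)
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]

    # Excerpt: the first 20% of the lines or the first 20 lines, whichever is smaller
    # (at least one line, so very short documents still get an excerpt).
    excerpt_len = max(1, min(len(lines) // 5, 20))
    excerpt = lines[:excerpt_len]

    # Key words: raw frequency stands in for the unspecified salience statistic.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    keywords = [w for w, _ in counts.most_common(max_keywords)]

    return {"outline": outline, "excerpt": excerpt, "keywords": keywords}


if __name__ == "__main__":
    page = "<h1>Wolf spiders</h1>\nWolf spiders hunt at night.\n" + "Spiders hunt.\n" * 8
    print(summarize(page))
```

A real salience measure would presumably weigh a word's frequency in the document against its frequency across the whole collection; raw counts are used here only to keep the example self-contained.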

Lycos, which Leavitt and Mauldin have since transferred to a private company and licensed to Microsoft, now consists of a flock of spiders that have already indexed over a million documents and are adding new ones at the rate of 20,000 each day. In addition, it has gathered partial information about another 4 million Web documents referenced by the ones it indexes directly. Leavitt says Lycos now gets several million hits a week from users, who can search the database for key words and then, with a mouse click, go directly to the relevant documents.

Lycos's spiders may crawl too slowly, however, to keep up with a system unleashed this month by Berkeley's Eric Brewer. Brewer's system, which runs on four workstations networked together to form a parallel supercomputer, controls crawlers that can bring in and index over 100,000 documents a day. At the same time, it can serve several million search requests a day.

But the caches of documents amassed and indexed by Web crawlers aren't a full answer to the challenge of making the Internet searchable, says Schatz. The problem, he says, is that such automatic indexing "is so haphazard." For example, says Berkeley's Wilensky, "the word 'film' is ambiguous." If a human indexer comes across the word "film" in a document, he or she knows whether it refers to a movie, photographic film, or dirt residue, and will catalog the document accordingly. A computer indexing program has no such intuition, so it simply tags "film" as a key word and leaves it at that; as a result, searching for "film" in the current automated indexes will bring up documents on all possible meanings of the word.

Wilensky's group is trying to solve this problem through a technique known as lexical disambiguation. The algorithm builds a reference database from a statistical analysis of the contexts in which a word is found in a wide range of documents. It can then compare a new keyword and its context with those in the database to choose the most likely meaning. "You learn, for instance, that the movie meaning of 'film' tends to occur a lot in contexts where words like 'actor' appear," says Wilensky. "'Soiled covering' tends to appear in contexts where you hear about 'dirt' a lot. When you have those associations, you can make pretty good guesses about the right sense of the word." That classification procedure could eventually allow crawlers to build huge subject indexes that would be searchable with a precision and efficiency beyond anything available today.
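The flavor of this context-based guessing can be shown with a small Python sketch. It is not the Berkeley group's algorithm, which the article describes only in outline: the scoring here is a naive Bayes-style word count with add-one smoothing, chosen as a simple stand-in, and the sense labels, training sentences, and function names are invented for illustration.

```python
import math
from collections import Counter, defaultdict

# Invented training examples: (sense of "film", a context in which it occurred).
TRAINING = [
    ("movie", "the actor starred in the film directed by a famous director"),
    ("movie", "the film won an award for best actor"),
    ("coating", "a film of dirt covered the window"),
    ("coating", "wipe the greasy film and dirt off the lens"),
    ("photo", "load the film into the camera before taking a photograph"),
]


def build_profiles(labeled_contexts):
    """Count, for each sense, how often every context word appears with it."""
    profiles = defaultdict(Counter)
    for sense, context in labeled_contexts:
        profiles[sense].update(context.lower().split())
    return profiles


def guess_sense(profiles, context):
    """Pick the sense whose recorded contexts best match the new context."""
    words = context.lower().split()
    vocab_size = len({w for counts in profiles.values() for w in counts})
    best_sense, best_score = None, -math.inf
    for sense, counts in profiles.items():
        total = sum(counts.values())
        # Add-one smoothing keeps unseen words from zeroing out a sense.
        score = sum(math.log((counts[w] + 1) / (total + vocab_size)) for w in words)
        if score > best_score:
            best_sense, best_score = sense, score
    return best_sense


if __name__ == "__main__":
    profiles = build_profiles(TRAINING)
    print(guess_sense(profiles, "the actor praised the film"))
    print(guess_sense(profiles, "dirt left a film on the glass"))
```

With only five training sentences the guesses are fragile; the approach depends on the "wide range of documents" mentioned above to make the word associations reliable.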

The human factor

This sensitivity to context demands large amounts of computing time, and even then, the approach isn't likely to be capable of subtle cataloging judgments anytime soon. As a result, some computer scientists are looking for ways to harness human expertise to build high-quality Web indexes. Schatz, for instance, predicts the creation of community repositories maintained by particular disciplines, groups ranging from martial arts enthusiasts to computer scientists. "Once you have that kind of community," explains Colorado's Schwartz, "you can have somebody who knows the subject, who is responsible for making decisions on content." And in turn, a properly linked array of community indexes could serve as a loose-knit index covering all subjects.

For information scientists, the challenge is developing software that allows each community to choose and retrieve documents for a do-it-yourself index. Schwartz's Harvest Program at the University of Colorado is considered the most ambitious attempt so far. The Harvest software, he says, makes it "pretty easy" for anyone "to list a set of uniform resource locators and documents they want to have included in the index." The software then retrieves the data, extracts the content for indexing purposes, builds the index, and handles queries.

To make the program even more efficient, Schwartz and his colleagues have designed Harvest to be split into two distinct parts, a gatherer and a broker. The gatherer resides at a remote site, retrieving documents to be indexed, while the broker stays on the home machine and builds the index. "Imagine a scenario," says Schwartz, "where you're trying to index a lot of information at NASA, for instance. You can start distributing the process by putting gatherers on all the machines where the data are. Each gatherer can then extract data much more efficiently, because it doesn't have to go across the network to do it. ... Instead of a gatherer sitting on one machine reaching out, you can have gatherers on 100 different machines, each one boiling down the information locally and then sending it across [to a single broker]."
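The division of labor Schwartz describes might look something like the following Python sketch. It borrows only the two role names from the article; the class interfaces, the word-set summaries, and the site and file names are assumptions made for illustration and do not reflect Harvest's actual software or data formats.

```python
import re
from collections import defaultdict


class Gatherer:
    """Runs where the data lives and boils each local document down to a summary."""

    def __init__(self, site, documents):
        self.site = site
        self.documents = documents      # hypothetical local store: file name -> text

    def gather(self):
        # Only the compact summary (the set of distinct words) leaves the site.
        return [
            (f"{self.site}/{name}", set(re.findall(r"[a-z]+", text.lower())))
            for name, text in self.documents.items()
        ]


class Broker:
    """Stays on the home machine, merges summaries into one index, answers queries."""

    def __init__(self):
        self.index = defaultdict(set)   # word -> locations of documents containing it

    def collect(self, gatherer):
        for location, words in gatherer.gather():
            for word in words:
                self.index[word].add(location)

    def query(self, word):
        return sorted(self.index.get(word.lower(), set()))


if __name__ == "__main__":
    broker = Broker()
    broker.collect(Gatherer("ames", {"wind.txt": "Wind tunnel data"}))
    broker.collect(Gatherer("jpl", {"orbit.txt": "Orbit data"}))
    print(broker.query("data"))
```

In this toy version everything runs in one process; the point of the real split is that each gatherer reads its documents locally and ships only the condensed summary across the network to the broker.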

The project was started in late 1993, and Schwartz estimates that Harvest users scattered throughout the Web have put together almost 1000 independent indexes. To link them, Schwartz and his colleagues plan to create software that will allow each distinct index to have pointers leading to other Harvest indexes. And the Harvest researchers are also working on indexes that will support not just documents but actual scientific data, linked to programs that can interpret the data to produce meaningful information. "Right now the Internet is mostly used for data that humans look at," says Schwartz. "Eventually you want to build systems in which programs actually go along, collect data together, and do computations on it." Such indexes will do more than sort through what's on the Internet; they will make sense of it.

But in spite of the best efforts of Schwartz and other indexers, says Robert Kahn, one of the original architects of the Internet, who is now with the Corporation for National Research Initiatives in Reston, Virginia, there won't be any single best way to index the Internet. The solutions, he suggests, will be "technology dependent and sociology dependent," balancing such issues as whether the desired information is distributed throughout the Internet or is more concentrated and whether users care most about reliability, completeness, or speed.

"It's hard to generalize," he says. "It's like discussing transportation systems. You can ask me what I think will be the best transportation system in the future, and I'll say it depends on what you want to do and where you want to go."

-Gary Taubes

 

   