MOST OF WEB BEYOND SCOPE OF SEARCH SITES
or
SEARCHING THE WEB WON'T GET YOU FAR
By Ashley Dunn, Times Staff Writer
The Los Angeles Times
As appeared in The Palm Beach Post
Thursday, July 8, 1999
If searching the World Wide Web for that one nugget of information already seems like a bad trip into a quagmire of data, Internet researchers have a bit of bad news for you - the situation is only getting worse.
Even the most comprehensive search engine today is aware of no more than 16% of the estimated 800 million pages on the web, according to a study to be published today in the scientific journal Nature. Moreover, the gap between what is posted on the web and what is retrievable by the search engines is widening fast.
"The amount of information being indexed (by commonly used search engines) is increasing, but it's not increasing as fast as the amount of information that is being put on the web," said Steve Lawrence, a researcher at NEC Research Institute in Princeton, N.J., one of the study's authors.
The findings are important because they raise the specter that the internet may lead to a backward step in the distribution of knowledge at a time of technological revolution: The break-neck pace at which information is added to the Web may actually mean that more information is lost to easy view than made available.
The study also underscores a little-known feature of the internet. While many users believe that Web pages are automatically available to the search programs employed by such sites as Yahoo!, Excite, and AltaVista, the truth is that finding, identifying, and categorizing new Web pages requires a great expenditure of time, money and technology.
Lawrence and his co-author, fellow NEC researcher C. Lee Giles, found that most of the major search engines index less than 10 percent of the Web. Even by combining all the major search engines, only 42 percent of the Web has been indexed, they found.
The rest of the Web - trillions of bytes of data ranging from scientific papers to family photo albums - exists in a kind of black hole of information, impenetrable by Web surfers unless they have the exact address of a given Web site. Even the pages that do end up indexed take an average of six months to be discovered by the search engines, Lawrence and Giles found.
Search engines use computers called "spiders" that continuously surf the Web. They save each page they visit, then follow the links on the page to find other pages. When a user types in a word, the engine looks in its index to see which pages contain it. A page that's not listed in the index will not be found.
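The crawl-and-index process described above can be sketched in a few lines. This is a minimal illustration, not how any particular search engine works: the four-page `TOY_WEB`, its example URLs, and the function names are all hypothetical, and the "web" is an in-memory dictionary rather than real network requests.

```python
from collections import deque

# Hypothetical miniature "web": URL -> (page text, outgoing links).
TOY_WEB = {
    "a.example/home": ("thai food guide", ["a.example/menu", "b.example/news"]),
    "a.example/menu": ("thai curry menu", []),
    "b.example/news": ("baseball news", ["a.example/home"]),
    "c.example/orphan": ("thai baseball league", []),  # no page links here
}

def crawl(web, seed):
    """Save each page visited, then follow its links to find other pages."""
    saved = {}
    queue = deque([seed])
    while queue:
        url = queue.popleft()
        if url in saved or url not in web:
            continue  # already visited, or a dead link
        text, links = web[url]
        saved[url] = text
        queue.extend(links)
    return saved

def build_index(saved):
    """Build an index mapping each word to the set of pages containing it."""
    index = {}
    for url, text in saved.items():
        for word in text.split():
            index.setdefault(word, set()).add(url)
    return index

def search(index, word):
    """Look a word up in the index; pages not in the index cannot be found."""
    return sorted(index.get(word.lower(), set()))

saved_pages = crawl(TOY_WEB, "a.example/home")
index = build_index(saved_pages)
print(search(index, "thai"))
```

In this toy setup the page `c.example/orphan` contains the word "thai" but is never returned, because no other page links to it and the spider never reaches it.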
The spiders are more likely to find pages that have more links going to them from other pages. Lawrence said that this could make it hard for new sites to get into the search engine listings.
The pace of indexing marks a striking decline from that found in a similar study conducted by the same researchers just a year and a half ago.
At that time, they estimated the number of Web pages in the world at about 320 million. The most thorough search engine in the study, HotBot, covered about a third of all web pages. Combined, the six leading search engines they surveyed covered about 60 percent of the Web.
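A quick calculation shows how an index can grow in absolute terms while its share of the Web shrinks. The sketch below uses only the figures quoted in this article; treating "about a third" as exactly one third is an assumption, and the function name is illustrative.

```python
def pages_indexed(total_pages, coverage):
    """Approximate number of pages covered, given a coverage fraction."""
    return total_pages * coverage

# Earlier study: about 320 million pages on the Web.
earlier_best = pages_indexed(320e6, 1 / 3)    # best engine, "about a third" (assumed 1/3)
earlier_combined = pages_indexed(320e6, 0.60)  # six leading engines combined

# Current study: about 800 million pages on the Web.
later_best = pages_indexed(800e6, 0.16)        # best engine
later_combined = pages_indexed(800e6, 0.42)    # major engines combined

for label, value in [
    ("earlier best engine", earlier_best),
    ("current best engine", later_best),
    ("earlier combined", earlier_combined),
    ("current combined", later_combined),
]:
    print(f"{label}: about {value / 1e6:.0f} million pages")
```

On these numbers the best single index grew from roughly 107 million to 128 million pages, while the Web itself grew two and a half times over the same period.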
While Web surfers often complain about retrieving too much information from search engines, said Oren Etzioni, chief technology officer of the portal Go2net and a professor of computer science at the University of Washington, failing to capture the full scope of the Web would be to surrender one of the most powerful parts of the digital revolution - the ability to seek and share diverse information across the globe.
Etzioni said the mushrooming size of the Web's audience makes the gulf between what is on the Web and what is retrievable increasingly important.
"There is a real price to be paid if you are not comprehensive," he said. "There may be something that is important to only 1 percent of the people. Well, you're talking about 100,000 people."
Lawrence and Giles estimated the number of web pages by using a special program that searched systematically through 2,500 random Web servers - the computers that hold Web pages. They calculated the average number of pages on each server and extrapolated for the 2.8 million servers on the internet.
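The extrapolation step can be illustrated as follows. This is a sketch of the arithmetic only, not the researchers' actual program: the article does not report the per-server average, so the sample page counts below are made-up placeholders, and `estimate_total_pages` is a hypothetical name.

```python
def estimate_total_pages(sampled_page_counts, total_servers):
    """Average the pages per sampled server, then scale to all servers."""
    average = sum(sampled_page_counts) / len(sampled_page_counts)
    return average * total_servers

TOTAL_SERVERS = 2_800_000  # servers on the internet, per the article

# Placeholder sample of three servers; the study sampled 2,500.
toy_sample = [200, 300, 400]
toy_estimate = estimate_total_pages(toy_sample, TOTAL_SERVERS)
print(f"toy estimate: {toy_estimate:,.0f} pages")

# Working backward from the article's figures: the average number of
# pages per server implied by 800 million pages on 2.8 million servers.
implied_average = 800_000_000 / TOTAL_SERVERS
print(f"implied average: about {implied_average:.0f} pages per server")
```

Working backward, the article's totals imply an average of roughly 286 pages per server.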
By using 1,050 real search queries posed by employees of the NEC Research Institute, a research lab owned by the Japanese electronics company NEC, they were able to estimate the coverage of all the search engines, ranging from 16 percent for Northern Light - a relatively obscure search site that ranks 16th in popularity among similar sites - to 2.5 percent for Lycos, the fourth most popular search engine.
For search engine companies, the findings of the report were unsurprising.
Kris Carpenter, director of search products and services for Excite, the third most popular search engine, said her company purposely ignores a large part of the Web, not so much because of weak technology as because of a lack of consumer interest.
"Most consumers are overwhelmed with just the information that is out there," she said. "It's hard to fathom the hundreds of millions of pages. How do you get your head around that?"
The estimated 800 million pages contain more than 6 trillion characters. By comparison, the 532 miles of shelves in the Library of Congress contain an estimated 20 trillion characters.
Kevin Brown, director of marketing for Inktomi, whose search engine is used by the popular search sites HotBot, Snap and Yahoo, said that search companies have long been aware that they are indexing less and less of the Web. But he argued that users are seeking quality information, not merely quantity.
"There is a point of diminishing returns," he said. "If you want Thai food and there are 14,000 results, the question isn't how many returns you got, but what are the top 10."
Excite's Carpenter said the future of search engines lies not in bigger indexes but in more specialized ones, in which everything on a given subject, such as baseball, could be indexed and displayed to viewers.
"You may be covering a huge percentage of the Web, but you're presenting it in smaller slices," she said. "Lumping everything into one big, be-everything index would be incredibly overwhelming."