Monday, November 29, 2010

The Deep Web: Surfacing Hidden Value

Engines of the future: Into the deep web     

Search engines see only about one in 3,000 of the accessible pages out there – but a new approach could open up vast new data mines

With billions of web pages in their indexes, you might imagine that if something is online, search engines will find it for you. In reality, the vast majority of web pages are effectively invisible to them.

Some of this "deep web" contains isolated pages with few, if any, hyperlinks, making them difficult to index. Much is stuff you wouldn't want to see anyway: web pages detailing old flight reservations, for example, or out-of-date product reviews on Amazon. However, a large proportion are believed to contain openly accessible databases of everything from information on used cars to the prices of airline seats.

Even ignoring password-protected and other private sites, the deep web is estimated to be at least 500 times the size of the "surface" web visible to search engines. And by some estimates only 16 per cent of the surface web has been indexed by search engines - that is just 0.03 per cent of the whole (see "Lost in cyberspace").
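
The 0.03 per cent figure follows from the two estimates just quoted. A minimal sketch of the arithmetic, assuming the deep web is exactly 500 times the surface web (the article says "at least") and that 16 per cent of the surface is indexed; the function and constant names are illustrative only:

```python
# Inputs taken from the estimates quoted in the text.
DEEP_TO_SURFACE_RATIO = 500   # deep web assumed to be exactly 500x the surface web
SURFACE_INDEXED = 0.16        # share of the surface web that search engines index


def indexed_share_of_whole(deep_ratio, surface_indexed):
    """Return the indexed fraction of the whole web (surface plus deep)."""
    surface = 1.0                        # measure everything in "surface webs"
    whole = surface + deep_ratio * surface
    return surface_indexed * surface / whole


share = indexed_share_of_whole(DEEP_TO_SURFACE_RATIO, SURFACE_INDEXED)
print(f"{share:.2%} of the whole web, or about 1 page in {round(1 / share):,}")
# prints "0.03% of the whole web, or about 1 page in 3,131"
```

Since 500 is a lower bound on the ratio, the true share could be smaller still.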

Juliana Freire at the University of Utah in Salt Lake City thinks that even this figure is over-optimistic. She is developing Deep Peep, a specialist search engine that trawls so-called "form-fronted" databases. These are sites with interfaces in which search terms must be typed in order to call up the information stored in the database. Since it isn't practical to ask each of these sites individually for an index of their contents, the challenge is to get this information automatically.

To do this, Deep Peep uses "iterative probing". First, it analyses the form's wording for clues about the nature of the database. For example, the words "assignee" or "invention" are likely to indicate a patent database. Deep Peep uses these clues to fill in the forms, extracts new keywords from the results, and then repeats the process. Tests show Deep Peep can retrieve up to 90 per cent of the information hidden in form-fronted sites.
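
The probing loop described above can be sketched in a few lines. This is not Deep Peep's code: it is a toy illustration in which the "site" is an in-memory list that can only be reached through a search function, and every name here (HIDDEN_RECORDS, DOMAIN_CLUES, DOMAIN_SEEDS, submit_form, iterative_probe) is a hypothetical stand-in.

```python
import re
from collections import deque

# Hypothetical form-fronted database: records are reachable only by
# submitting a search term, never by following links.
HIDDEN_RECORDS = [
    "method for folding a bicycle frame",
    "bicycle brake sensor apparatus",
    "sensor housing for underwater cameras",
    "underwater drone propeller",
    "knitting needle gauge",   # shares no words with the others
]

# Assumed lookup tables: words on the form that hint at the database's
# subject, and starter search terms for each subject.
DOMAIN_CLUES = {"patent": {"assignee", "invention"}}
DOMAIN_SEEDS = {"patent": ["method", "apparatus"]}
STOPWORDS = {"a", "for", "of", "the"}


def tokenize(text):
    """Lower-case a string and return its set of non-stopword words."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


def submit_form(term):
    """Simulate filling in the search form with a single term."""
    return [r for r in HIDDEN_RECORDS if term in tokenize(r)]


def seed_terms(form_labels):
    """Step 1: read the form's wording for clues about the database."""
    words = tokenize(" ".join(form_labels))
    for domain, clues in DOMAIN_CLUES.items():
        if words & clues:
            return list(DOMAIN_SEEDS[domain])
    return sorted(words)   # no clue matched: fall back to the labels themselves


def iterative_probe(form_labels, max_queries=50):
    """Steps 2 and 3: query, harvest new keywords from results, repeat."""
    queue = deque(seed_terms(form_labels))
    tried, found = set(), set()
    while queue and len(tried) < max_queries:
        term = queue.popleft()
        if term in tried:
            continue
        tried.add(term)
        for record in submit_form(term):
            if record not in found:
                found.add(record)
                # Words in a newly seen record become future search terms.
                queue.extend(sorted(tokenize(record) - tried))
    return found


found = iterative_probe(["Assignee", "Invention title"])
print(f"retrieved {len(found)} of {len(HIDDEN_RECORDS)} records")
# prints "retrieved 4 of 5 records"
```

The last record is never retrieved because no keyword harvested from the other results leads to it, which is one reason a probing crawler of this kind can recover much, but not necessarily all, of a site's contents.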

Mainstream search engines use similar techniques, but the deep web is likely to be growing as fast as its more visible sibling, so even the most powerful search engine will struggle to map more than a fraction of its depths.

Your search results are likely to remain just a glimpse of a small drop in a very large ocean.

White paper: The Deep Web: Surfacing Hidden Value

