Abstract

Web Page Clustering using Heuristic Search in the Web Graph

Ron Bekkerman, Shlomo Zilberstein, James Allan.

Effective representation of Web search results remains an open problem in the Information Retrieval community. For ambiguous queries, a traditional approach is to organize search results into groups (clusters), one for each meaning of the query. These groups are usually constructed according to the topical similarity of the retrieved documents, but it is possible for documents to be totally dissimilar and still correspond to the same meaning of the query. To overcome this problem, we exploit the thematic locality of the Web---relevant Web pages are often located close to each other in the Web graph of hyperlinks. We estimate the level of relevance between each pair of retrieved pages by the length of a path between them. The path is constructed using multi-agent beam search: each agent starts with one Web page and attempts to meet as many other agents as possible with some bounded resources. We test the system on two types of queries: ambiguous English words and people names. The Web appears to be tightly connected; about 70\% of the agents meet with each other after only three iterations of exhaustive breadth-first search. However, when heuristics are applied, the search becomes more focused and the obtained results are substantially more accurate. Combined with a content-driven Web page clustering technique, our heuristic search system significantly improves the clustering results.