Probabilistic Query Expansion based on Web Log File

Users leave footprints on the web server whenever they access a web site. These footprints record important information about each user's browsing, such as the date, client IP, time taken, request, and protocol. On the web, such information is generally gathered automatically by web servers and collected in web log files. Even though users are reluctant to provide explicit relevance judgments, by analyzing the clicked-link records in web log files we can still indirectly find relationships between a user's query and the relevant documents.

In [Cui 2002] the authors propose a new method for query expansion based on the web query logs. The central idea is to extract probabilistic correlations between query terms and document terms by analyzing query logs. These correlations are then used to select high-quality expansion terms for new queries.

First, the authors verified that there is indeed a large gap between the query space and the document space: many terms in the document space are never or seldom used in users' queries. This gap dramatically decreases the similarity between a query vector and a document vector when the two are compared directly. Query sessions extracted from the web log files provide a possible way to bridge the gap. The underlying assumption is that the terms in a query are correlated with the terms in the documents that the user clicked on.
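To make the gap concrete, the following minimal Python sketch compares a query and two documents as term-frequency vectors using cosine similarity. The query, the documents, and the function name cosine are invented for illustration and do not come from [Cui 2002]; the point is only that a topically related document can score zero when it shares no literal terms with the query.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical data, invented for this example.
query = Counter("cheap flights".split())
doc_related = Counter("airline ticket fares discount airfare booking".split())
doc_overlap = Counter("airline ticket fares discount airfare booking flights".split())

# Related in topic, but no shared terms: similarity is zero.
print(cosine(query, doc_related))
# A single shared term yields only a small similarity.
print(cosine(query, doc_overlap))
```

Expansion terms drawn from clicked documents are meant to add exactly the kind of vocabulary ("airfare", "fares") that the original query lacks.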

For every term in a new query, all correlated document terms are selected based on the conditional probability obtained from the formula the authors derived. Then, by combining the probabilities over all query terms, the cohesion weight of a document term for the new query can be calculated. Thus, for every query, a list of weighted candidate expansion terms is obtained, and the top-ranked terms are selected as expansion terms. The experimental results show that the log-based method achieves substantial performance improvements over local context analysis.
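The selection procedure above can be sketched in Python as follows. This page does not reproduce the paper's formula, so the estimates here are assumptions: the correlation between a query term and a document term is taken as the sum, over clicked documents, of (share of sessions containing the query term that clicked the document) times (relative frequency of the document term in that document), and the cohesion weight is taken as the sum of log(1 + correlation) over the query terms. The function names, the session format, and the toy data are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def term_correlations(sessions, documents):
    """Estimate a correlation score for (query term, document term) pairs.

    sessions:  list of (query_terms, clicked_doc_ids) pairs (assumed format).
    documents: dict mapping doc_id to a list of document terms.
    """
    sessions_with_term = Counter()      # sessions containing each query term
    clicks = defaultdict(Counter)       # query term -> clicked doc -> sessions
    for query_terms, clicked in sessions:
        for q in set(query_terms):
            sessions_with_term[q] += 1
            for doc_id in set(clicked):
                clicks[q][doc_id] += 1

    prob = defaultdict(lambda: defaultdict(float))
    for q, docs in clicks.items():
        for doc_id, n in docs.items():
            p_doc_given_q = n / sessions_with_term[q]
            tf = Counter(documents[doc_id])
            total = sum(tf.values())
            for term, freq in tf.items():
                # Assumption: P(term | doc) is the plain relative frequency.
                prob[q][term] += p_doc_given_q * freq / total
    return prob


def expansion_terms(query_terms, prob, top_k=3):
    """Rank candidate expansion terms by an assumed cohesion weight."""
    weights = Counter()
    for q in set(query_terms):
        for term, p in prob.get(q, {}).items():
            # log(1 + p) summed over query terms equals log of the product.
            weights[term] += math.log1p(p)
    for q in query_terms:
        weights.pop(q, None)            # do not suggest terms already present
    return weights.most_common(top_k)


# Toy data, invented for this example.
documents = {
    "d1": ["windows", "microsoft", "os"],
    "d2": ["java", "sun", "language"],
}
sessions = [
    (["windows"], ["d1"]),
    (["windows", "java"], ["d1"]),
    (["java"], ["d2"]),
]

prob = term_correlations(sessions, documents)
print(expansion_terms(["windows"], prob))
```

When a session contains several clicks, the per-document shares for one query term can sum to more than 1, so the values are better read as correlation scores than as strict probabilities.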

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Created by Lan Man

Last Modified: Nov 11, 2002