Probabilistic Query Expansion based on Web Log File
As we know, users leave footprints on the web server when they access a
web site. These footprints record important information about each
user's browsing, such as date, client IP, time taken, request,
protocol and so on. On the web, such information is generally gathered
automatically by web servers and collected in web
log files. Even though users are reluctant to provide
explicit relevance judgments, by analyzing the clicked-link
records in web log files we can still indirectly find relationships
between a user's query and the relevant documents.
In [Cui 2002], the authors propose a new method for query expansion based on web query logs. The central idea is to extract probabilistic correlations between query terms and document terms by analyzing query logs. These correlations are then used to select high-quality expansion terms for new queries.
First, the authors verified that there is indeed a large gap between the query space and the document space; that is, many terms in the document space are never or seldom used in users' queries. This gap dramatically decreases the similarity between a query vector and a document vector when the two are compared directly. Query sessions extracted from the web log files provide a possible way to bridge the gap between the query space and the document space. The underlying assumption is that the terms in a query are correlated with the terms in the documents that the user clicked on.
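The effect of the vocabulary gap on vector similarity can be illustrated with a small sketch. It assumes a plain cosine measure over raw term counts; the `cosine` helper and the toy query and documents are hypothetical and are not taken from [Cui 2002].

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy data (assumed): one query and two documents on the same topic.
query = Counter(["java", "download"])
doc_same_vocab = Counter(["java", "download", "java"])
doc_other_vocab = Counter(["jdk", "sun", "installer", "java"])

sim_same = cosine(query, doc_same_vocab)    # document reuses the query's words
sim_other = cosine(query, doc_other_vocab)  # document mostly uses other words
```

In this toy case the second document scores much lower only because its author chose different words, which is the kind of gap that expansion terms are meant to close.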
For every term in the new query, all correlated document terms are selected
based on the conditional probability obtained from the formula the authors
derived. Then, by combining the probabilities over all query terms, the
cohesion weight of a document term for the new query can be calculated.
Thus, for every query, a list of weighted candidate expansion terms can
be obtained, and the top-ranked terms are selected as expansion terms.
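The selection procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the session data, the `doc_term_weight` table, and the function names are hypothetical. The conditional probability is estimated here as the sum over clicked documents of P(term | document) times P(document | query term), and the cohesion weight is assumed to be the logarithm of the product of (probability + 1) over the query terms. The exact formulas are those given in [Cui 2002].

```python
import math
from collections import Counter, defaultdict

# Hypothetical query sessions: (query terms, clicked document id).
sessions = [
    (["java", "download"], "d1"),
    (["java"], "d1"),
    (["java", "island"], "d2"),
]

# Hypothetical normalized term weights, standing in for P(doc_term | D).
doc_term_weight = {
    "d1": {"jdk": 1.0, "sun": 0.5},
    "d2": {"indonesia": 1.0, "travel": 0.5},
}

def term_correlations(sessions, doc_term_weight):
    """Estimate P(doc_term | query_term) from click-through sessions (assumed form)."""
    clicks = defaultdict(Counter)  # query term -> counts of clicked documents
    for terms, doc in sessions:
        for t in set(terms):
            clicks[t][doc] += 1
    prob = defaultdict(dict)
    for q, docs in clicks.items():
        total = sum(docs.values())
        for doc, n in docs.items():
            # n / total approximates P(D | query_term)
            for d_term, w in doc_term_weight.get(doc, {}).items():
                prob[q][d_term] = prob[q].get(d_term, 0.0) + w * n / total
    return prob

def expansion_terms(query, prob, k=3):
    """Rank candidate terms by an assumed cohesion weight: ln(prod(P + 1))."""
    candidates = set()
    for q in query:
        candidates.update(prob.get(q, {}))
    scored = {
        d: sum(math.log(prob.get(q, {}).get(d, 0.0) + 1.0) for q in query)
        for d in candidates
    }
    # Highest weight first; ties broken alphabetically for determinism.
    return sorted(scored.items(), key=lambda kv: (-kv[1], kv[0]))[:k]

prob = term_correlations(sessions, doc_term_weight)
ranked = expansion_terms(["java", "download"], prob)
```

With this toy data, `ranked` places "jdk" first for the query "java download", because both query terms co-occur in sessions that led to document `d1`.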
The experimental results show that the log-based method achieves
substantial performance improvements over local context analysis.
Created by Lan Man
Last Modified: Nov 11, 2002