Seminario Interdipartimentale di Algoritmica
 

Monday, November 30th, 2009, 12:00 noon
Search engines: query logs, query-flow graphs and query-reformulation classification
Paolo Boldi, Computer Science Department, University of Milano

DI - Department of Computer Science, Via Salaria 113
Seminar Room, third floor

Abstract

Query logs record the queries and the actions of the users of search engines, and as such they contain valuable information about the interests, the preferences, and the behavior of the users, as well as their implicit feedback to search-engine results. Mining the wealth of information available in the query logs has many important applications, including query-log analysis, user profiling and personalization, advertising, query recommendation, and more.

We introduce the query-flow graph, a graph representation of the interesting knowledge about latent querying behavior. Intuitively, in the query-flow graph a directed edge from query q_i to query q_j means that the two queries are likely to be part of the same "search mission". Any path over the query-flow graph may be seen as a searching behavior, whose likelihood is given by the strength of the edges along the path.
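
As a rough illustration of the idea, the sketch below builds a toy query-flow graph from a handful of made-up sessions. It rests on two assumptions that the abstract does not state: the weight of an edge (q_i, q_j) is taken to be the fraction of times q_j directly followed q_i, and the likelihood of a path is taken to be the product of its edge weights. The function names and the sample queries are hypothetical.

```python
from collections import Counter, defaultdict
from math import prod


def build_query_flow_graph(sessions):
    """Return {q_i: {q_j: weight}} from lists of consecutive queries.

    Assumption: weight(q_i, q_j) is the relative frequency with which
    q_j directly follows q_i. Other weighting schemes are possible.
    """
    pair_counts = Counter()
    out_counts = Counter()
    for session in sessions:
        for q_i, q_j in zip(session, session[1:]):
            if q_i != q_j:  # skip immediate repeats of the same query
                pair_counts[(q_i, q_j)] += 1
                out_counts[q_i] += 1
    graph = defaultdict(dict)
    for (q_i, q_j), count in pair_counts.items():
        graph[q_i][q_j] = count / out_counts[q_i]
    return dict(graph)


def path_likelihood(graph, path):
    """Assumption: a path's likelihood is the product of its edge weights.

    A missing edge contributes weight 0.0.
    """
    return prod(
        graph.get(q_i, {}).get(q_j, 0.0) for q_i, q_j in zip(path, path[1:])
    )


# Made-up sessions, for illustration only.
sessions = [
    ["rome hotels", "cheap rome hotels", "rome hostels"],
    ["rome hotels", "cheap rome hotels"],
    ["rome hotels", "rome flights"],
]
graph = build_query_flow_graph(sessions)
likelihood = path_likelihood(
    graph, ["rome hotels", "cheap rome hotels", "rome hostels"]
)
```

Here "rome hotels" is followed by "cheap rome hotels" in two of its three outgoing transitions, so that edge gets weight 2/3, and the three-query path inherits the same likelihood because its second edge has weight 1.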

The query-flow graph is an outcome of query-log mining and, at the same time, a useful tool for it. Using this approach, we build a real-world query-flow graph from a large-scale query log and we demonstrate its utility in concrete applications, namely finding logical sessions and query recommendation.
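
The two applications can be sketched on a small hand-written graph. This is only one plausible reading, not the method presented in the talk: recommendation is approximated by ranking the out-neighbors of a query by edge weight, and logical sessions are approximated by cutting a user's query stream wherever the edge between consecutive queries is weaker than a threshold. The function names, the threshold value, and the graph itself are all hypothetical.

```python
def recommend(graph, query, k=3):
    """Return up to k successors of `query`, strongest edge first.

    Ties are broken alphabetically so the result is deterministic.
    """
    successors = graph.get(query, {})
    return sorted(successors, key=lambda q: (-successors[q], q))[:k]


def split_into_missions(graph, queries, threshold=0.2):
    """Cut a query stream where the edge weight drops below `threshold`.

    Simplification: only consecutive queries are compared, so interleaved
    missions are not reassembled.
    """
    if not queries:
        return []
    missions = [[queries[0]]]
    for prev, cur in zip(queries, queries[1:]):
        if graph.get(prev, {}).get(cur, 0.0) >= threshold:
            missions[-1].append(cur)
        else:
            missions.append([cur])
    return missions


# Hand-written toy graph: {q_i: {q_j: weight}}; the weights are invented.
graph = {
    "rome hotels": {
        "cheap rome hotels": 0.6,
        "rome flights": 0.3,
        "rome weather": 0.1,
    },
    "cheap rome hotels": {"rome hostels": 0.9},
}
top = recommend(graph, "rome hotels", k=2)
missions = split_into_missions(
    graph,
    ["rome hotels", "cheap rome hotels", "python tutorial", "rome hostels"],
)
```

With these invented weights, the top two recommendations for "rome hotels" are "cheap rome hotels" and "rome flights", and the stream is cut into three pieces because neither transition around "python tutorial" has an edge in the graph.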

We further build an accurate model for classifying user query reformulations into broad classes (generalization, specialization, error correction or parallel move), achieving 92% accuracy. We apply the model to automatically label two large query logs, creating annotated query-flow graphs. We study the resulting reformulation patterns, finding results consistent with previous studies done on smaller manually annotated datasets, and discovering new interesting patterns, including connections between reformulation types and topical categories.
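
To make the four reformulation classes concrete, here is a deliberately naive rule-based labeler. It is not the model described above and makes no claim about its accuracy: it assumes that dropping terms signals a generalization, adding terms a specialization, and a high character-level similarity an error correction, with everything else treated as a parallel move. The function name, the 0.8 similarity threshold, and the extra "unchanged" label are hypothetical.

```python
from difflib import SequenceMatcher


def classify_reformulation(q1, q2, typo_ratio=0.8):
    """Label the move from query q1 to query q2 with simple heuristics."""
    a, b = q1.lower(), q2.lower()
    t1, t2 = set(a.split()), set(b.split())
    if t1 == t2:
        # Same terms: not one of the four classes named in the text.
        return "unchanged"
    if t2 < t1:
        return "generalization"  # terms were removed
    if t1 < t2:
        return "specialization"  # terms were added
    if SequenceMatcher(None, a, b).ratio() >= typo_ratio:
        return "error correction"  # strings differ by a few characters
    return "parallel move"


# Made-up query pairs, for illustration only.
labels = [
    classify_reformulation("cheap rome hotels", "rome hotels"),
    classify_reformulation("rome hotels", "cheap rome hotels"),
    classify_reformulation("britny spears", "britney spears"),
    classify_reformulation("rome hotels", "paris hotels"),
]
```

A learned classifier would replace these hand-set rules with features and weights fitted on labeled query pairs; the sketch only shows what each class is meant to capture.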