Web search engines use indexes to
efficiently retrieve pages containing specified query terms. The
problem of compressed indexes that permit such fast retrieval has a
long history. We consider [1] the problem: assuming that the terms in a
page are generated from a probability distribution, how well can we
build such indexes that allow fast retrieval? Of particular interest is
the case when the probability distribution is Zipfian (or similar power
laws), since these are the distributions that arise on the web.
Then, motivated by contextual
advertising systems and other web applications involving
efficiency-accuracy tradeoffs, we introduce the ``similarity caching''
problem. Here, a cache hit is said to occur if the requested item is
similar but not necessarily equal to some cached item. We study this
problem both experimentally [2] and theoretically in the framework of
competitive algorithms [3].
[1]: Compressed Web Indexes
Joint work with Kumar, Raghavan
(To appear in WWW 2009)
[2]: Nearest-Neighbor Caching for Content-Match Applications
Joint work with Pandey, Broder, Josifovski, Kumar, Vassilvitskii
(To appear in WWW 2009)
[3]: Similarity Caching
Joint work with Kumar, Vassilvitskii
(To appear in PODS 2009)