Interdepartmental Seminar
on Algorithmics
Monday, 19 January 2004, 11:00
UbiCrawler: a scalable fully distributed web crawler
Massimo Santini
Università di Modena e Reggio Emilia
DIS - Dipartimento di Informatica e Sistemistica,
via Salaria 113
Room C2, second floor
Abstract:
This talk will report our experience in implementing UbiCrawler, a
scalable distributed web crawler, using the Java programming
language. The main features of UbiCrawler are platform independence,
linear scalability, graceful degradation in the presence of faults,
a very effective assignment function (based on consistent hashing)
for partitioning the domain to crawl, and, more generally, the
complete decentralization of every task. The need to handle
very large data sets has highlighted some limitations of the Java
APIs, which prompted the authors to partially reimplement them.
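
As a rough illustration of the idea behind an assignment function based on consistent hashing, the sketch below maps hosts to crawling agents on a hash ring. It is a generic textbook construction, not UbiCrawler's actual code: the class name ConsistentHashRing, the methods addAgent, removeAgent and agentFor, the replica count, and the FNV-1a hash are all assumptions made for this example.

```java
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative consistent-hashing ring; every name here is hypothetical. */
class ConsistentHashRing {
    // Ring positions (hash values) mapped to the agent that owns them.
    private final TreeMap<Long, String> ring = new TreeMap<>();
    // Number of ring positions ("replicas") placed for each agent.
    private final int replicas;

    ConsistentHashRing(int replicas) {
        this.replicas = replicas;
    }

    // 64-bit FNV-1a hash, chosen only for brevity (an assumption).
    static long hash(String key) {
        long h = 0xcbf29ce484222325L;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);
            h *= 0x100000001b3L;
        }
        return h;
    }

    // Place the agent's replicas on the ring.
    void addAgent(String agent) {
        for (int i = 0; i < replicas; i++) {
            ring.put(hash(agent + "#" + i), agent);
        }
    }

    // Remove only the ring positions that belong to this agent.
    void removeAgent(String agent) {
        for (int i = 0; i < replicas; i++) {
            ring.remove(hash(agent + "#" + i), agent);
        }
    }

    // A host is assigned to the first agent position at or after its hash,
    // wrapping around to the start of the ring if necessary.
    String agentFor(String host) {
        if (ring.isEmpty()) {
            throw new IllegalStateException("no agents available");
        }
        Map.Entry<Long, String> e = ring.ceilingEntry(hash(host));
        return (e != null ? e : ring.firstEntry()).getValue();
    }
}
```

In this sketch, when an agent is removed (for instance after a fault), only the hosts that were assigned to it move to other agents; every other host keeps its previous assignment, which is the property that makes this kind of partitioning attractive for a decentralized crawler.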