Statistical Learning Theory Meets Big Data: Randomized Algorithms for Extracting Frequent Itemsets and Association Rules
Abstract: The tasks of extracting Frequent Itemsets and Association Rules are fundamental primitives in data mining and database applications. Exact algorithms for these problems exist but require scanning the entire dataset, possibly multiple times. When data is too big to be analyzed in its entirety, and obvious approach is to analyze a sample of the data. The major difficulty in this approach is bounding the probability of under- or over-sampling any one of an unknown number of frequent itemsets. Our work circumvents this issue by applying the statistical concept of VC - dimension to develop a novel technique for providing tight bounds on the sample size that guarantees approximation within user-specified parameters (joint work with M. Riondato)
In a subsequent work we extend this technique to a novel randomized parallel algorithm to the problem. Our algorithm achieves near-linear speedup while avoiding costly replication of data.. We formulated and implemented the algorithm in the MapReduce parallel computation framework (joint work with M. Riondato, J. DeBrabant and R. Fonseca).
Bio: Eli Upfal a professor of computer science at Brown University, during 2002-2007 he was also the department chair. Before coming to Brown in 1998 he was a researcher and project manager at the IBM Almaden Research Center in California, and a professor at the Weizmann Institute in Israel. He received an undergraduate degree in mathematics and statistics and a doctorate degree in computer science from the Hebrew University in Jerusalem, Israel. His research focuses on the design and analysis of algorithms. In particular, randomized algorithms and probabilistic analysis of algorithms. Applications range from combinatorial and stochastic optimization to routing and communication networks, computational biology, and computational finance and statistical machine learning.