Efficient Stream Analysis and its Application to Big Data Processing
Nowadays stream analysis is used in many contexts where the amount of data and/or the rate at which it is generated rules out other approaches (e.g., batch processing). The data streaming model provides randomized and/or approximate solutions to compute specific functions over (distributed) streams of data items in worst-case scenarios, while striving for small resource usage. In particular, we look into two classical and related data streaming problems: frequency estimation and (distributed) heavy hitters. Solutions to these problems have a wide range of applications, spanning from databases to network monitoring. A less common field of application is stream processing, which is somewhat complementary and more practical: it provides efficient and highly scalable frameworks to perform soft real-time generic computation on streams, relying on cloud computing. This duality allows us to apply data streaming solutions to optimize stream processing systems. In this talk, we introduce a novel algorithm to track heavy hitters in distributed streams and two extensions of a well-known algorithm to estimate the frequencies of data items. We also tackle two related problems and their solutions: providing an even partitioning of the item universe based on the items' weights, and providing an estimation of the values carried by the items of the stream. We then apply these results to both network monitoring and stream processing. In particular, we leverage these solutions to perform load shedding as well as to load balance parallelized operators in stream processing systems.
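To make the two problems named in the abstract concrete, the following is a minimal sketch of approximate frequency estimation on a single stream. It assumes a Count-Min sketch, a standard structure for this problem; the abstract does not name the algorithm the talk extends, so this is an illustration and not the speaker's method. The class name, the parameters `width` and `depth`, and the `is_heavy` helper are all hypothetical.

```python
import hashlib


class CountMinSketch:
    """Illustrative Count-Min sketch (assumed technique, not the talk's algorithm)."""

    def __init__(self, width=256, depth=4):
        self.width = width    # counters per row; wider rows mean fewer collisions
        self.depth = depth    # number of independent hash rows
        self.total = 0        # total count of all updates seen so far
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, item, count=1):
        # Add the item's count to one counter in every row.
        self.total += count
        for row in range(self.depth):
            self.table[row][self._index(row, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over the rows
        # is the tightest estimate and never falls below the true frequency.
        return min(
            self.table[row][self._index(row, item)] for row in range(self.depth)
        )

    def is_heavy(self, item, theta):
        # Heavy hitter test: estimated frequency is at least a fraction theta
        # of the stream length. May report false positives, never false negatives.
        return self.estimate(item) >= theta * self.total


cms = CountMinSketch(width=64, depth=4)
for item in ["a"] * 50 + ["b"] * 30 + ["c"] * 5:
    cms.update(item)

print(cms.estimate("a"), cms.is_heavy("a", 0.5))
```

Memory use depends only on `width` and `depth`, not on the number of distinct items, which is the kind of small resource usage the data streaming model aims for. The price is a one-sided error: estimates can exceed the true frequency but never fall below it.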
Bio: Nicolò Rivetti di Val Cervo recently received a PhD in co-tutorship between the LINA / University of Nantes (France) and the DIAG / Sapienza University of Rome. He received both his B.S. and M.S. in Engineering in Computer Science at DIAG / Sapienza University of Rome. Nicolò's research interests currently focus on the data streaming model, and in particular on the design of algorithms for the estimation of functions over massively distributed data streams. His interests also span other fields dealing with big data, including network monitoring and stream processing.