Streaming algorithms for finding frequent items

Nikita Ivkin, Johns Hopkins University

The streaming model has supplied solutions for handling enormous data flows for over 20 years. The model assumes sequential data access and imposes sublinear memory as its primary restriction. Although the majority of the algorithms are randomized and approximate, the field facilitates numerous applications, from handling network traffic to analyzing cosmology simulations and beyond. This thesis focuses on one of the most foundational and well-studied problems: finding heavy hitters, i.e. frequent items:

  1. We close the long-standing complexity gap in finding heavy hitters with an L2 guarantee in insertion-only streams and present the first optimal algorithm, with a space complexity of O(1) words and O(1) update time. Our result improves on the Count sketch algorithm of Charikar et al. (2002), whose space and time complexity are O(log n).
  2. We consider the L2-heavy hitter problem in the rapidly emerging interval query setting. In contrast to the well-known sliding window model, where an algorithm must report the function of interest computed over the last N updates, the interval query model offers query flexibility: at any moment t, one can query the function value on any interval (t1, t2) subset of (t-N, t). We present the first L2-heavy hitter algorithm in this model and extend the result to the estimation of all streamable functions of the frequency vector.
  3. We provide an experimental study of the recent space-optimal result on streaming quantiles by Karnin et al. (2016). The problem can be considered a generalization of heavy hitters. Additionally, we suggest several variations of the algorithm that improve the running time from O(1/eps) to O(log 1/eps), provide a twice better space/precision trade-off, and extend the algorithm to the case of weighted updates.
  4. We establish a connection between finding “haloes”, i.e. dense areas, in cosmology N-body simulations and finding heavy hitters. We build the first streaming-based halo finder and scale it to handle datasets with up to 10^12 particles via GPU acceleration, sampling, and parallel I/O. We investigate its behavior and compare it to traditional in-memory halo finders. Our solution reduces the memory footprint from several terabytes to less than a gigabyte, thereby making the problem feasible for small servers and even desktops.
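To make the baseline in item 1 concrete, below is a minimal sketch of the Count sketch data structure of Charikar et al. (2002), which maintains approximate frequencies in sublinear space; the hash construction and table dimensions here are illustrative assumptions, not the thesis's optimal algorithm:

```python
import hashlib
from statistics import median

class CountSketch:
    """Minimal Count sketch: depth x width table of signed counters.
    Each row hashes an item to one bucket and a +/-1 sign; the frequency
    estimate is the median of the per-row signed counters."""

    def __init__(self, depth=5, width=256):  # illustrative sizes
        self.depth = depth
        self.width = width
        self.table = [[0] * width for _ in range(depth)]

    def _hash(self, item, row):
        # Derive a bucket index and a +/-1 sign from one digest per row.
        digest = hashlib.md5(f"{row}:{item}".encode()).digest()
        bucket = int.from_bytes(digest[:4], "big") % self.width
        sign = 1 if digest[4] & 1 else -1
        return bucket, sign

    def update(self, item, count=1):
        for row in range(self.depth):
            bucket, sign = self._hash(item, row)
            self.table[row][bucket] += sign * count

    def estimate(self, item):
        # Median over rows of the unbiased per-row estimates.
        estimates = []
        for row in range(self.depth):
            bucket, sign = self._hash(item, row)
            estimates.append(sign * self.table[row][bucket])
        return median(estimates)
```

On a skewed stream, a heavy item's estimate concentrates near its true count while light items mostly cancel, which is what makes the structure suitable for L2-heavy hitter detection:

```python
cs = CountSketch()
for _ in range(1000):
    cs.update("heavy")
for i in range(20):
    cs.update(f"light-{i}")
print(cs.estimate("heavy"))  # close to 1000
```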

Speaker Biography

I am a Ph.D. candidate in the Department of Computer Science at Johns Hopkins University, working with Prof. Vladimir Braverman. I received B.S. and M.S. degrees in Applied Math in 2011 and 2013, respectively, both from the Moscow Institute of Physics and Technology, and enrolled in the Ph.D. program in 2014. My primary interest is streaming algorithms and their applications, ranging from the theoretical foundations behind time and space complexity guarantees to hardware-specific optimizations in real-world applications.