Clustering Algorithms for Data Streams and Related Applications

This project is funded by NSF Award # 1447639.


Principal Investigators:

Vladimir Braverman (Principal Investigator)
Alexander Szalay (Co-Principal Investigator)
Randal Burns (Co-Principal Investigator)
Tamas Budavari (Co-Principal Investigator)
Benjamin Van Durme (Co-Principal Investigator)

Research Scientists:

Gerard Lemson
Mark Neyrinck


Stephen Chestnut
Nikita Ivkin
Zaoxing(Alan) Liu
Gregory Vorsanger
Lin Yang


This project will develop novel theoretical methods and algorithms for clustering massive datasets with applications to astronomy, neuroscience and natural language processing. Clustering is the process of creating groups of data based on similarities between individual data points. The developed theoretical methods will be used in applications where clustering algorithms are critical and the input data is extremely large. First, new clustering algorithms will be designed to scale and will allow for better cosmological simulations. The simulations involve billions of particles in each snapshot, and existing clustering algorithms based upon a simple friends-of-friends approach do not scale to these cardinalities. Second, this project will advance the computational capabilities in statistical neuroscience by employing clustering algorithms to discover both regular patterns and anomalies in normal and abnormal brain graphs. Finally, this research will explore the important topic of finding anomalies in massive text streams, such as Twitter. In this setting, one is concerned with detecting anomalous bursts in traffic content that share a similar pattern. These bursts might signal an important political event or a natural disaster. This project will support undergraduate and graduate research aimed at developing skills needed for algorithmic work on massive data sets. There exist numerous heuristics and approximation algorithms for many variants of the clustering problem. However, these methods are often slow or infeasible for applications with massive datasets. This research will improve space and time upper bounds for clustering algorithms in the streaming model. This project will address the k-mean and k-median problems in the dynamic streaming model, extend the results on separable data when the input comes from Euclidian space, improve the bounds in the sliding window model, combine the coresets technique with novel sampling approaches and the method of smooth histograms. The PIs' previous work has already been applied to natural language processing and this project will expand this direction further and explore the important topic of "First Story Detection." Furthermore, this research will explore the similarities and differences between various sampling and sketching techniques, and how they could be used in large multidimensional astronomical databases, like SDSS (Sloan Digital Sky Survey) SkyServer. These novel approaches will provide major speedups for the execution of large statistical aggregate queries. The new streaming algorithms will be used to find substructure in very large cosmological N-body simulations.



Source Code

    This is a preliminary version and we keep updating it. Version 1.0 ---OpenMP enabled

  • Count-sketch based Halo Finder (cs.tar.gz)

  • Pick-and-drop based Halo Finder (pd.tar.gz)


What do "Haloes" look like?

What do "Heavy Hitter" algorithms look like?


We gratefully acknowledge the support and help from