Sublinear Algorithms Workshop, January 7-9, 2016

Tentative schedule:

All talks will take place in Hodson Hall, auditorium 110 (on the map ). Poster session will take place in the lobby on the second floor. Break out rooms: 301, 305, and 313 on Thursday; 301, 313, and 315 on Friday and Saturday. Internet access is provided through the open WiFi hotspot "JHGuestnet".

Thursday (Jan. 7)
Hodson hall, auditorium 110 (on the map )

8:00 Continental breakfast

9:00 Christian Sohler Testing Cluster Structure of Graphs slides

9:30 Artur Czumaj Testing Directed Graphs slides

10:00 Eric Blais New Algorithms for Testing Monotonicity slides

10:30 Coffee break

11:15 Ilias Diakonikolas A New Approach for Distribution Testing slides

11:45 Muthu Muthukrishnan Estimating Graphical Models Sublinearly

12:15 Lunch

13:45 Oded Goldreich Testing Dynamic Environments notes

14:15 Contributed talks 1-3 Artem Barger, Stephen R. Chestnut and Jalaj Kumar Upadhyay slides_1 slides_2

15:15 Coffee break

16:00 Contributed talks 4-5 Lin Yang and Alan Roytman slides_1

16:30 Contributed talks 6-7 Nathanael Francois and Daniel Reichman slides_1

17:15 Finish

Friday (Jan. 8)
Hodson hall, auditorium 110 (on the map )

8:00 Continental breakfast

9:00 Ronitt Rubinfeld Local Algorithms for Sparse Spanning Graphs

9:30 Noga Ron-Zewi High-rate locally-correctable and locally-testable codes with sub-polynomial query complexity

10:00 Andrew McGregor The Latest on Linear Graph Sketching: Matchings, Densest Subgraphs, and Vertex Connectivity slides

10:30 Coffee break

11:15 Michael Dinitz Towards Resistance Sparsifiers slides

11:45 Suresh Venkatasubramanian Streaming Verification for Graphs and Points slides

12:15 Lunch

13:45 Grigory Yaroslavtsev L-p testing slides

14:15 Huy L Nguyen Distributed Submodular Maximization in MapReduce slides

14:45 Amit Chakrabarti Streaming Set Cover slides

15:15 Coffee break

16:00 Dana Ron Approximately counting triangles in sublinear time slides

16:30 Reception and Poster Session List of posters

17:15 Reception and Poster Session List of posters

17:45 Finish

Saturday (Jan. 9)
Hodson hall, auditorium 110 (on the map )

8:00 Continental breakfast

9:00 David Woodruff Beating CountSketch for Heavy Hitters in Insertion Streams slides

9:30 Alexandr Andoni Sketching and Embedding are Equivalent slides

10:00 Qin Zhang Algorithms for Querying Noisy Distributed/Streaming Datasets slides

10:30 Coffee break

11:15 Open problems session

11:45 Open problems session

12:15 Lunch

13:45 Edo Liberty Online PCA with Spectral Bounds slides

14:15 Shai Vardi Local Computation Algorithms - a short survey slides

14:45 Gregory Valiant Recovering structured probability matrices

15:15 Have a good trip!

Contributed talks

1 Artem Barger k-means for Streaming and Distributed Big Sparse Data

2 Stephen R. Chestnut The Space Complexity of Streaming Sums slides

3 Jalaj Kumar Upadhyay Differentially Private Linear Algebra in the Streaming Model slides

4 Lin Yang Streaming Symmetric Norms via Measure Concentration slides

5 Alan Roytman Approximating Functions over Frequencies and Implicit Matrices in Small Space

6 Nathanael Francois Streaming Property Testing of Visibly Pushdown Languages slides

7 Daniel Reichman Testing Pattern-Freeness

Posters

1 Lin Yang Streaming Symmetric Norms via Measure Concentration

2 Artem Barger k-means for Streaming and Distributed Big Sparse Data

3 Nathanael Francois Streaming Property Testing of Visibly Pushdown Languages

4 Jalaj Kumar Upadhyay Differentially-private linear algebra in the Streaming Model

5 Roei Tell On Being Far from Far and on Dual Problems in Property Testing

6 Sepehr Assadi Dynamic Sketching for Graph Optimization Problems with Application to Cut-Preserving Sketches

7 Clement Canonne Testing Shape Restrictions of Discrete Distributions

8 Themis Gouleakis Sampling correctors

9 Angela Wu Property Testing of G-isomorphism

11 Ludwig Schmidt Nearly Linear-Time Algorithms for Structured Sparsity

12 Meiram Murzabulatov Testing Convexity of Figures Under the Uniform Distribution

13 Venkata Gandikota Local testing in Lattices

14 Hendrik Fichtenberger Towards Constant Size Graphs That Preserve the Local Structure of Large Graphs

15 Gautam Kamath Optimal Testing for Properties of Distributions

16 Jerry Li Nearly Optimal Density Estimation in Nearly Linear Time

Abstracts:

Local Algorithms for Sparse Spanning Graphs
Ronitt Rubinfeld, Massachusetts Institute of Technology, Tel Aviv University

Constructing a spanning tree of a graph is one of the most basic tasks in graph theory. Motivated by several recent studies of local graph algorithms, we consider the following variant of this problem. Let G be a connected bounded-degree graph. Given an edge e in G we would like to decide whether e belongs to a connected subgraph G' consisting of (1 + eps)n edges (for a prespecifi ed constant eps in (0, 1)), where the decision for di fferent edges should be consistent with the same subgraph G'. How many queries to G are needed for this task? We present some answers and some questions.

Joint works with Reut Levi, Guy Moshkovitz, Dana Ron and Asaf Shapira

Testing Dynamic Environments
Oded Goldreich, Weizmann Institute of Science

We initiate a study of learning and testing dynamic environments, focusing on environment that evolve according to a fixed local rule. The (proper) learning task consists of obtaining the initial configuration of the environment, whereas for non-proper learning it suffices to predict its future values. The testing task consists of checking whether the environment has indeed evolved from some initial configuration according to the known evolution rule.
We focus on the temporal aspect of these computational problems, which is reflected in two requirements:
(1) It is not possible to ""go back to the past"" and make a query concerning the environment at time t after making a query at time t' > t;
(2) Only a small portion of the environment is inspected in each time slot.

We present some general observations, an extensive study of two special cases, two separation results, and a host of open problems. The two special cases that we study refer to linear rules of evolution and to rules of evolution that represent simple movement of objects. Specifically, we show that evolution according to any linear rule can be tested within a total number of queries that is sublinear in the size of the environment, and that evolution according to a simple one-dimensional movement can be tested within a total number of queries that is independent of the size of the environment. In my talk, I plan to focus on the general observation and state only the results for the aforementioned two special cases.

Approximately counting triangles in sublinear time
Dana Ron, Tel Aviv University

We consider the problem of estimating the number of triangles in a graph. This problem has been extensively studied in both theory and practice, but all existing algorithms read the entire graph. In this work we design a sublinear-time algorithm for approximating the number of triangles in a graph, and prove a matching lower bound.

The Latest on Linear Graph Sketching: Matchings, Densest Subgraphs, and Vertex Connectivity
Andrew McGregor, University of Massachusetts at Amherst

In this talk, we survey recent work on using random linear projections, a.k.a. sketches, to solve graph problems. Sketches are useful in a variety of computational models including the dynamic graph stream model were the input is defined by a stream of edge insertions and deletions that need to be processed in small space. A large number of problems have now been considered in this model including edge and vertex connectivity, sparsification, densest subgraph, correlation clustering, vertex cover and matching.

Parameterized and Promised Streaming: Making Big Data More Accessible
MohammadTaghi HajiAghayi, University of Maryland at College Park

Big networks are constantly growing in both size and relevance: from social networks such as Google+, Facebook, and Twitter, to brain networks, gene regulatory networks, and health/disease networks. The traditional approach to analyzing such big datasets is to use powerful supercomputers (clusters), ideally large enough to store the data in main memory. The downsides to this approach are that many potential users of big data lack such powerful computational resources (e.g. point-of-sale Bitcoin blockchain analysis), and it can be difficult to solve unexpected problems within such a large infrastructure (e.g. image analysis after the Boston Marathon Bombing). Our main goal here is to enable the processing of huge datasets on computational devices with a limited amount of fast memory, connected to a relatively slow external data source. In particular in this talk, we introduce two new approaches of parameterized and promised streaming for optimization (often for NP-hard) problems over dynamic graph streams in which edges can be both inserted and deleted.

Algorithms for Querying Noisy Distributed/Streaming Datasets
Qin Zhang, Indiana University Bloomington

This talk concerns the following question: can we run distributed and streaming algorithms directly on the noisy datasets, resolve the noise "on the fly", and retain communication/space efficiency compared with the noise-free setting? Here 'noisy' means that a real-world entity may appear in different forms in different datasets, but those variants should be considered as the same universe element when performing statistical estimations. We give a first set of communication/space-efficient solutions for statistical estimations on distributed/streaming noisy datasets.

Distributed Submodular Maximization in MapReduce
Huy L Nguyen, Toyota Technological Institute at Chicago

A wide variety of problems in machine learning, including exemplar clustering, document summarization, and sensor placement, can be cast as constrained submodular maximization problems. A lot of recent effort has been devoted to developing distributed algorithms for these problems. However, these results suffer from high number of rounds, suboptimal approximation ratios, or both. We develop a framework for bringing existing algorithms in the sequential setting to the distributed setting, achieving near optimal approximation ratios for many settings in only a constant number of MapReduce rounds.

High-rate locally-correctable and locally-testable codes with sub-polynomial query complexity
Noga Ron-Zewi, Princeton, Rutgers

Locally-correctable codes (LCCs) and locally-testable codes (LTCs) are special families of error-correcting codes that admit highly efficient sub-linear time algorithms for error correction and detection, respectively, while making a small number of queries to the received word. LCCs and LTCs were originally studied in complexity theory because of their close relationship to program checking and probabilistically-checkable proofs (PCPs), but in recent years there has been a new incentive for their study due to their potential use for massive data transmission and storage purposes and the successful implementation of related families of local codes in cloud storage systems. We show constructions of LCCs and LTCs with constant rate, constant relative distance, and sub-polynomial number of queries and running time of the form $exp(sqrt{log n})$ for length n codewords. Prior to our work, such codes were known to exist only with polynomial number of queries and running time of the form $n^{beta}$ (for a constant $beta>0$). and there were several, quite different, constructions known.

Along the way, we also construct LCCs and LTCs over large (but constant) size alphabets, with the same number of queries and running time of $exp(sqrt{log n})$, which additionally have the property of approaching the Singleton bound: they have almost the best-possible relationship between their rate and distance. Such a result was previously not known for any sub-linear number of queries and running time.

If time permits I will also discuss a more recent work that further improves the running time and number of queries of LTCs to a quasi-polylogarithmic function of the form $(log n)^{O(log log n)}$.

Joint work with Swastik Kopparty, Or Meir and Shubhangi Saraf

Streaming Set Cover
Amit Chakrabarti, Dartmouth

Given a stream of m subsets of the universe {1,...,n}, with m >> n, the Set Cover problem asks for a small subcollection of these subsets that covers the universe. Although this is a classic optimisation problem, its space complexity in such a streaming setting has only recently begun to be understood.

We focus on semi-streaming algorithms: those running in space O(n poly(log n)). We give tight upper and lower bounds that fully resolve the pass/approximation tradeoffs achievable: in short, p semi-streaming passes lead to an approximation ratio of n^{1/(p+1)}. The lower bound uses a novel algebraic construction of an abstract incidence geometry, which may be interesting in its own right as a combinatorial result.

Joint work with Tony Wirth.

New Algorithms for Testing Monotonicity
Eric Blais, University of Waterloo

Monotonicity testing asks: how many queries to an unknown Boolean function f must we make to distinguish monotone functions from functions that are far from monotone? Algorithms for monotonicity testing have typically been very simple: select a pair of comparable inputs x <= y according to some appropriate distribution, verify that f(x) <= f(y), and repeat. (The analysis of these algorithms, however, is far from simple, and indeed has led to fundamental new structural insights about monotonicity and Boolean functions.)

In this talk, we will explore two fundamentally different ways to design algorithms for testing monotonicity. We will survey how the analysis of these algorithms leads to new results, to connections with the analysis of pair testers, and to new questions. This is joint work with Alexander Belov.

A New Approach for Distribution Testing
Ilias Diakonikolas, University of Edinburgh

We study problems in distribution property testing: Given sample access to one or more unknown discrete distributions, we want to determine whether they have some global property or are $\eps$-far from having the property in $\ell_1$ distance. We provide a simple and general approach to obtain upper bounds in this setting, by reducing $\ell_1$-testing to $\ell_2$-testing. Our reduction yields optimal $\ell_1$-testers, by using a standard $\ell_2$-tester as a black-box.

Using our framework, we obtain sample-optimal and computationally efficient estimators for a wide variety of $\ell_1$ distribution testing problems, including the following: identity testing to a fixed distribution, closeness testing between two unknown distributions (with equal/unequal sample sizes), independence testing (in any number of dimensions), closeness testing for collections of distributions, and testing $k$-flatness. For several of these problems, we give the first optimal tester in the literature. Moreover, our estimators are significantly simpler to state and analyze compared to previous approaches.

As our second main contribution, we provide a direct general approach for proving distribution testing lower bounds, by bounding the mutual information. Our lower bound approach is not restricted to symmetric properties, and we use it to prove tight lower bounds for all the aforementioned problems.

(Joint work with Daniel Kane.)

Towards Resistance Sparsifiers
Michael Dinitz, Johns Hopkins University

We study resistance sparsification of graphs, in which the goal is to find a sparse subgraph (with reweighted edges) that approximately preserves the effective resistances between every pair of nodes. We show that every dense regular expander admits a (1+eps)-resistance sparsifier of size O~(n/eps), and conjecture this bound holds for all graphs on n nodes. In comparison, spectral sparsification is a strictly stronger notion and requires Ω(n/eps^2) edges even on the complete graph.

Our approach leads to the following structural question on graphs: Does every dense regular expander contain a sparse regular expander as a subgraph? Our main technical contribution, which may of independent interest, is a positive answer to this question in a certain setting of parameters. Combining this with a recent result of von Luxburg, Radl, and Hein~(JMLR, 2014) leads to the aforementioned resistance sparsifiers.

Joint work with Robert Krauthgamer and Tal Wagner

Recovering structured probability matrices
Gregory Valiant, Stanford

We consider the problem of accurately recovering a matrix B of size M x M , which represents a probability distribution over M^2 outcomes, given access to an observed matrix of "counts" generated by taking independent samples from the distribution B. How can structural properties of the underlying matrix B be leveraged to yield computationally efficient and information theoretically optimal reconstruction algorithms? When can accurate reconstruction be accomplished in the sparse data regime? This basic problem lies at the core of a number of questions that are currently being considered by different communities, including community detection in graphs, learning structured models such as topic models or hidden Markov models, and the efforts from the natural language processing community to compute "word embeddings". Many aspects of this problem - both in terms of learning and property testing/estimation and on both the algorithmic and information theoretic sides - remain open. We will present some recent progress on this question, and discuss several intriguing open directions related to achieving computationally efficient and information theoretically near-optimal algorithms for extracting structure underlying such count matrices."

Contributed Talks Abstracts:

Streaming Symmetric Norms via Measure Concentration
Lin Yang, Johns Hopkins University

We characterize the streaming space complexity of every symmetric norm l (a norm on Rn invariant under sign-flips and coordinate-permutations), by relating this space complexity to the measure-concentration characteristics of l. Specifically, we provide upper and lower bounds on the space complexity of approximating the norm of the stream, where both bounds depend on the median of l(x) when x is drawn uniformly from the l2 unit sphere. The same quantity governs many phenomena in high-dimensional spaces, such as large-deviation bounds and the critical dimension in Dvoretzky's Theorem. The family of symmetric norms contains several well-studied norms, such as all lp norms, and indeed we provide a new explanation for the disparity in space complexity between p ≤2 and p>2. In addition, we apply our general results to easily derive bounds for several norms were not studied before in the streaming model, including for example the top-k norm and the k-support norm, which was recently shown to be effective for machine learning tasks. Overall, these results make progress on two outstanding problems in the area of sublinear algorithms (Problems 5 and 30 in http://sublinear.info).

L-p testing
Grigory Yaroslavtsev, University of Pennsylvania

I will introduce a framework for systematic studies of sublinear algorithms for approximately testing properties of real-valued data with respect to Lp distances for p >= 1. Such algorithms distinguish datasets which either have (or are close to having) a certain property from datasets which are far from having it with respect to Lp distance. For applications involving noisy real-valued data, using Lp distances allows algorithms to withstand noise of bounded Lp norm. While the classical property testing framework developed with respect to Hamming distance has been studied extensively, testing with respect to Lp distances has received little attention.

We use our framework to design simple and fast algorithms for classic problems, such as testing monotonicity, convexity and the Lipschitz property, and also distance approximation to monotonicity. In particular, for functions over the hypergrid domains [n]^d , the complexity of our algorithms for all these properties does not depend on the linear dimension n. I will also describe applications of our Lp-testing algorithms to testing assumptions used in the analysis of gradient descent methods for convex optimization and suggest multiple directions for future work.

Joint work with Piotr Berman and Sofya Raskhodnikova (STOC'14)

k-means for Streaming and Distributed Big Sparse Data
Artem Barger, University of Haifa

We provide the first streaming algorithm for computing a provable approximation to the k-means of sparse Big data. Here, sparse Big Data is a set of n vectors in R^d, where each vector has O(1) non-zeroes entries, and d > n. E.g., adjacency matrix of a graph, web-links, social network, document-terms, or image-features matrices.

Our streaming algorithm stores at most log(n)*k^{O(1)} input points in memory. If the stream is distributed among M machines, the running time reduces by a factor of M, while communicating a total of Mk^{O(1)} (sparse) input points between the machines.

Differentially-private linear algebra in the Streaming Model
Jalaj Kumar Upadhyay, Pennsylvania State University

In this talk, we show how to perform linear algebraic tasks while preserving privacy when the data is streamed online. We give a new construction of a fast-Johnson-Lindenstrauss transform that preserves privacy, and use it to generate private sketch of the streamed data. We give optimal, up to logarithmic factor, space data-structures that can compute low rank approximation, linear regression, and matrix multiplication, while preserving differential privacy.

Testing Pattern-Freeness
Daniel Reichman

We consider the problem of testing pattern-freeness: given a string $I$ and a fixed pattern $J$ over a finite alphabet $\Sigma$, decide whether $I$ has no occurrence of $J$ or alternatively, one has to modify $I$ in at least an $\epsilon$-fraction of its locations to obtain a string that is $J$-free. The two-dimensional analog with 2D-images and patterns is studied as well.

We show that other than a small number of specific patterns, there is a one-sided tester for this problem whose query complexity is $2 k/\epsilon$ for patterns of size $k$. Our algorithm is very simple and its query complexity improves on %is better than known bounds of current sublinear algorithms for this problem. We also show a lower bound, proving that any tester erring with probability at most $1/3$ must make $\Omega(k)$ queries to $I$.

The space complexity of streaming sums
Stephen R. Chestnut, ETH Zurich

A central problem in the theory of algorithms for data streams is to determine which functions on a stream can be approximated in sublinear, and especially poly-logarithmic, space. Given a function g, we study the space complexity of approximating sum_{i=1}^n g(|f_i|), where f in Z^n is the frequency vector of a turnstile stream. This is a generalization of the well-known frequency moments problem, and previous results apply only when g is monotonic or has a special functional form. Our contribution is to give a condition such that, except for a narrow class of functions g, there is a space-efficient approximation algorithm for the sum if and only if g satisfies the condition. The functions g that we are able to characterize include all convex, concave, monotonic, polynomial, and trigonometric functions, among many others, and is the first such characterization for non-monotonic functions. This is joint work with Vladimir Braverman, Lin F. Yang, and David P. Woodruff.

Approximating Functions over Frequencies and Implicit Matrices in Small Space
Alan Roytman, Tel Aviv University

As the amount of data being generated continues to grow at a staggering rate, streaming algorithms are increasingly becoming more important as a practical tool to analyze and make sense of all the information. In practice, such applications generate vast amounts of data in a very short period of time, and hence it is infeasible to store everything. This presents a pressing question: when is it possible to summarize data while still providing approximate solutions with good theoretical guarantees?

We study this question in the context of approximating functions over frequencies and implicit matrices. For frequency-based functions, we give a complete characterization of when the sum of such functions over frequencies of elements can be approximated using small space in the sliding window model. In addition, for a broad class of functions, we formalize the notion of universality for streaming over sliding windows, and construct a universal algorithm that can simultaneously approximate any such function from the class without knowing the function to be approximated in advance.

We also give general techniques for approximating functions over implicit matrices, which we describe as matrices generated via the outer product of two vectors presented as a stream. Our results automatically yield improved space bounds for the problem of identifying correlations in data streams, in which we are given a stream of pairs of integers whose frequencies induce a joint distribution and marginal distributions, and we would like to measure how close the joint distribution is to the product of the marginal distributions.

Local Computation Algorithms - a short survey
Shai Vardi, Weizmann Institute of Science

Local computation algorithms (LCAs) produce small parts of a single solution to a given search problem in sublinear time. Consider, for example, a massive graph on which we would like to compute a maximal independent set (MIS), but never need the entire solution at any given time. Instead, we are occasionally queried about whether certain vertices are in the MIS. An LCA would allow us to reply to these queries consistently (i.e., if the LCA is queried on all of the vertices the replies would conform to a coherent solution), using sublinear time and space. I will give a short overview of some recent results on LCAs.

Streaming Property Testing of Visibly Pushdown Languages
Nathanael Francois, TU Dortmund, Germany

In the context of language recognition, we demonstrate the superiority of streaming property testers against streaming algorithms and property testers, when they are not combined. Initiated by Feigenbaum et al., a streaming property tester is a streaming algorithm recognizing a language under the property testing approximation: it must distinguish inputs of the language from those that are ε-far from it, while using the smallest possible memory (rather than limiting its number of input queries).

Our main result is a streaming ε-property tester for visibly pushdown languages (VPL) with one-sided error using memory space poly((logn)/ε).

This constructions relies on a (non-streaming) property tester for weighted regular languages based on a previous tester by Alon et al. We provide a simple application of this tester for streaming testing special cases of instances of VPL that are already hard for both streaming algorithms and property testers.

Our main algorithm is a combination of an original simulation of visibly pushdown automata using a stack with small height but possible items of linear size. In a second step, those items are replaced by small sketches. Those sketches relies on a notion of suffix-sampling we introduce. This sampling is the key idea connecting our streaming tester algorithm to property testers.

Tentative schedule:

Abstracts:

Local Algorithms for Sparse Spanning Graphs Ronitt Rubinfeld, Massachusetts Institute of Technology, Tel Aviv University

Testing Dynamic Environments Oded Goldreich, Weizmann Institute of Science

Approximately counting triangles in sublinear time Dana Ron, Tel Aviv University

The Latest on Linear Graph Sketching: Matchings, Densest Subgraphs, and Vertex Connectivity Andrew McGregor, University of Massachusetts at Amherst

Parameterized and Promised Streaming: Making Big Data More Accessible MohammadTaghi HajiAghayi, University of Maryland at College Park

Algorithms for Querying Noisy Distributed/Streaming Datasets Qin Zhang, Indiana University Bloomington

Distributed Submodular Maximization in MapReduce Huy L Nguyen, Toyota Technological Institute at Chicago

High-rate locally-correctable and locally-testable codes with sub-polynomial query complexity Noga Ron-Zewi, Princeton, Rutgers

Streaming Set Cover Amit Chakrabarti, Dartmouth

New Algorithms for Testing Monotonicity Eric Blais, University of Waterloo

A New Approach for Distribution Testing Ilias Diakonikolas, University of Edinburgh

Towards Resistance Sparsifiers Michael Dinitz, Johns Hopkins University

Recovering structured probability matrices Gregory Valiant, Stanford