## Spring 2006

February 9, 2006

Node-to-node latencies can be seen as a notion of distance in networks such as the Internet. For a number of applications it is useful to represent this distance via coordinates in a low-dimensional Euclidean space and, moreover, compute these coordinates in a decentralized fashion, with low overhead on participating nodes. In particular, each node is allowed only a near-constant number of distance measurements. This is in sharp contrast with existing theoretical work on metric embeddings, which assumes full (and centralized) access to the distance matrix.

Here we show how to extend theoretical guarantees for embedding to this decentralized setting. We also consider a more basic problem of ‘triangulation’, in which one uses the triangle inequality to infer the distances that have not been measured. We also report on some successes of this style of analysis in the context of a deployed system.
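The triangulation idea can be made concrete: even a few measured distances bracket an unmeasured one via the triangle inequality. The sketch below is a minimal illustration with made-up distances, not the talk's actual system; the function name and data are hypothetical.

```python
# Triangulation sketch: bound an unmeasured distance d(a, c) using nodes b
# whose distances to both a and c have been measured.

def triangulate(measured, a, c):
    """Return (lower_bound, upper_bound) on d(a, c).

    measured: dict mapping frozenset({u, v}) -> measured distance.
    Returns (0, inf) if no node has measured distances to both a and c.
    """
    lo, hi = 0.0, float("inf")
    nodes = {n for pair in measured for n in pair}
    for b in nodes - {a, c}:
        ab = measured.get(frozenset({a, b}))
        bc = measured.get(frozenset({b, c}))
        if ab is None or bc is None:
            continue
        hi = min(hi, ab + bc)          # d(a,c) <= d(a,b) + d(b,c)
        lo = max(lo, abs(ab - bc))     # d(a,c) >= |d(a,b) - d(b,c)|
    return lo, hi

d = {frozenset({"a", "b"}): 3.0, frozenset({"b", "c"}): 4.0}
print(triangulate(d, "a", "c"))  # (1.0, 7.0)
```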

February 17, 2006

Deformation is a key component in many applications including virtual surgical simulation and the animation of digital characters in the movie industry. Previous deformation methods have led to non-intuitive ways of specifying the deformation or have been too expensive to compute in real-time. This talk will focus on three methods we have developed for creating intuitive deformations of 3D shapes. The first method is a new, smooth volumetric subdivision scheme that allows the user to specify deformations using conforming collections of tetrahedra, which generalizes the widely used Free-Form Deformation method. The next technique extends a fundamental interpolant in Computer Graphics called Barycentric Coordinates and lets the user manipulate low-resolution polygon meshes to control deformations of high-resolution shapes. Finally, the talk will conclude with some of our recent work on creating deformations described by collections of points using a technique called Moving Least Squares.
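To give a flavor of the Moving Least Squares idea, here is a heavily simplified, translation-only sketch: each shape point moves by a distance-weighted average of the control-point displacements. The full method fits an affine or rigid transform per point; this toy version (with made-up coordinates) illustrates only the weighting.

```python
# Translation-only Moving Least Squares sketch: deform a point by the
# distance-weighted average of the displacements of nearby control points.

def mls_translate(v, handles, targets, alpha=2.0):
    """Deform point v given control points `handles` moved to `targets`."""
    num = [0.0, 0.0]
    den = 0.0
    for (px, py), (qx, qy) in zip(handles, targets):
        d2 = (px - v[0]) ** 2 + (py - v[1]) ** 2
        if d2 == 0.0:                   # v coincides with a handle: follow it
            return (qx, qy)
        w = 1.0 / d2 ** (alpha / 2)     # influence falls off with distance
        num[0] += w * (qx - px)
        num[1] += w * (qy - py)
        den += w
    return (v[0] + num[0] / den, v[1] + num[1] / den)

# Moving one handle up by 1 drags a nearby point almost all the way along,
# while the distant, unmoved handle contributes almost nothing.
print(mls_translate((0.1, 0.0), [(0, 0), (10, 0)], [(0, 1), (10, 0)]))
# close to (0.1, 1.0)
```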

February 20, 2006

The rapid growth of commodity hardware capacity has made it feasible to build powerful unified information systems on top of decentralized, inexpensive off-the-shelf components. Such clustered systems consolidate the resources of component nodes to serve demanding workloads, while the nodes can be heterogeneous in capacity, vendor, hardware, or even software. Clustered systems provide an attractive alternative to traditional centralized parallel systems, owing to their cost-effectiveness and increased scalability.

However, heterogeneity among the nodes in clustered systems poses significant challenges to the effectiveness of traditional load management techniques and endangers the performance as well as scalability of those systems. In this talk, we present a set of new techniques for efficient load management in clustered systems built with heterogeneous components. Our techniques adapt to the heterogeneous infrastructure and non-uniform workloads of clustered systems with no a priori knowledge of either. They also mesh well with the decentralized architecture of clustered systems, because they build on tunable hashing schemes and require little or no shared state among the cluster nodes. Our techniques facilitate the construction of self-managing resource allocation and load balancing systems in order to improve the performance of heterogeneous clustered systems and dramatically reduce their management cost. This allows clustered systems to scale to sizes that were previously unmanageable and allows a clustered system to enlist highly heterogeneous resources automatically.
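One common tunable hashing scheme of the kind alluded to is consistent hashing with capacity-weighted virtual nodes: a higher-capacity node receives proportionally more points on the hash ring and therefore a larger share of the keys, with no shared state needed. The sketch below is illustrative only (node names and capacities are made up), not the speakers' specific technique.

```python
# Consistent hashing with capacity-weighted virtual nodes: the share of
# keys a node serves is tuned by how many ring points it is given.

import bisect
import hashlib

def _h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, capacities, points_per_unit=50):
        # Each node gets (capacity * points_per_unit) virtual points.
        self.ring = sorted(
            (_h(f"{node}#{i}"), node)
            for node, cap in capacities.items()
            for i in range(cap * points_per_unit)
        )
        self.keys = [h for h, _ in self.ring]

    def lookup(self, key):
        # A key maps to the first virtual point clockwise from its hash.
        i = bisect.bisect(self.keys, _h(key)) % len(self.ring)
        return self.ring[i][1]

ring = HashRing({"fast-node": 4, "slow-node": 1})
hits = {"fast-node": 0, "slow-node": 0}
for k in range(1000):
    hits[ring.lookup(f"key{k}")] += 1
print(hits)  # fast-node serves roughly 4x as many keys as slow-node
```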

February 23, 2006

Whole-genome sequencing of many species has presented us with the opportunity to deduce the evolutionary relationships between each and every nucleotide. In this talk, I will present algorithms for this problem, which is that of multiple whole-genome alignment. The sensitivity of whole-genome alignments to parameter values can be ascertained through the use of alignment polytopes, which will be explained. I will also show how whole-genome alignments are used in comparative genomics, including the identification of novel genes, the location of micro-RNA targets, and the elucidation of cis-regulatory element and splicing signal evolution.

*Distinguished Lecturer*

February 23, 2006

I have been working for the last decade to get all scientific data and literature online and cross-indexed. Progress has been astonishing, but the real changes are going to happen in the next decade. First I will talk about what is happening to scientific literature — the funding agencies are forcing it into the public domain. Then I will discuss scientific data, which traditionally has been hoarded by investigators (with notable exceptions). The forced electronic publication of scientific literature and data poses some deep technical questions: just exactly how does anyone read and understand it? How can we preserve it so that it will be readable in a century? I have been pursuing these questions in Geography (with http://TerraService.Net), Astronomy (with the World-Wide Telescope — e.g. http://SkyServer.Sdss.org and http://www.ivoa.net/) and more recently in bioinformatics (with portable PubMedCentral). Incidental to this, each intellectual discipline X is building an X-informatics and computational-X branch. It is those branches, in collaboration with Computer Science, that are faced with solving these issues.

**Speaker Biography**: Jim Gray is part of Microsoft’s research group. His work focuses on databases and transaction processing. Jim is active in the research community, is an ACM, NAE, NAS, and AAAS Fellow, and received the ACM Turing Award for his work on transaction processing. He edits a series of books on data management, and has been active in building online databases like http://terraService.Net and http://skyserver.sdss.org.

February 24, 2006

Modern computer security research spans multiple levels of abstraction, dealing with topics ranging from provably secure cryptographic building blocks to large and socially important distributed systems. In this talk I describe three projects that represent different points along this continuum. (1) I describe my research improving the security of the popular Secure Shell (SSH) protocol against a privacy attack that I discovered. I then describe how I prove the security of my fixes using the principles of modern reduction-based provable security. My research here is both formal and systems-oriented; the latter means that when I create my formal models and definitions, I pay close attention to the constraints and goals of the existing SSH protocol. (2) I describe a method for remotely distinguishing two physical devices even if those two devices have the same hardware and software configuration. My technique exploits a new information leakage channel in the TCP protocol: by analyzing the stream of TCP packets from a device, it is in some cases possible to infer information about the transmitting device’s clock skew. My approach has applications to computer forensics, detecting virtualization technologies, counting the number of devices behind a NAT, and de-anonymizing anonymized network traces. (3) I describe serious attacks against a deployed electronic voting machine, namely a version of the Diebold AccuVote-TS electronic voting machine. I then describe some social and technical implications of my results as well as future directions.

March 2, 2006

The goal of code obfuscation is to make a program completely “unintelligible” while preserving its functionality. Obfuscation has been used for years in attempts to prevent reverse engineering, e.g., in copy protection and licensing schemes. Recently, spammers have utilized it to conceal code that spawns pop-ups. Finally, obfuscation is a cryptographer’s dream: nearly any cryptographic task could be achieved *securely* by writing a simple program and then obfuscating it (if possible!).

Barak et al (2001) formalized the notion of obfuscation and demonstrated the existence of a (contrived) class of functions that cannot be obfuscated. In contrast, Canetti and Wee gave an obfuscator for a particular class of simple functions, called point functions, that output 1 on a single point (and output 0 everywhere else). Thus, it seemed completely possible that most functions of interest can be obfuscated, even though in principle general purpose obfuscation is impossible.
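The intuition behind point-function obfuscation can be sketched with the folklore hash-based construction: publish a hash of the secret point rather than the point itself. Anyone can evaluate the function, yet (idealizing the hash as a random oracle) nothing about the point leaks beyond black-box access. This is a simplification; the actual constructions of Canetti and Wee rest on stronger primitives and formal security arguments.

```python
# Folklore point-function obfuscation sketch: store a hash of the secret
# point; evaluation compares hashes, so the secret itself never appears
# in the "obfuscated" program.

import hashlib

def obfuscate_point(secret):
    digest = hashlib.sha256(secret.encode()).hexdigest()
    def f(x):
        # f(x) = 1 exactly when x equals the hidden point, 0 elsewhere.
        return 1 if hashlib.sha256(x.encode()).hexdigest() == digest else 0
    return f

f = obfuscate_point("s3cret")
print(f("s3cret"), f("guess"))  # 1 0
```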

We argue that this is unlikely to be the case, by showing that general classes of functions that one would like to obfuscate, are actually not obfuscatable. In particular, we show that for one of our classes, given an obfuscation of two functions in the class, each with a *secret* of its own, one can compute a hidden function of these secrets. Surprisingly, this holds even when the secrets are chosen completely independently of each other. Our results hold in an augmentation of the formal obfuscation model of Barak et al (2001) that includes auxiliary input.

March 3, 2006

How should one make a sequence of decisions in an uncertain environment? If one had complete knowledge of the future, it would be an offline problem of pure *optimization*. This talk relates online repeated decision-making problems to their offline optimization counterparts.

We begin with a general continuous decision making problem, which can be viewed as online convex optimization with *extremely* little feedback. As an example, consider a company that wishes to repeatedly choose its monthly advertising portfolio so as to maximize its average monthly profit. Each month, after deciding how much to advertise on a number of channels, it finds out only its total profit, and not how much it would have made had it advertised differently. This profit is a function of its advertising expenditures as well as many other decisions and unrelated economic factors. More precisely, a decision-maker must choose a sequence of points x_1, x_2, …, from some fixed convex feasible set. After choosing the t’th point, x_t, *only* the reward f_t(x_t) is revealed and no other information about the reward function f_t. Moreover, the functions f_1, f_2, …, may be arbitrary (bounded convex) functions and may be completely unrelated. We give a very simple randomized algorithm that achieves average performance approaching that of the best single decision in hindsight (at a rate inversely polynomial in the number of decisions). One key idea is to approximate the gradient of a function by evaluating it at a single point. Applications include advertising, production, and investment portfolios.
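The single-point gradient idea can be sketched directly: evaluating f once at a randomly perturbed point x + δu, with u uniform on the unit sphere, gives an unbiased estimate of the gradient of a smoothed version of f when scaled by d/δ. The code below is a minimal Monte Carlo illustration (function and parameters are made up), not the talk's full online algorithm.

```python
# One-point gradient estimation: (d/delta) * f(x + delta*u) * u is, in
# expectation over u uniform on the unit sphere, the gradient of a
# smoothed version of f at x.

import math
import random

def one_point_gradient(f, x, delta, trials=200000):
    d = len(x)
    g = [0.0] * d
    for _ in range(trials):
        u = [random.gauss(0, 1) for _ in range(d)]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]               # uniform on the unit sphere
        fx = f([xi + delta * ui for xi, ui in zip(x, u)])
        for i in range(d):
            g[i] += (d / delta) * fx * u[i]
    return [gi / trials for gi in g]

# For a linear function the smoothed gradient is exact:
# f(x, y) = 3x + 2y has gradient (3, 2) everywhere.
est = one_point_gradient(lambda p: 3 * p[0] + 2 * p[1], [1.0, 1.0], 0.1)
print(est)  # roughly [3, 2], up to Monte Carlo noise
```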

We next consider discrete repeated decision-making problems and their offline combinatorial optimization counterparts. We show how, in the face of complete uncertainty, almost any offline combinatorial optimization algorithm can be modified and applied online to the repeated decision-making setting, with nearly the same efficiency and performance guarantees. We discuss applications to adaptive routing, online auctions, and classification learning.

March 9, 2006

The information revolution has made massive amounts of data available from a myriad of sources such as biology, finance, e-commerce, advertising and the web. Handling these high-dimensional, often noisy and inconsistent datasets has become an important focus of algorithmic research. (1) Rank Aggregation deals with combining inconsistent rankings suggested by different voters. This old problem from social choice theory has recently attracted the attention of computer scientists in surprising new contexts such as search engines and information retrieval systems. I will present new, improved algorithms, and discuss related results and follow-up work on cluster aggregation, hierarchical clustering and phylogeny. (2) Sublinear algorithms compute functions by considering only a small sample of the input. This is especially important when the input is massive. I will discuss new results in sublinear algorithms in the context of property testing, property-preserving data reconstruction and nearest-neighbor searching. Time permitting, I will briefly touch upon other topics I am interested in such as self-improving algorithms, computational geometry and complexity.

March 13, 2006

Each year more than 240 terabytes of information in printed material is produced (Lyman and Varian, 2003), far more than humans have the capacity to absorb. While ad-hoc document retrieval (e.g. Web search engines) has sped access to these text collections, information needs often concern units of information smaller than an entire document. Information extraction has been proposed as a solution for returning textual information at the granularity of a single fact or relationship, but application of these methods has been limited by the need for extensive manual annotation of training data. In addition, research on information extraction has focused on extraction from single documents in isolation, without regard to the entire corpus context.

This talk proposes the use of minimally supervised fact extraction from multiple documents as an enabling component for high-precision information retrieval. Fact extractors (Phrase-Conditional Likelihood, Naive Bayes, and Conditional Random Fields) are trained from a small set of example facts and found text on the web. The trained systems are then used to extract facts from documents retrieved from the web, and then fusion methods (Viterbi Frequency, Label Marginal Fusion) pick correct facts from a set of proposed candidates. The performance and utility of these fact extraction and fusion methods are empirically evaluated on four tasks: biographic fact extraction, management succession timeline construction, cross-document coreference resolution, and ontology acquisition for question answering.

March 14, 2006

The annotated data required to train most modern natural language processing systems is expensive and domain-specific. We aim to build useful models starting with cheaper data — ideally, raw text. This talk is about weighted grammar induction, an unsupervised learning problem. While much previous work on grammar induction has considered the acquisition of grammar rules, we consider simple, over-generating grammars (for coverage) and focus instead on the weights. We address two problems with the standard maximum likelihood (ML) approach: the desired model is under-specified, and search is hard. We present Contrastive Estimation, a novel, elegant and efficient generalization of ML that allows the learner to exploit basic linguistic intuitions encoded as “implicit” negative evidence. We separately tackle the search problem with a new technique, Structural Annealing, that first focuses the learner on easy explanations, then challenges it with gradually harder problems. We present state-of-the-art performance on unsupervised dependency parsing in six diverse languages.

March 16, 2006

The Internet is probably the most important technological creation of our times. It provides many immensely useful services to the masses for free, including such essentials as web search, web mail, and web portals. These services are expensive to maintain, and depend upon advertisement revenue to remain free. One of the most effective ways to allocate advertisement space on the web is through auctions. In this talk, I will discuss some of the theoretical challenges in the design of online advertisement auctions, and will present some of our recent results addressing these issues. In particular, I will focus on mechanism design for budget-constrained bidders, multi-unit auctions for perishable goods with unknown supply, and dynamics of bid optimization.

March 17, 2006

Modern computation is pervasive. Devices from employee ID badges to dog collars are now expected to securely communicate with their environment, often using a device with limited resources such as an RFID chip. To securely communicate, an RFID chip needs to perform authentication and encryption operations that often exceed its computational resources. One solution is to allow an RFID chip to outsource some of its costly computations to some other (potentially malicious) device in the environment, for example, a nearby computer.

In this talk, I will present a comprehensive security definition encapsulating what it means to *securely* outsource computations to a potentially malicious device. On the negative side, I will discuss some operating environments in which secure outsourcing is impossible to achieve. On the positive side, I will present outsourcing protocols for many desirable operating environments. In fact, we show that for any exponentiation-based cryptographic scheme with n-bit exponents, an RFID chip would normally have to perform O(n) modular multiplications, and yet when following our secure outsourcing protocol this load reduces to O(log^2 n). For example, this may reduce an RFID’s workload from 1 million to 400 operations.
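To see why exponentiation dominates the chip's workload, note that square-and-multiply performs on the order of n modular multiplications for an n-bit exponent. The counter below makes that O(n) baseline concrete; the talk's protocol outsources most of this work, leaving only O(log^2 n) multiplications on the device. This is an illustration of the cost model, not the outsourcing protocol itself.

```python
# Square-and-multiply with an explicit multiplication counter: the O(n)
# cost that secure outsourcing aims to remove from the weak device.

def modexp_count(base, exp, mod):
    """Return (base**exp % mod, number of modular multiplications used)."""
    result, mults = 1, 0
    b = base % mod
    while exp:
        if exp & 1:
            result = (result * b) % mod
            mults += 1
        b = (b * b) % mod               # each squaring is a multiplication too
        mults += 1
        exp >>= 1
    return result, mults

val, mults = modexp_count(7, 2**1024 - 1, 10**9 + 7)
print(mults)  # about 2 * 1024 multiplications for a 1024-bit exponent
```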

This is joint work with Anna Lysyanskaya (Brown University) that appeared in TCC 2005.

March 27, 2006

Automatic translation between natural languages using empirical learning methods is an active research area. Much of the work in this area is concentrated on acquiring models of translation from expensive resources, including large databases containing hundreds of thousands of translated sentences, which are available for only a small number of the language pairs between which one might wish to translate. This talk describes the automatic acquisition of translations into English for unknown foreign language words given the lack of any high-quality translation learning resources, which is the reality for nearly all of the world’s languages. The methods presented in this talk require minimal training supervision; the necessary inputs consist of only a few hundred thousand words of raw news text in a language of interest, such as Uzbek, and a translation dictionary into English for some related language, such as Turkish. We show how diverse measures of cross-language word similarity can be combined to learn translations for a considerable subset of the vocabulary of a language for which no translations are initially known. These similarities include statistical models of related-language cognate words, various cross-language comparisons of how words are used in monolingual text, and the distribution of terms in dated news articles. The ultimate result is a functional translation capability for new language pairs using minimal training resources and almost no human supervision.

March 29, 2006

Consider the problem of assigning consultants to projects. Each consultant should be assigned a project, in such a way that the cost of these assignments is minimized. Cost could represent the travel time of the consultant to the project site, or the ability of the consultant to complete the task. This problem is an instance of the minimum-cost matching problem: one of the most important problems in computer science. Substantial previous work has led to efficient algorithms to compute these matchings. However, the consultant-assignment problem is naturally online, in that we do not know what the list of projects will be in advance. Projects arrive one-by-one, and as each project appears we must assign a consultant. The requirement that we make assignments as we go, without prior knowledge of the list of projects, makes the problem substantially more difficult. Prior work (by Khuller, Mitchell, and Vazirani) has demonstrated that this problem is intractable if the costs are arbitrary. If the costs form a metric (satisfying symmetry and triangle inequality, for example distances along a sphere) then the best possible deterministic guarantee is that the cost of our matching is at most 2K-1 times the optimum, where K is the number of assignments. In this talk, I will describe the first randomized algorithm for the online matching problem. By using a simple randomized greedy technique combined with prior work in metric embeddings (in particular the result of Fakcharoenphol, Rao, and Talwar), I will guarantee that the cost of our matching is at most O(log^3 K) times the optimum. This talk is based on the paper “Randomized Online Algorithms for Minimum Metric Bipartite Matching” which appeared at SODA 2006, and represents joint work with Akash Nanavati and Laura Poplawski.
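The online setting is easy to state in code. Below is a deterministic greedy baseline (assign each arriving request to the nearest unused server, here on a made-up 1-D metric), which already shows how irrevocable online choices can cost more than the offline optimum. This is only the problem setup; the talk's algorithm is a randomized greedy run on a tree embedding of the metric, which this sketch does not implement.

```python
# Online metric matching, greedy baseline: requests arrive one by one and
# each is irrevocably matched to the nearest still-unused server.

def online_greedy_matching(servers, requests):
    free = set(range(len(servers)))
    total = 0.0
    for r in requests:                           # online arrival order
        i = min(free, key=lambda i: abs(servers[i] - r))
        free.remove(i)                           # each server used once
        total += abs(servers[i] - r)
    return total

# Servers at 1 and 4; requests arrive at 2, then 0. Greedy pays 1 + 4 = 5,
# while the offline optimum pairs 1->0 and 4->2 for total cost 3.
print(online_greedy_matching([1, 4], [2, 0]))  # 5.0
```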

March 30, 2006

A (proper) graph coloring is a function which assigns a color to each vertex of a graph, so that no two adjacent vertices receive the same color. A simple greedy strategy can produce such a coloring in linear time, using at most D+1 colors where D is the maximum degree of the graph. If we want to produce not just any coloring, but a uniformly random coloring, in polynomial time, how many colors are needed? To address this question, we will use the Markov chain Monte Carlo approach (MCMC). This is a general and powerful technique for sampling random elements from combinatorially-defined sets. Other well-known applications of MCMC include Google’s PageRank, many dynamical models in statistical physics (e.g. for gas dynamics, ferromagnetism, etc), and approximating the permanent. I will describe a simple Markov chain whose state space is proper graph colorings, and whose distribution converges to uniform as the number of steps tends to infinity. Simulating this chain for a finite number of steps is the best known algorithm for sampling graph colorings approximately uniformly at random. I will tell you about a sequence of results for this chain, each of which introduced a new general technique for the analysis of MCMC algorithms.
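The chain described is, presumably, the standard Glauber dynamics on colorings: repeatedly pick a uniformly random vertex and recolor it with a uniformly random color not used by its neighbors. A small self-contained sketch (the toy graph and step count are made up):

```python
# Glauber dynamics on proper colorings: each step picks a random vertex
# and recolors it uniformly among colors unused by its neighbors. With
# enough colors the chain converges to the uniform distribution over
# proper colorings.

import random

def glauber_step(graph, coloring, num_colors):
    v = random.choice(list(graph))
    blocked = {coloring[u] for u in graph[v]}
    allowed = [c for c in range(num_colors) if c not in blocked]
    coloring[v] = random.choice(allowed)

def sample_coloring(graph, num_colors, steps=10000):
    # Start from any proper coloring, e.g. greedy (uses <= D+1 colors).
    coloring = {}
    for v in graph:
        used = {coloring[u] for u in graph[v] if u in coloring}
        coloring[v] = min(c for c in range(num_colors) if c not in used)
    for _ in range(steps):
        glauber_step(graph, coloring, num_colors)
    return coloring

# 4-cycle with 3 colors: every state visited is a proper coloring.
cycle = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
c = sample_coloring(cycle, 3)
print(all(c[u] != c[v] for u in cycle for v in cycle[u]))  # True
```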

March 31, 2006

As broadband access becomes more prevalent, and the Internet is used for more bandwidth-demanding applications, traffic on the backbone network becomes increasingly unpredictable, at the same time as performance requirements become more stringent. Unfortunately, the current backbone architecture cannot provide the desired flexibility and predictability in performance.

We propose that Valiant Load Balancing (VLB) be used in backbone network design, to provide guaranteed service to all traffic matrices which satisfy some aggregate constraints. The network has the additional advantages of simple routing, quick recovery from component failures, and efficient capacity utilization. Further, Valiant Load Balancing provides a framework for designing a network under a wide variety of realistic constraints.
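The core VLB mechanism is two-phase routing: every flow is first sent to a uniformly random intermediate node and only then to its destination, so each link carries a predictable share of traffic no matter what the traffic matrix looks like. A toy simulation (node count and the adversarial demand are made up):

```python
# Valiant Load Balancing in miniature: route via a random intermediate
# node so that even adversarial traffic spreads evenly over the network.

import random

def vlb_route(src, dst, nodes):
    mid = random.choice(nodes)        # phase 1: random intermediate hop
    return [src, mid, dst]            # phase 2: forward to destination

nodes = list(range(8))
# Adversarial traffic matrix: every node sends to node 0.
load = {n: 0 for n in nodes}
for src in nodes:
    path = vlb_route(src, 0, nodes)
    load[path[1]] += 1                # transit load at the middle hop
print(load)  # transit load is spread across nodes, not piled onto node 0
```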

April 6, 2006

Finding sets of points that conform to an underlying spatial model is a conceptually simple, but potentially expensive, task. In this talk I consider the computational issues inherent in extracting structure from large, noisy data sets. I discuss how a trade-off exists between traditional search algorithms based on spatial data structures and more recent multiple tree algorithms, leaving both approaches with potential drawbacks. I describe a new type of tree-based search algorithm that uses a dynamic, variable number of tree nodes to adapt to both structure in the data and the search state itself. I show that this new approach addresses potential drawbacks in both the constructive and multiple tree approaches. Further, I show that this algorithm leads to significant benefits on a variety of problem domains. While the problem of finding spatial structure arises in a wide range of domains, the primary motivating problem throughout this talk is the task of efficiently linking faint asteroid detections. Here the goal is to link together individual point observations in order to find and extract new asteroid trajectories in the data. Future astronomical surveys will provide a wealth of observational data to this end, greatly increasing both the ability to find new objects and the scale of the problem. Thus efficient algorithms are vital to effectively handle the massive amount of data that is expected.

April 14, 2006

This talk describes a vision-based, large-area, simultaneous localization and mapping (SLAM) algorithm that respects the low-overlap imagery constraints typical of autonomous underwater vehicles (AUVs) while exploiting the inertial sensor information that is routinely available on such platforms. We adopt a systems-level approach exploiting the complementary aspects of inertial sensing and visual perception from a calibrated pose-instrumented platform. This systems-level strategy yields a robust solution to underwater imaging that overcomes many of the unique challenges of a marine environment (e.g., unstructured terrain, low-overlap imagery, moving light source). Our large-area SLAM algorithm recursively incorporates relative-pose constraints using a view-based representation that exploits exact sparsity in the Gaussian canonical form. We show that our algorithmic formulation is inherently sparse unlike other feature-based canonical SLAM algorithms, which impose sparseness via pruning approximations. In particular, we investigate the sparsification methodology employed by sparse extended information filters (SEIFs) and offer new insight as to why, and how, its approximation can lead to inconsistencies in the estimated state errors. In summary, our work advances the current state-of-the-art in underwater visual navigation by demonstrating end-to-end automatic processing of the largest visually navigated dataset to date using data collected from a survey of the RMS Titanic (path length over 3 kilometers, and 3100 square meters of mapped area). This accomplishment embodies the summed contributions to several current SLAM research issues including scalability, 6 degree of freedom motion, unstructured environments, and visual perception.

April 20, 2006

We discuss how to obtain the accurate and globally consistent self-calibration of a distributed camera network, in which cameras and processing nodes may be spread over a wide geographical area, with no centralized processor and limited ability to communicate a large amount of information over long distances. First, we describe how to estimate the vision graph for the network, in which each camera is represented by a node, and an edge appears between two nodes if the two cameras jointly image a sufficiently large part of the environment. We propose an algorithm in which each camera independently composes a fixed-length message that is a lossy representation of a subset of detected features, and broadcasts this “feature digest” to the rest of the network. Each receiver camera decompresses the feature digest to recover approximate feature descriptors, robustly estimates the epipolar geometry to reject outliers and grow additional matches, and decides whether sufficient evidence exists to form a vision graph edge. Second, we present a distributed camera calibration algorithm based on belief propagation, in which each camera node communicates only with its neighbors in the vision graph. The natural geometry of the system and the formulation of the estimation problem give rise to statistical dependencies that can be efficiently leveraged in a probabilistic framework. The camera calibration problem poses several challenges to information fusion, including missing data, overdetermined parameterizations, and non-aligned coordinate systems. We demonstrate the accurate and consistent performance of the vision graph generation and camera calibration algorithms using a simulated 60-node outdoor camera network.

April 21, 2006

Graph partitioning is one of the fundamental optimization problems that has been extensively studied both in theory and practice. In this talk, we shall consider the problem of approximating the sparsest cuts in graphs. Almost all the previous algorithms for this problem use either eigenvector computations, multi-commodity flows, or solving semi-definite programs as sub-routines. While eigenvector based approaches yield poor approximation guarantees, the multi-commodity flow or SDP based algorithms are slow. We show that the sparsest cuts can be approximated within a factor of O(log^2 n) using single commodity max-flow computations, which are fast in theory and perhaps more so in practice. Our algorithm iteratively computes max-flows to embed a complete graph and yields a certificate of expansion. Our technique can also be extended to obtain fast approximations for the balanced separator problem. This talk is a sequel of my previous talk and is joint work with Satish Rao and Umesh Vazirani (UC Berkeley).
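For readers unfamiliar with the objective: one common normalization defines the sparsity of a cut (S, V\S) as the number of crossing edges divided by the size of the smaller side, and the sparsest cut minimizes this over all S. Brute force is exponential but fine on a toy graph, and makes the quantity the talk's algorithm approximates concrete; the example graph is made up.

```python
# Sparsest cut by brute force: sparsity(S) = crossing edges / min(|S|, |V\S|).

from itertools import combinations

def sparsity(edges, nodes, S):
    crossing = sum(1 for u, v in edges if (u in S) != (v in S))
    return crossing / min(len(S), len(nodes) - len(S))

def sparsest_cut_brute(edges, nodes):
    best = None
    for k in range(1, len(nodes) // 2 + 1):
        for S in combinations(nodes, k):
            s = sparsity(edges, nodes, set(S))
            if best is None or s < best[0]:
                best = (s, set(S))
    return best

# Two triangles joined by a single bridge edge (2, 3): the sparsest cut
# separates the triangles, with sparsity 1/3.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(sparsest_cut_brute(edges, [0, 1, 2, 3, 4, 5]))
```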

April 27, 2006

Volume rendering is an important visualization technique for medical researchers but is computationally expensive. Most labs have multiple computers available, but no one computer may have the ability to render at sufficient speeds. To address this, I have developed an algorithm and implementation for distributing volume rendering over a network of heterogeneous computers that efficiently utilizes the available computational resources. Additionally, the system dynamically adapts to distributed user load so as to minimize the performance impact on local users of the remote systems.

May 12, 2006

Obtaining the complete set of proteins for each eukaryotic organism is an important step in the process of better understanding how life evolves and functions. The complex physiology of eukaryotic cells, however, makes direct observation of proteins and their parent genes difficult to achieve. An organism’s genome provides the raw data, which contains the ‘hidden’ set of instructions for generating the complete protein set. Computational gene prediction systems, therefore, play an important role in identifying protein sets using information extracted from the sequenced genome.

This talk will discuss the problem of computational gene prediction in eukaryotic genomes and present a framework for predicting single isoform protein coding genes and overlapping alternatively spliced exons. Incorporating diverse sources of gene structure evidence is shown to lead to substantial improvements in prediction accuracy with performance beginning to match the accuracy of expert human annotators. Alternative exon prediction experiments are discussed, which show accurate prediction of alternatively spliced exons in known genes without relying on evidence from gene expression data.

May 16, 2006

We study the Unsplittable Flow Problem (ufp) on line graphs and cycles, focusing on the long-standing open question of whether the problem is APX-hard. We describe a deterministic quasi-polynomial time approximation scheme for ufp on line graphs, thereby ruling out an APX-hardness result, unless $$\mathrm{NP} \subseteq \mathrm{DTIME}(2^{\mathrm{polylog}(n)})$$. Our result requires a quasi-polynomial bound on all edge capacities and demands in the input instance. We extend this result to undirected cycle graphs.

Earlier results on this problem included a polynomial time $$(2+\epsilon)$$-approximation under the assumption that no demand exceeds any edge capacity (the “no-bottleneck assumption”) and a super-constant integrality gap if this assumption did not hold. Unlike most earlier work on ufp, our results do not require a no-bottleneck assumption.

This is a joint work with Nikhil Bansal, Amit Chakrabarti and Baruch Schieber.

June 8, 2006

The talk deals with vertex coloring algorithms: given a graph G, find a coloring of the vertices so that no two neighbors in G have the same color. Distributed algorithms that find a (Delta+1)-coloring in a logarithmic number of communication rounds, with high probability, have been known for more than a decade. But what if the edges have orientations, i.e., the endpoints of an edge agree on its orientation (while bits may still flow in both directions)?

Interestingly, for the cycle in which all edges have the same orientation, we show that a simple randomized algorithm can achieve a 3-coloring with only O(sqrt{log n}) rounds of bit transmissions, with high probability (w.h.p.). This result is tight because we also show that the bit complexity of coloring an oriented cycle is Omega(sqrt{log n}), w.h.p., no matter how many colors are allowed. The 3-coloring algorithm can be easily extended to provide a (Delta+1)-coloring for oriented graphs of maximum degree Delta in O(sqrt{log n}) rounds of bit transmissions, w.h.p., if Delta is a constant, and the graph does not contain an oriented cycle of length less than sqrt{log n}. Using more complex algorithms, we show how to obtain an O(Delta)-coloring for arbitrary oriented graphs with maximum degree Delta and with no oriented cycles of length at most sqrt{log n}, using essentially O(log Delta + sqrt{log n}) rounds of bit transmissions.