## Spring 2008

*Student Seminar*

January 8, 2008

Recent advances in graphics hardware have transformed Graphics Processing Units (GPUs) into powerful parallel machines: modern GPUs consist of hundreds of processors backed by high-performance memory. To take advantage of this hardware, we need to design data structures and algorithms that adapt well to the architecture. Meanwhile, due to advances in 3D scanning technology, increasingly detailed 3D models, ranging from millions to billions of triangles, are used in movies, games, and simulations. In this talk, I will show how to prepare such 3D models for efficient processing by modern GPUs. I will present the seamless texture atlas (STA), a new GPU-friendly, efficient representation for 3D geometry suited to large-scale data processing. Using STAs, we advance the state of the art in three graphics applications: seamless texturing of large 3D models, view-dependent visualization of huge models, and geometry processing of 3D models.

**Speaker Biography**: Budirijanto Purnomo was born in Pontianak, Indonesia. After completing high school in Jakarta, Indonesia, he came to the United States for advanced study. He obtained a BS and an MS in Computer Science from Michigan Technological University in 1999 and 2001, respectively, and then pursued a Ph.D. in Computer Science with a specialization in computer graphics at Johns Hopkins University. During his studies, he completed internships with Lawrence Livermore National Laboratory and Pixar Animation Studios. After graduation, he will join the 3D Application Research Group at AMD. His research interests are interactive computer graphics, visualization, and compiler optimization.

January 31, 2008

Imagine having to choose among a few compression schemes to compress a very long file. Before deciding on a scheme, you might want to obtain a rough estimate of how well each one performs on your file. Another motivating scenario comes from areas such as data mining, natural language processing, and genomics, where compression schemes are used as tools for measuring properties of strings, such as similarity and entropy. In these applications, one typically needs only the length of the compressed version of a file, not the compressed output itself.

In this talk, we consider the question of approximating the compressibility of a string with respect to a fixed compression scheme in sublinear time. We will concentrate on run-length encoding and a version of Lempel-Ziv as our example compression schemes, and present algorithms and lower bounds for approximating compressibility with respect to both.
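
To make the sublinear-time flavor concrete, here is a hedged toy sketch (not the algorithm from the talk) for run-length encoding: the RLE-compressed length is determined by the number of runs, a run starts exactly where a character differs from its predecessor, so sampling random positions estimates the run count to within additive error.

```python
import random

def estimate_rle_runs(s, num_samples=1000, seed=0):
    """Estimate the number of runs in s (and hence its RLE-compressed
    length, up to the per-run encoding cost) by sampling positions and
    checking whether each one starts a new run."""
    rng = random.Random(seed)
    n = len(s)
    hits = 0
    for _ in range(num_samples):
        i = rng.randrange(1, n)   # sample a position 1 .. n-1
        if s[i] != s[i - 1]:      # a new run starts at position i
            hits += 1
    # Scale the sampled boundary fraction back up; +1 for the first run.
    return (n - 1) * hits / num_samples + 1
```

The number of samples controls the additive error and is independent of the string length, which is the essence of a sublinear-time estimator.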

The second problem we consider is approximating the support size of a distribution from a small number of samples, when each element in the distribution’s support appears with probability at least 1/n. Our motivation is the tight relationship between this problem and that of estimating compressibility with respect to Lempel-Ziv. Estimating distribution support size is also studied in other areas of computer science, along with an important variant: estimating the number of distinct elements in a sequence of length n.

For all three problems (distribution support size, distinct elements, and Lempel-Ziv compressibility), we prove a nearly linear (in n) lower bound on the query complexity, applicable even to approximation with additive error. At the heart of the lower bound is a construction of two positive integer random variables, X and Y, with very different expectations that satisfy the following condition on the first k moments: E[X]/E[Y] = E[X^2]/E[Y^2] = … = E[X^k]/E[Y^k]. Our lower bound method is also applicable to other problems; in particular, it gives a new lower bound on the sample complexity of approximating the entropy of a distribution.

Based on joint work with Dana Ron, Ronitt Rubinfeld and Adam Smith, published in RANDOM 2007 and joint work with Dana Ron, Amir Shpilka and Adam Smith, published in FOCS 2007.

February 7, 2008

Homology search, finding similar parts between two sequences, is the most fundamental and popular task in bioinformatics. Traditional homology search technology is either too slow or too insensitive, and when it does return something, the results are often just non-specific fragments of alignments. We introduce new ideas, including a new mathematical theory of optimized spaced seeds, that allow modern homology search to achieve high sensitivity, high specificity, and high speed simultaneously. The spaced seed methodology is implemented in our PatternHunter software, as well as in most other modern homology search software, serving thousands of queries daily. We also introduce ZOOM, which maps short reads at 5x coverage of a human genome in one CPU-day.
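
The spaced-seed idea can be illustrated with a small sketch (a hedged toy, not the PatternHunter implementation): a seed is a 0/1 pattern whose '1' (care) positions must all align with matches, and spreading the care positions out makes a hit more likely in a diverged region. The seed `111010010100110111` below is the well-known weight-11 spaced seed from PatternHunter; the example region is invented for illustration.

```python
def seed_hits(match, seed):
    """Alignment positions where every '1' (care position) of the seed
    lines up with a match ('1') in the 0/1 match/mismatch string."""
    care = [i for i, c in enumerate(seed) if c == "1"]
    return [p for p in range(len(match) - len(seed) + 1)
            if all(match[p + i] == "1" for i in care)]

# Two weight-11 seeds: a contiguous seed vs. the spaced PatternHunter seed.
contiguous = "1" * 11
spaced = "111010010100110111"

# An 18-long region with mismatches only at the spaced seed's don't-care
# positions 3, 8, and 14: the spaced seed still hits at position 0, while
# the region never contains 11 consecutive matches for the contiguous seed.
region = "".join("0" if i in (3, 8, 14) else "1" for i in range(18))
```

Averaged over random 70%-identity regions, such a spaced seed detects noticeably more true homologies than a contiguous seed of the same weight, which is the sensitivity gain the theory quantifies.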

Joint work with Bin Ma, John Tromp, X.F. Cui, B. Brejova, T. Vinar, D. Shasha

**Speaker Biography**: Ming Li is a Canada Research Chair in Bioinformatics and a professor of Computer Science at the University of Waterloo. He is a fellow of the Royal Society of Canada, the ACM, and the IEEE, and a recipient of Canada’s E.W.R. Steacie Fellowship (1996) and the Killam Fellowship (2001). Together with Paul Vitanyi, he pioneered the applications of Kolmogorov complexity, and the two co-authored the book “An Introduction to Kolmogorov Complexity and Its Applications”. His main recent research focus is protein structure prediction.

February 21, 2008

A “verifying compiler” (a tool that, in its purest form, would compile only programs that it could prove correct) has long been a dream of the software formal methods community. Jim King apparently coined the term in his 1969 Ph.D. thesis, and 35 years later Tony Hoare proposed the development of a verifying compiler as a “Grand Challenge for Computing Research”. What have we been doing all this time?!? Visions of verification foundations and technology that would qualify as meeting Hoare’s challenge have driven the research of the OSU Reusable Software Research Group for two decades. In this talk, I will try to explain how a verifying compiler might work, why it would be a significant advance over current practice, why it is still considered a “challenge”, and why there is hope that a verifying compiler (hence, verified software) is truly feasible.

**Speaker Biography**: Bruce W. Weide is Professor and Associate Chair of Computer Science and Engineering at The Ohio State University, where he co-directs the Reusable Software Research Group with Bill Ogden. His research interests include all aspects of software component engineering, especially in applying RSRG work to practice and in teaching its principles to beginning CS students. He and colleague Tim Long were awarded the IEEE Computer Society’s 2000 Computer Science and Engineering Undergraduate Teaching Award for their work in the latter area. Weide holds a Ph.D. in Computer Science from Carnegie Mellon University and a B.S.E.E. from the University of Toledo. He is a member of the IEEE, ACM, and UCS.

February 28, 2008

Proteins are large, complex molecules that perform the critical biological functions of molecular recognition, catalysis, and transport and energy conversion. They can do this because they are able to adopt stable three-dimensional structures that hold functional groups in precise orientations consistent with their function. They are like tiny assembly-line robots that recognize specific molecules and alter them according to their DNA-programmed function. Proteins are very difficult to design because they adopt their three-dimensional structures through a very complex process called protein-folding. If we are to develop the ability to rapidly create large molecules that can carry out complex functions like proteins do, we need to develop new ways of synthesizing large molecules with programmable shapes.

Over the past seven years, my laboratory has developed a radical new approach to creating large, complex molecules that can carry out complex functions in the way that biological proteins do. Our approach is to synthesize new stereochemically pure cyclic bis-amino acid building blocks (Molecular Lego) that we couple through pairs of amide bonds to create large molecules with programmed shapes. The oligomers are efficiently assembled on solid support using peptide synthesis techniques to first create a flexible oligomer that is then rigidified by the simultaneous formation of a second set of amide bonds between each adjacent pair of monomers. The structure of the resulting ladder-like molecules is controlled by the sequence and stereochemistry of the component monomers. The oligomer structures made accessible by this technology range from extended molecular rods to compact structures containing small-molecule sized cavities. These oligomers can be functionalized in a variety of ways to carry out different functions.

We are developing a software package that will enable anyone to construct oligomer sequences with designed shapes, either interactively or through automated computer searches of sequence space. Our goal is to combine this software and molecular “hardware” to create a sort of “molecular compiler” that can take a high level specification of a function (eg: catalyze a particular reaction) and convert it into a molecular implementation that will carry out that function.

**Speaker Biography**: Chris Schafmeister recently joined Temple University as an Associate Professor in Chemistry. Prior to that he was an Assistant Professor at the University of Pittsburgh and a Postdoctoral Fellow at Harvard. He received his PhD in Biophysics from UCSF. Among other awards, Chris received the Feynman Prize for Experimental Nanotechnology in 2005. His current research focus is on developing applications for functional macromolecules constructed from asymmetric molecular building blocks.

March 4, 2008

Algorithms are different from programs and should not be described with programming languages. For example, algorithms are usually best described in terms of mathematical objects like sets and graphs instead of the primitive objects like bytes and integers provided by programming languages. Until now, the only simple alternative to programming languages has been pseudo-code. +CAL is an algorithm language based on TLA+. A +CAL algorithm is automatically translated to a TLA+ specification that can be checked with the TLC model checker or reasoned about formally. (No knowledge of TLA+ is assumed). +CAL makes pseudo-code obsolete.

**Speaker Biography**: Dr. Lamport is best known as the author of LaTeX, a document formatting system for people who write formulas instead of drawing pictures. This naturally led him to join Microsoft, a company with little interest in such people. He is also known for writing: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” which established him as an expert on distributed systems. Among his other contributions is the TLA+ specification language–a Quixotic attempt to overcome engineers’ fear of and computer scientists’ antipathy towards mathematics.

March 13, 2008

This talk presents an optimal-rate routing protocol in the public-key setting for synchronous networks in the presence of a malicious polynomial-time adversary. Even if a majority of the nodes are controlled by a malicious adversary and the topology of the network changes at every round, as long as some path of non-corrupted nodes connects the sender and the receiver at each round (though this path may change from round to round), we construct a routing protocol with bounded memory per processor that achieves an optimal packet transfer rate.

The routing protocol assumes no knowledge of which nodes are corrupted nor which path is reliable at any round. Our result, which can be interpreted as the first interactive coding theorem for malicious networks, states that the protocol cannot be affected in a meaningful way by any polynomial-time malicious adversary whose goal is to disrupt the communication as much as possible, including (but not limited to) deletion, insertion, and replacement of any content, not responding or responding incorrectly to information requests, or any other arbitrary deviation from prescribed protocol rules. The talk will be self-contained and accessible to the general audience.

Joint work with Yair Amir and Rafail Ostrovsky

**Speaker Biography**: Paul Bunn is a fifth-year graduate student in the math department at the University of California, Los Angeles, working with Prof. Rafail Ostrovsky. Paul’s research focuses on problems in number theory and theoretical computer science, including cryptography, networks, and distributed algorithms. He was awarded NSF VIGRE research fellowships in 2006 and 2007, is an active member of the Cryptography research group and the Number Theory group at UCLA, and has been a Sigma Xi member since 2001. Paul was one of six graduate students recognized by the math department for excellence in teaching in 2006. He graduated with distinction from Duke University in 2001 with a double major in physics and computer science.

March 25, 2008

Web search engines sift through billions of documents to identify those most likely to be relevant to a short natural language query. This functionality can be exploited to relate queries not just to documents but also to other concepts and queries. In this talk, we describe several applications of this principle, including the generation of query refinement suggestions for interactive search assistance and the discovery of alternative descriptors for an advertiser’s product space.

**Speaker Biography**: Peter Anick is a member of the Applied Sciences group at Yahoo! where he currently works on developing infrastructure and tools for supporting online query assistance, such as Yahoo’s recently released “Search Assist” product. He received his PhD in computer science from Brandeis University in 1999. Prior to that, he worked for many years in Digital Equipment Corporation’s Artificial Intelligence Technology groups on applications of computational linguistics, including online text search for customer support and natural language interfaces for expert and database systems, and subsequently at AltaVista and Overture. His research interests include intelligent information retrieval, user interfaces for exploratory search, text data mining and lexical semantics of nouns and noun compounds. He is a member of ACM SIGIR, former editor of SIGIR Forum and current workshops program chair for SIGIR’08.

April 17, 2008

The *quality* of an embedding $$\Phi: V \to \mathbb{R}^2$$ of a graph $$G = (V, E)$$ into the Euclidean plane is the ratio of $$\max_{\{u, v\} \in E} \|\Phi(u) - \Phi(v)\|$$ to $$\min_{\{u, v\} \notin E} \|\Phi(u) - \Phi(v)\|$$. This talk will focus on the problem of producing a “good” quality embedding of a given *unit disk graph* (UDG). Any UDG, by definition, has an embedding with quality less than 1. We present a polynomial time algorithm that computes an $$O(\log^{2.5} n)$$-quality embedding of a given UDG. A key step of this algorithm is the construction of a “growth-restricted approximation” of the given UDG.

Our problem is a version of the well known *localization* problem in wireless sensor networks, in which network nodes are required to compute virtual 2-dimensional Euclidean coordinates given little or (as in our case) no geometric information.
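
As a concrete reading of the quality definition (an illustrative helper, not the talk's algorithm), the quality of a candidate embedding is simply the longest embedded edge divided by the shortest embedded non-edge; the tiny unit disk graph below is an invented example.

```python
import math
from itertools import combinations

def quality(points, edges):
    """Quality of an embedding: (longest embedded edge) divided by
    (shortest embedded non-edge).  `points` maps vertices to 2D
    coordinates; `edges` lists the graph's edges."""
    edge_set = {frozenset(e) for e in edges}
    longest_edge = max(math.dist(points[u], points[v]) for u, v in edges)
    shortest_non_edge = min(math.dist(points[u], points[v])
                            for u, v in combinations(points, 2)
                            if frozenset((u, v)) not in edge_set)
    return longest_edge / shortest_non_edge

# A tiny unit disk graph: vertices within distance 1 are adjacent.
points = {"a": (0.0, 0.0), "b": (0.9, 0.0), "c": (2.5, 0.0)}
edges = [("a", "b")]   # quality = 0.9 / 1.6, below the UDG bound of 1
```

A quality below 1 certifies that every edge is drawn shorter than every non-edge, which is exactly what a faithful unit-disk realization requires.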

*Student Seminar*

April 22, 2008

Building a statistical language model (LM) is a challenging task that involves handling unseen events and assigning proper nonzero probabilities to all possible word sequences based on only limited observations. We investigate data sharing with statistical/linguistic knowledge encoded as word classes/labels. A maximum entropy token-based language model (METLM) is proposed as a framework for incorporating word label information in language modeling. While conventional LMs model word sequences, a METLM directly predicts distributions of tokens (couplings of a word and its labels). The probability of a word sequence is computed by marginalization, where all of its possible token-sequence realizations are considered. With features capturing explicit local linguistic dependencies based on word/label n-grams, this model avoids further data sparseness with the more specific tokens. Moreover, it also enables data sharing through large-granularity word labels, which can be either syntactic word tags or semantic word classes.

We also explore semantic data sharing by investigating automatic semantic tagging of all nouns in the corpus with a unique set of labels. We address large-scale semantic analysis on a refined set of semantic labels defined in the Longman dictionary. We employ maximum-entropy-based classifiers and random-forest-based classifiers as basic tagging techniques and compare their performance. Many modeling strategies with various linguistically motivated or data-driven indicators are proposed, examined, and compared. Automatic semantic labels are further evaluated in language modeling.
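
The marginalization step can be sketched with a hypothetical toy (invented words and labels, and a uniform stand-in for the trained maximum entropy token model): the probability of a word sequence is the sum of the probabilities of all of its token-sequence realizations.

```python
from itertools import product

# Hypothetical toy lexicon: each word may carry one of several labels.
labels = {"time": ("NOUN",), "flies": ("NOUN", "VERB"), "fast": ("ADV", "ADJ")}

def p_token(token, prev_token):
    """Stand-in for a trained maximum entropy token model; here simply
    uniform over a word's possible labels, ignoring the history."""
    word, _label = token
    return 1.0 / len(labels[word])

def p_sentence(words):
    """Word-sequence probability: marginalize over every possible
    token-sequence (word, label) realization of the sentence."""
    total = 0.0
    for labelling in product(*(labels[w] for w in words)):
        p, prev = 1.0, None
        for token in zip(words, labelling):
            p *= p_token(token, prev)
            prev = token
        total += p
    return total
```

With the uniform stand-in, the label probabilities for each word sum to one, so the marginal word-sequence probability is 1 by construction; a trained model would instead weight realizations by their n-gram features.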

**Speaker Biography**: Jia Cui is a PhD student in the Computer Science Department and a member of the Center for Language and Speech Processing (CLSP) at Johns Hopkins University. She received her BS degree in Computer Science from the University of Science and Technology of China, Hefei, China, and her MS degree in Artificial Intelligence from the Chinese Academy of Sciences, Beijing, China. She now works for the IBM T.J. Watson Research Center, Yorktown Heights, NY. Her research interests include automatic speech recognition and natural language processing.

April 24, 2008

In this talk I describe my previous work on storage-based intrusion detection. This technique preserves system integrity by allowing storage system components to watch for, and respond to, data modifications that are characteristic of malicious intrusions. Storage systems are able to spot several common intruder actions, such as adding backdoors, inserting Trojan horses, and tampering with audit logs. Further, an intrusion detection system (IDS) embedded in a storage device continues to operate effectively even after client systems or network devices are compromised.
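
As a toy illustration of the idea (an assumption-laden sketch, not the system described in the talk), a storage-level watcher can enforce one such rule: an audit log may only grow, and its previously observed bytes must never change.

```python
import hashlib

class AppendOnlyWatcher:
    """Toy storage-side rule: an audit log may only grow, and the prefix
    the watcher has already seen must remain byte-for-byte identical."""

    def __init__(self):
        self.seen_len = 0
        self.prefix_hash = hashlib.sha256().hexdigest()  # hash of b""

    def check(self, data: bytes) -> str:
        if len(data) < self.seen_len:
            return "ALERT: log truncated"
        if hashlib.sha256(data[:self.seen_len]).hexdigest() != self.prefix_hash:
            return "ALERT: history rewritten"
        # The new contents are consistent; remember them for next time.
        self.seen_len = len(data)
        self.prefix_hash = hashlib.sha256(data).hexdigest()
        return "ok"
```

Because the check runs inside the storage system, an intruder who has fully compromised the client host still cannot rewrite the log's history without tripping it.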

April 29, 2008

The increased popularity of IEEE 802.11 WLANs has led to dense deployments in urban areas. Such high density leads to sub-optimal performance unless the interfering networks learn how to optimally share the spectrum. In this talk I will describe three distributed algorithms that allow (i) multiple interfering 802.11 WLANs to select their operating frequency in a way that minimizes global interference, (ii) clients to choose their Access Point so that the bandwidth of all interfering networks is shared optimally, (iii) APs and clients to select their transmission power and Clear Channel Assessment (CCA) threshold so as to maximize the overall network capacity. All algorithms fall into a unified framework based on Gibbs sampling and optimize global network performance based on local information. They do not require explicit coordination among the wireless devices, and can thus operate in diverse cooperative environments with no single administrative authority. We study implementation requirements and show that significant benefits can be gained using a prototype implementation on the Intel 2915 ABG wireless network interface cards.
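
The common structure of the three algorithms can be sketched as follows (a simplified, hypothetical instance of the Gibbs-sampler framework for channel selection only; the parameter names and interference model are invented for illustration): each AP in turn resamples its operating channel with probability proportional to exp(-local interference / temperature), using only locally observable information.

```python
import math
import random

def gibbs_channel_selection(interference, n_aps, n_channels,
                            temperature=0.5, iters=200, seed=0):
    """Toy distributed channel selection via Gibbs sampling.
    interference[i][j] is the pairwise interference between APs i and j
    when they share a channel.  Low temperatures concentrate the sampler
    on low-interference (near-optimal) channel assignments."""
    rng = random.Random(seed)
    channel = [rng.randrange(n_channels) for _ in range(n_aps)]
    for _ in range(iters):
        i = rng.randrange(n_aps)
        # Interference AP i would experience on each candidate channel.
        energy = [sum(interference[i][j] for j in range(n_aps)
                      if j != i and channel[j] == c)
                  for c in range(n_channels)]
        weights = [math.exp(-e / temperature) for e in energy]
        channel[i] = rng.choices(range(n_channels), weights=weights)[0]
    return channel
```

No explicit coordination is needed: each AP only evaluates its own local energy, yet the stationary distribution of the sampler concentrates on assignments that minimize global interference.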

**Speaker Biography**: Konstantina (Dina) Papagiannaki received her first degree in electrical and computer engineering from the National Technical University of Athens, Greece, in 1998, and her PhD degree from the University College London, U.K., in 2003. From 2000 to 2004, she was a member of the IP research group at the Sprint Advanced Technology Laboratories. She is currently with Intel Research in Pittsburgh, after 3 years at Intel Research in Cambridge, UK. Her research interests are in Internet measurements, modeling of Internet traffic, security, network design and planning, and infrastructure wireless networks.

May 1, 2008

The introduction of randomization in the design and analysis of algorithms for matrix computations (such as matrix multiplication, least-squares regression, the Singular Value Decomposition (SVD), etc.) over the last decade provided a new paradigm and a complementary perspective to traditional numerical linear algebra approaches. These novel approaches were motivated by technological developments in many areas of scientific research that permit the automatic generation of large data sets, which are often modeled as matrices. In this talk we will focus on the Singular Value Decomposition (SVD) of matrices and the related Principal Components Analysis, which have found numerous applications in extracting structure from datasets in diverse domains, ranging from the social sciences and the internet to biology and medicine. The extracted structure is expressed in terms of singular vectors, which are linear combinations of all the input data and lack an intuitive physical interpretation. We shall discuss matrix decompositions which express the structure in a matrix as linear combinations of actual columns (or rows) of the matrix. Such decompositions are easier to interpret in applications, since the selected columns and rows are subsets of the data. Our (randomized) algorithms run in cubic time and come with strong accuracy guarantees. Finally, we will demonstrate how these decompositions may be applied in order to identify ancestry-informative DNA markers that may be used to assign individuals to populations of origin.
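
To give a flavor of randomized column selection (a simplified, classic relative of the decompositions in the talk, which use more refined sampling with provable accuracy guarantees), one can pick actual data columns with probability proportional to their squared Euclidean norms:

```python
import random

def sample_columns(A, k, seed=0):
    """Pick k column indices of A (given as a list of rows) with
    probability proportional to squared column norms -- a simple
    randomized column-selection rule.  The selected columns form an
    interpretable factor made of actual data, unlike singular vectors,
    which mix all the input columns together."""
    rng = random.Random(seed)
    n_cols = len(A[0])
    sq_norms = [sum(row[j] ** 2 for row in A) for j in range(n_cols)]
    return rng.choices(range(n_cols), weights=sq_norms, k=k)
```

Because the output is a subset of real columns, a domain expert can read the selection directly, e.g. as a short list of ancestry-informative DNA markers rather than an opaque linear combination of all markers.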

**Speaker Biography**: Prof. Drineas is an assistant professor in the Computer Science Department of Rensselaer Polytechnic Institute, which he joined in 2003. Prof. Drineas earned a doctorate in computer science from Yale University in 2003 and a bachelor’s degree in computer engineering from the University of Patras, Greece, in 1997. His research interests lie in the area of the design and analysis of algorithms, and in particular the design and analysis of randomized algorithms for linear algebraic problems. Prof. Drineas is the recipient of an NSF CAREER award.