## Spring 2013

*Distinguished Lecturer*

January 31, 2013

Mathematical logic was developed in an effort to provide formal foundations for mathematics. In this quest, which ultimately failed, logic begat computer science, yielding both computers and theoretical computer science. But then logic turned out to be a disappointment as a foundation for computer science, as almost all decision problems in logic are either undecidable or intractable. Starting from the mid-1970s, however, there has been a quiet revolution in logic in computer science, and problems that are theoretically undecidable or intractable were shown to be quite feasible in practice. This talk describes the rise, fall, and rise of logic in computer science, covering several modern applications of logic to computing, including databases, hardware design, and software engineering.

**Speaker Biography**: Moshe Vardi is the George Distinguished Service Professor in Computational Engineering and Director of the Ken Kennedy Institute for Information Technology at Rice University. He is the co-recipient of three IBM Outstanding Innovation Awards, the ACM SIGACT Goedel Prize, the ACM Kanellakis Award, the ACM SIGMOD Codd Award, the Blaise Pascal Medal, the IEEE Computer Society Goode Award, and the EATCS Distinguished Achievements Award. He is the author and co-author of over 400 papers, as well as two books: Reasoning about Knowledge and Finite Model Theory and Its Applications. He is a Fellow of the Association for Computing Machinery, the American Association for Artificial Intelligence, the American Association for the Advancement of Science, and the Institute of Electrical and Electronics Engineers. He is a member of the US National Academy of Engineering, the American Academy of Arts and Sciences, the European Academy of Sciences, and Academia Europaea. He holds honorary doctorates from Saarland University in Germany and Orleans University in France. He is the Editor-in-Chief of the Communications of the ACM.

February 1, 2013

An “oblivious subspace embedding” (OSE) is a distribution over matrices S such that for any low-dimensional subspace V, with high probability over the choice of S, ||Sx||_2 approximately equals ||x||_2 (up to 1+eps multiplicative error) for all x in V simultaneously. Sarlos in 2006 pioneered the use of OSE’s for speeding up algorithms for several numerical linear algebra problems. Problems that benefit from OSE’s include: approximate least squares regression, low-rank approximation, approximating leverage scores, and constructing good preconditioners.

We give a class of OSE distributions we call “oblivious sparse norm-approximating projections” (OSNAP) that yield matrices S with few rows that are also extremely sparse, yielding improvements over recent work in this area by Clarkson and Woodruff. In particular, we show S can have O(d^2) rows and 1 non-zero entry per column, or even O(d^{1+gamma}) rows and O_gamma(1) non-zero entries per column for any desired constant gamma>0. When applying the latter bound for example to the approximate least squares regression problem of finding x to minimize ||Ax – b||_2 up to a constant factor, where A is n x d for n >> d, we obtain an algorithm with running time O(nnz(A) + d^{omega + gamma}). Here nnz(A) is the number of non-zero entries in A, and omega is the exponent of square matrix multiplication.

Our main technical result is essentially a Bai-Yin type theorem in random matrix theory and is likely to be of independent interest: i.e. we show that for any U in R^{n x d} with orthonormal columns and random sparse S with appropriately chosen entries and sufficiently many rows, all singular values of SU lie in the interval [1-eps, 1+eps] with good probability.
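The flavor of the construction can be seen in a small numerical experiment. The sketch below builds a CountSketch-style matrix with a single random ±1 entry per column and checks the singular values of SU for a random orthonormal U; it is an illustration of the sparse-embedding idea, not the paper's construction or analysis, and the row count here is an arbitrary choice:

```python
import numpy as np

def sparse_embedding(m, n, rng):
    """CountSketch-style matrix: one random +/-1 entry per column."""
    S = np.zeros((m, n))
    rows = rng.integers(0, m, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    S[rows, np.arange(n)] = signs
    return S

rng = np.random.default_rng(0)
n, d = 5000, 10
U, _ = np.linalg.qr(rng.standard_normal((n, d)))  # orthonormal columns
S = sparse_embedding(m=8 * d * d, n=n, rng=rng)   # O(d^2) rows, 1 nonzero/col
sv = np.linalg.svd(S @ U, compute_uv=False)
print(sv.min(), sv.max())  # all singular values of SU cluster around 1
```

Even at this toy scale the singular values of SU stay close to 1, matching the [1-eps, 1+eps] guarantee qualitatively, while applying S costs only one operation per non-zero of the input.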

Joint work with Huy Lê Nguyễn (Princeton).

**Speaker Biography**: Jelani Nelson is a postdoctoral researcher in theoretical computer science at the Institute for Advanced Study, and will begin as an Assistant Professor of Computer Science at Harvard University in Fall 2013. Jelani’s research interests include algorithms and complexity theory, with specific focuses on algorithms for processing massive datasets and pseudorandomness.

February 5, 2013

The nature of signal processing and machine learning has evolved dramatically over the years as we try to investigate increasingly intricate, dynamic and large-scale systems. This development is accompanied by an explosion of massive, unlabeled, multimodal, corrupted and very high-dimensional “big data”, which poses new challenges for efficient analysis and learning. In this talk, I will advocate a learning approach based on “stochastic approximation”, wherein a single data point is processed at each iteration using a computationally simple update, to address these challenges. I will start by presenting a stochastic approximation (SA) meta-algorithm for unsupervised learning with large high-dimensional datasets. I will then describe the application of the SA algorithm to a multiview learning framework, where multiple modalities are available at the time of training but not for prediction at test time, and a similarity-based learning framework where data is observed only in the form of pairwise similarities. I will conclude with a theoretical analysis of the SA algorithm and a discussion about the pitfalls of SA approaches and the remedies thereof.
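As a concrete, much simpler instance of the stochastic-approximation template — one data point, one computationally cheap update — here is Oja's rule for streaming estimation of the top principal component. This is a standard textbook algorithm used for illustration, not the SA meta-algorithm of the talk; the learning rate and the synthetic data model are arbitrary:

```python
import numpy as np

def oja_top_component(stream, dim, lr=0.01):
    """Estimate the top principal component one sample at a time (Oja's rule)."""
    rng = np.random.default_rng(1)
    w = rng.standard_normal(dim)
    w /= np.linalg.norm(w)
    for x in stream:
        w += lr * (x @ w) * x      # single stochastic update per data point
        w /= np.linalg.norm(w)     # project back to the unit sphere
    return w

rng = np.random.default_rng(0)
true_dir = np.array([3.0, 1.0]) / np.sqrt(10.0)
# Synthetic stream: most variance lies along true_dir, plus isotropic noise.
data = (rng.standard_normal((20000, 1)) * 3.0) @ true_dir[None, :] \
     + rng.standard_normal((20000, 2)) * 0.3
w = oja_top_component(data, dim=2)
print(abs(w @ true_dir))  # close to 1 when aligned with the true direction
```

The appeal for "big data" is visible in the loop: memory is O(dim) and each iteration touches one sample, so the estimate improves as data streams by without ever materializing a covariance matrix.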

**Speaker Biography**: Raman Arora received his B.E. degree from NSIT, Delhi, India, in 2001, and M.S. and Ph.D. degrees from the University of Wisconsin-Madison in 2005 and 2009, respectively. He worked as a Research Associate at University of Washington, Seattle, from 2009 to 2011 and was a visiting researcher at Microsoft Research (MSR) during the summer of 2011. He is currently a Postdoctoral Researcher at the Toyota Technological Institute at Chicago. His research interests include online learning, large-scale machine learning, speech recognition and statistical signal processing.

February 7, 2013

It is challenging to write software that fits the energy budget of low-power devices, but it is even more difficult to preserve the security and sanity of the system at the same time.

Recent years have witnessed a proliferation of low-power embedded devices with power budgets ranging from a few milliwatts down to microwatts. The capabilities of embedded systems continue to improve dramatically as their size shrinks; however, improvements in battery density and energy harvesting have failed to follow a Moore's Law of their own. Thus, energy remains a formidable bottleneck for trustworthy and low-power embedded computing. Instead of trying to create hardware with ideal energy proportionality, my research explores the effectiveness of software techniques that bend digital abstractions in order to reduce energy consumption while protecting program security and semantics.

My talk will cover three research contributions that unleash energy otherwise squandered on communication, storage, and time keeping:

- TARDIS [USENIX Security’12], which provides a trustworthy notion of time to low-power systems without clocks by exploiting the decay properties of SRAM,
- CCCP [USENIX Security’09], which offers an energy-efficient storage alternative to the local non-volatile storage by relying on cryptographic backscatter radio communication,
- Half-Wits [USENIX FAST’11], which allows for operating an embedded system at below-spec supply voltages in order to reduce energy consumption while maintaining storage reliability using software-only coding algorithms.
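The TARDIS idea above can be caricatured in a few lines: if SRAM cells lose their state at a roughly known rate once power is cut, the fraction of decayed cells encodes elapsed time. The simulation below assumes, purely for illustration, an exponential decay model with a hypothetical time constant `TAU`; the real system characterizes actual SRAM decay curves empirically:

```python
import math
import random

TAU = 10.0  # hypothetical mean decay time of one SRAM cell, in seconds

def simulate_decay(n_cells, elapsed, rng):
    """Fraction of cells that lost their value after `elapsed` seconds
    without power, under an assumed exponential decay model."""
    decayed = sum(1 for _ in range(n_cells)
                  if rng.expovariate(1.0 / TAU) < elapsed)
    return decayed / n_cells

def estimate_elapsed(fraction):
    """Invert fraction = 1 - exp(-t / TAU) to recover elapsed time t."""
    return -TAU * math.log(1.0 - fraction)

rng = random.Random(42)
frac = simulate_decay(n_cells=100_000, elapsed=5.0, rng=rng)
print(round(estimate_elapsed(frac), 2))  # approximately 5.0
```

The point is that a clockless device can bound how long it was powered off using nothing but the contents of its own uninitialized memory.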

**Speaker Biography**: Mastooreh (Negin) Salajegheh is a postdoctoral research associate at the University of Virginia working with Kevin Skadron. She received her Ph.D. degree in 2012 from the University of Massachusetts Amherst, under the supervision of Kevin Fu. During her internship with Jie Liu’s Sensing and Energy Research Group (SERG) at Microsoft Research, Negin researched trustworthy operation of Near Field Communication (NFC). She has received the Outstanding Synthesis Project award (sponsored by Yahoo) for her work on probabilistic storage. Negin was one of the top four finalists in the UMass Innovation Challenge. Her research focuses on low-power and trustworthy operation of pervasive computers, energy management, and probabilistic storage. Her research goal is to make low-power computers (especially batteryless ones) more secure and energy-aware. During her PhD studies, Negin served as the co-chair of the CS Women group at UMass Amherst for a year and has attended several outreach events for girls in IT.

February 12, 2013

Making sense of digital information is a growing problem in almost every domain, ranging from scientists needing to stay current with new research, to companies aiming to provide the best service for their customers. What is common to such “Big Data” problems is not only the scale of the data, but also the complexity of the human processes that continuously interact with digital environments and generate new data.

Focusing on learning problems that arise in retrieval and recommender systems, I will show how we can develop principled approaches that explicitly model the process of continuously learning with humans in the loop while improving system utility. As one example, I will present the linear submodular bandits problem, which jointly addresses the challenges of selecting optimally diversified recommendations and balancing the exploration/exploitation tradeoff when personalizing via user feedback. More generally, I will show how to integrate the collection of training data with the user’s use of the system in a variety of applications, ranging from long-term optimization of personalized recommender systems to disambiguation within a single search session.
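The diversification half of this problem reduces, at each round, to greedily maximizing a submodular utility. The toy below uses a coverage-style utility over hypothetical articles and topics; the bandit aspect (learning the utility from user feedback while balancing exploration and exploitation) is omitted:

```python
# Greedy selection under a coverage-style submodular utility: each article
# covers a set of topics, and an article's marginal gain is the number of
# topics it covers that the chosen set does not yet cover. The articles and
# topics here are made up for illustration.
articles = {
    "a": {"politics", "economy"},
    "b": {"politics"},
    "c": {"sports", "economy"},
    "d": {"tech"},
}

def greedy_diverse(articles, k):
    chosen, covered = [], set()
    for _ in range(k):
        candidates = sorted(set(articles) - set(chosen))
        best = max(candidates, key=lambda a: len(articles[a] - covered))
        chosen.append(best)
        covered |= articles[best]
    return chosen

print(greedy_diverse(articles, k=3))  # → ['a', 'c', 'd']
```

Because the utility is submodular, this greedy rule is provably near-optimal, which is what makes it a tractable building block inside an online learning loop.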

**Speaker Biography**: Yisong Yue is a postdoctoral researcher in the Machine Learning Department and the iLab at Carnegie Mellon University. His research interests lie primarily in machine learning approaches to structured prediction and interactive systems, with an application focus on problems pertaining to information systems. He received a Ph.D. from Cornell University and a B.S. from the University of Illinois at Urbana-Champaign. He is the author of the SVM-map software package for optimizing mean average precision using support vector machines. His current research focuses on machine learning approaches to diversified retrieval and interactive information retrieval.

February 14, 2013

The Domain Name System (DNS) is a critical component of the Internet. DNS provides the ability to map human memorable and human readable domain names to machine-level IP addresses and other records. These mappings lie at the heart of the Internet’s success and are essential for the majority of core Internet applications and protocols.

The critical nature of DNS often makes it the target of direct cyber-attacks and other forms of abuse. Cyber-criminals rely heavily upon the reliability and scalability of the DNS protocol to serve as an agile platform for their illicit network operations. For example, modern malware and Internet fraud techniques rely upon the DNS to locate their remote command-and-control (C&C) servers through which new commands from the attacker are issued, to serve as exfiltration points for information stolen from victims’ computers, and to manage subsequent updates to their malicious toolset.

In this talk I will discuss new research that addresses problems in the area of DNS-based detection of illicit operations. In detail, I will elaborate on methods that quantify and track dynamically changing reputations for DNS domains based on passive network measurements. Next, I will discuss two new research systems that enable the creation of an early warning platform for DNS. These early warning systems enable the research community to identify emerging threats (e.g., new botnets and malware infections) across the DNS hierarchy in a more timely manner.

**Speaker Biography**: Manos Antonakakis received his engineering diploma in 2004 from the University of the Aegean, Department of Information and Communication Systems Engineering. From November 2004 to July 2006, he worked as a guest researcher at the National Institute of Standards and Technology (NIST-DoC) in the area of wireless ad hoc network security, at the Computer Security Division. In May 2012 he received his PhD in computer science from the Georgia Institute of Technology under Professor Wenke Lee’s supervision. He currently works at Damballa as the Sr. Director of Research. Dr. Antonakakis’s research interests are in network and computer security; his active research projects include attack attribution, ISP/cellular traffic analysis, DNS data mining, botnet metrics, DNS caching, and DNS reputation systems.

February 19, 2013

A major challenge in machine learning is to reliably and automatically discover hidden structure in data with little or no human intervention. Many of the core statistical estimation problems of this type are, in general, provably intractable for both computational and information-theoretic reasons. However, much progress has been made over the past decade or so to overcome these hardness barriers by focusing on realistic cases that rule out the intractable instances. In this talk, I’ll describe a general computational approach for correctly estimating a wide class of statistical models, including Gaussian mixture models, Hidden Markov models, Latent Dirichlet Allocation, Probabilistic Context Free Grammars, and several more. The scope of the approach extends beyond the purview of previous algorithms, and it leads both to new theoretical guarantees for unsupervised learning and to fast, practical algorithms for large-scale data analysis.
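The talk's estimators are spectral and tensor-based; the underlying method-of-moments idea can be seen in its simplest possible instance, an equal-weight mixture of two unit-variance Gaussians, where the first two moments alone determine the component means. This is a textbook illustration, not the talk's algorithm:

```python
import math
import random

def fit_two_gaussians(xs):
    """Method of moments for 0.5*N(a,1) + 0.5*N(b,1).

    The mean is (a+b)/2 and the variance is 1 + (a-b)^2/4, so the first two
    empirical moments pin down {a, b} without computing any likelihood."""
    n = len(xs)
    m1 = sum(xs) / n
    var = sum((x - m1) ** 2 for x in xs) / n
    delta = math.sqrt(max(var - 1.0, 0.0))  # |a - b| / 2
    return m1 - delta, m1 + delta

rng = random.Random(0)
a, b = -2.0, 3.0
xs = [rng.gauss(rng.choice([a, b]), 1.0) for _ in range(200_000)]
print(fit_two_gaussians(xs))  # close to (-2.0, 3.0)
```

The estimator never iterates over latent assignments, which is the escape hatch from the hardness of likelihood-based estimation; the spectral methods in the talk generalize this idea to higher-order moments of much richer models.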

**Speaker Biography**: Daniel Hsu is a postdoc at Microsoft Research New England. Previously, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta; and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.

February 21, 2013

Our online communications are plagued by increasing threats to security and privacy. Sophisticated surveillance technologies can compromise user privacy, and the insecurity of network protocols threatens the safety of our critical infrastructure. In this talk, I argue that network science can play an important role in cybersecurity by illustrating how understanding and manipulating structural properties of networks can inform the design of trustworthy communication systems.

First, I will discuss how network structure can be leveraged to detect and isolate malicious (Sybil) accounts in online social networks. The SybilInfer system that I developed uses this approach by exploiting differences in mixing properties between benign accounts and malicious accounts. SybilInfer demonstrates how graph theoretic machine learning techniques can be applied to security problems. Second, I will discuss how specially designed network structures can help protect users’ privacy by enabling them to communicate anonymously. The ShadowWalker system that I developed for anonymous communication is built around a novel network topology, which is both fast mixing and inherently verifiable. This allows ShadowWalker to scale to millions of users while being resilient to attacks on user privacy. Finally, I will conclude by highlighting the potential of leveraging complex network structures in a broad range of security and privacy problems.
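The mixing-time intuition behind this style of Sybil detection can be made concrete with a toy graph: two well-connected communities joined by a single attack edge. Short random walks started from an honest node place very little probability mass on the Sybil side, which is the structural signal such systems exploit. This is only the intuition, not SybilInfer's actual Bayesian inference:

```python
def walk_distribution(adj, start, steps):
    """Exact distribution of a `steps`-step uniform random walk from `start`."""
    p = {start: 1.0}
    for _ in range(steps):
        q = {}
        for node, mass in p.items():
            for nb in adj[node]:
                q[nb] = q.get(nb, 0.0) + mass / len(adj[node])
        p = q
    return p

# Honest clique {0..9}, Sybil clique {10..19}, one attack edge 0--10.
adj = {u: [v for v in range(10) if v != u] for u in range(10)}
adj.update({u: [v for v in range(10, 20) if v != u] for u in range(10, 20)})
adj[0].append(10)
adj[10].append(0)

p = walk_distribution(adj, start=5, steps=4)
sybil_mass = sum(m for node, m in p.items() if node >= 10)
print(round(sybil_mass, 3))  # short walks from honest nodes rarely cross over
```

Walks within the honest region mix quickly, while crossing into the Sybil region requires traversing the rare attack edge — the asymmetry that separates benign from malicious accounts.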

**Speaker Biography**: Prateek Mittal is a postdoctoral scholar in Electrical Engineering and Computer Sciences at the University of California, Berkeley. His research focuses on building secure and privacy-preserving systems, drawing on techniques from applied cryptography, distributed systems, large scale machine learning and network science. His work has influenced the design of widely-used systems such as the Tor network. He received the M.E. Van Valkenburg graduate research award for outstanding doctoral research, the Rambus Computer Engineering fellowship, and the ACM CCS 2008 outstanding paper award. He earned his M.S. and Ph.D. in Electrical and Computer Engineering at the University of Illinois at Urbana-Champaign.

February 26, 2013

Predictions in modern statistical inference problems can increasingly be understood in terms of discrete structures such as arrangements of objects in computer vision, phonemes in speech recognition, parses in natural language processing, or molecular structures in computational biology. For example, in image scene understanding one needs to jointly predict discrete semantic labels for every pixel, e.g., whether it describes a person, bicycle, bed, etc. In a fully probabilistic treatment, all possible alternative assignments are considered, which requires estimating exponentially many structures with their respective weights. To relax this exponential complexity we describe two different approaches: dual decomposition (e.g., convex belief propagation) and prediction under random perturbations. The second approach leads us to a new approximate inference framework based on max-statistics, which capture long-range interactions, in contrast to the current framework of dual decomposition, which relies on pseudo-probabilities. We demonstrate the effectiveness of our approaches on different complex tasks, outperforming state-of-the-art results in scene understanding, depth estimation, semantic segmentation and phoneme recognition.
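The perturbation approach is rooted in the Gumbel-max trick: adding independent Gumbel noise to the score of each assignment and taking the argmax produces an exact sample from the corresponding Gibbs distribution. The single-variable sketch below illustrates this; the talk's contribution concerns structured models, where perturbing every joint assignment is infeasible and low-order max-statistics are used instead:

```python
import math
import random
from collections import Counter

def gumbel_max_sample(scores, rng):
    """Argmax of (score + Gumbel noise) is an exact sample from softmax(scores)."""
    noisy = {k: s - math.log(-math.log(rng.random()))
             for k, s in scores.items()}
    return max(noisy, key=noisy.get)

rng = random.Random(0)
scores = {"person": 2.0, "bicycle": 1.0, "bed": 0.0}
counts = Counter(gumbel_max_sample(scores, rng) for _ in range(50_000))
# Empirical frequencies match softmax(scores): about 0.665, 0.245, 0.090.
print({k: round(v / 50_000, 2) for k, v in counts.items()})
```

Sampling thus reduces to repeated MAP computation on perturbed scores — attractive exactly when efficient maximization (but not summation) is available for the structured model.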

**Speaker Biography**: Tamir Hazan received his PhD from the Hebrew University of Jerusalem (2009) and is currently a research assistant professor at TTI Chicago. His research develops efficient methods for reasoning about complex models. His work on random perturbations was presented in the machine learning best papers track at AAAI 2012. His research also includes the primal-dual norm-product belief propagation algorithm, which received a best paper award at UAI 2008. Currently, these techniques outperform the state of the art in several computer vision and language processing tasks.

March 5, 2013

Many optimization problems in networking and distributed computing have been approached with only combinatorial techniques, despite the prevalence of mathematical programming in modern approximation algorithms. In this talk I will discuss new algorithms for some problems in network design, namely graph spanners, that use advanced convex relaxation-based techniques rather than the traditional combinatorial approaches.

Graph spanners are subgraphs that approximately preserve distances between nodes, and are a fundamental building block in distributed computing in addition to being useful in areas as diverse as efficiently solving linear systems and property testing of functions. Our new techniques have allowed us to solve fundamental algorithmic problems on spanners that have been open for almost 20 years.
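For readers new to spanners, the classical greedy construction (which predates the convex-relaxation techniques of the talk) makes the object concrete: scan the edges and keep one only if the spanner built so far cannot already connect its endpoints within the stretch bound. A minimal unweighted version:

```python
from collections import deque

def bfs_dist(adj, s, t):
    """Shortest-path distance in an unweighted graph (inf if disconnected)."""
    dist, frontier = {s: 0}, deque([s])
    while frontier:
        u = frontier.popleft()
        if u == t:
            return dist[u]
        for v in adj.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                frontier.append(v)
    return float("inf")

def greedy_spanner(edges, stretch):
    """Keep an edge only if the spanner so far cannot already connect its
    endpoints within `stretch` hops -- the classical greedy construction."""
    adj, kept = {}, []
    for u, v in edges:
        if bfs_dist(adj, u, v) > stretch:
            adj.setdefault(u, []).append(v)
            adj.setdefault(v, []).append(u)
            kept.append((u, v))
    return kept

# Complete graph on 6 vertices, stretch 3: far fewer edges survive.
edges = [(u, v) for u in range(6) for v in range(u + 1, 6)]
spanner = greedy_spanner(edges, stretch=3)
print(len(spanner), "of", len(edges), "edges kept")
```

By construction every discarded edge is detoured within the stretch bound, so all pairwise distances are preserved up to a factor of 3 while the edge count drops sharply.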

I will also demonstrate the broader relevance of our techniques, including to more applied problems such as designing good overlays for autonomous systems running the Interior Border Gateway Protocol (IBGP). Finally, I will discuss some recent progress on complexity-theoretic lower bounds for approximating similar problems, based on new constructions of probabilistically-checkable proofs.

**Speaker Biography**: Michael Dinitz received his A.B. in computer science from Princeton University, with a certificate in applied and computational mathematics. He received his Ph.D. in computer science from Carnegie Mellon University, where he was advised by Anupam Gupta. Since then he has been a postdoctoral fellow in the Faculty of Mathematics and Computer Science at the Weizmann Institute of Science in Israel, where he is hosted by Robert Krauthgamer and David Peleg. He is broadly interested in algorithms, with a particular focus on approximation algorithms for problems inspired by networking and distributed computing.

March 7, 2013

Recent technological advances have allowed us to collect genomic data on an unprecedented scale, with the promise of revealing genetic variants, genes, and pathways disrupted in many diseases. However, identifying relevant genetic factors and ultimately unraveling the genetics of complex traits from such high dimensional data have presented significant statistical and computational challenges. In this domain, where spurious associations and lack of statistical power are major factors, I have developed machine learning methods based on robust probabilistic models that leverage biological structure, prior knowledge, and diverse sources of evidence. In particular, I have developed Bayesian methods that utilize structured information, including gene networks and cellular pathways, and transfer learning to propagate information across related genes and diseases. In this talk, I will discuss the application of such models to diverse traits including human disease and cellular traits derived from RNA-sequencing. Using this approach, I demonstrate an improvement in uncovering genetic variants affecting complex traits, along with the interactions and intermediate cellular mechanisms underlying genetic effects.

**Speaker Biography**: Alexis Battle is a PhD candidate in Computer Science at Stanford University. Her research in computational biology focuses on machine learning and probabilistic models for the genetics of complex traits. Alexis received her BS in Symbolic Systems from Stanford, and spent four years as a member of the technical staff at Google. She is the recipient of an NSF Graduate Research Fellowship and an NIH NIGMS training award.

March 12, 2013

The “human element” in data introduces fundamental algorithmic challenges such as protecting individual privacy, ensuring fairness in classification, and designing algorithms that are robust to population outliers and adversarial conditions. This talk focuses on the fruitful interplay between privacy and robustness.

We will first give a simple and practical method for computing the principal components of a data set under the strong notion of differential privacy. Our algorithm always guarantees privacy while its utility analysis circumvents an impossibility result in differential privacy using a realistic assumption central to robust principal component analysis.
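One standard recipe for private PCA — shown here only to make the problem concrete, and not the algorithm or analysis of the talk — is input perturbation: add symmetric Gaussian noise to the empirical second-moment matrix and eigendecompose the result. Calibrating `noise_scale` to a formal (eps, delta) guarantee is omitted:

```python
import numpy as np

def noisy_top_component(X, noise_scale, rng):
    """Input-perturbation sketch of private PCA: add symmetric Gaussian noise
    to the second-moment matrix, then take its top eigenvector."""
    d = X.shape[1]
    A = X.T @ X / X.shape[0]
    noise = rng.standard_normal((d, d)) * noise_scale
    A_noisy = A + (noise + noise.T) / 2.0   # keep the matrix symmetric
    vals, vecs = np.linalg.eigh(A_noisy)
    return vecs[:, -1]                       # eigenvector of largest eigenvalue

rng = np.random.default_rng(0)
true_dir = np.array([1.0, 0.0, 0.0])
# Synthetic data with dominant variance along the first coordinate.
X = rng.standard_normal((5000, 3)) * np.array([3.0, 0.5, 0.5])
v = noisy_top_component(X, noise_scale=0.1, rng=rng)
print(abs(v @ true_dir))  # close to 1: the top component survives the noise
```

When the eigengap of the data dominates the noise, the noisy top eigenvector stays useful — the kind of realistic assumption (here, a spiked spectrum) that utility analyses for private PCA lean on.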

We then turn to the problem of analyzing massive data sets using few linear measurements—an algorithmic paradigm known as “linear sketching”. Here we prove a “no free lunch” theorem showing that the computational efficiency of linear sketches comes at the cost of robustness. Indeed, we show that efficient linear sketches cannot guarantee correctness on adaptively chosen inputs. Our result builds on a close connection to privacy and can be seen as a novel “reconstruction attack” in the privacy setting.
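A concrete example of a linear sketch is the classical AMS sketch for estimating the squared Euclidean norm of a stream of updates: every measurement is a random ±1 linear functional of the data vector, so updates can be folded in one at a time. (The “no free lunch” theorem concerns what happens when later inputs are chosen adversarially after observing such a sketch's answers.) An illustrative version:

```python
import random

def make_sign_matrix(dim, rows, seed):
    """Random +/-1 measurement matrix defining the linear sketch."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(rows)]

def sketch(signs, x):
    """The sketch is linear in x: each row is one random measurement."""
    return [sum(s * xi for s, xi in zip(row, x)) for row in signs]

def estimate_f2(sk):
    """Average of squared measurements is an unbiased estimate of ||x||^2."""
    return sum(v * v for v in sk) / len(sk)

signs = make_sign_matrix(dim=1000, rows=200, seed=0)
x = [0.0] * 1000
for i, delta in [(3, 5.0), (7, -2.0), (3, 1.0)]:  # a stream of updates
    x[i] += delta
print(estimate_f2(sketch(signs, x)))  # approximates ||x||^2 = 36 + 4 = 40
```

The analysis assumes the input is fixed before the random signs are drawn; an adaptive adversary who queries the sketch and then chooses the next update breaks exactly that assumption, which is the crux of the robustness result.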

**Speaker Biography**: Moritz Hardt is a post-doctoral researcher in the theory group at IBM Research Almaden. He completed his PhD in Computer Science at Princeton University in 2011, advised by Boaz Barak. His current work focuses on the algorithmic foundations of privacy, fairness and robustness in statistical data analysis. His general research areas include algorithms, machine learning and complexity theory.

March 26, 2013

Modern cryptography has become a huge success and has enabled many important applications such as public-key encryption systems, zero-knowledge proofs and secure multi-party computation protocols. However, the security of these applications relies crucially on the assumption that we have access to truly uniform random bits. On the other hand, in reality high quality random sources may be hard to obtain, and natural random sources tend to be biased in unknown ways. The situation is made even worse by possible side channel attacks, which may leak information to an adversary and thus compromise even initially uniform secret keys. Therefore, fundamental questions are whether we can use imperfect randomness in cryptographic applications without jeopardizing their security, and whether we can protect against information leakage in secret keys. These topics have been the focus of intensive recent research.

In this talk I will describe a general and powerful tool to deal with imperfect randomness and information leakage, called “randomness extractors”. A randomness extractor is an algorithm that transforms an imperfect random source into nearly uniform random bits. I will discuss different kinds of randomness extractors and their applications in cryptography. I will then describe my new approaches in constructing these objects, which have led to significant progress and almost optimal solutions to two long standing open problems: privacy amplification with an active adversary [Maurer-Wolf 1997] and explicit extractors for independent weak random sources [Chor-Goldreich 1985]. Interestingly, my techniques also established new connections between randomness extractors and other seemingly unrelated problems such as leader election in distributed computing.
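The simplest randomness extractor, von Neumann's trick for i.i.d. biased coin flips, conveys the flavor: pair up the flips and output one bit per unequal pair, so the bias cancels exactly. (The extractors in the talk handle vastly weaker sources, where this trick fails.)

```python
import random

def von_neumann_extract(bits):
    """Turn i.i.d. biased coin flips into exactly uniform bits: look at pairs,
    emit the second bit of each unequal pair ('01' -> 1, '10' -> 0),
    and discard '00'/'11'. P(01) = P(10), so the output is unbiased."""
    out = []
    for b1, b2 in zip(bits[::2], bits[1::2]):
        if b1 != b2:
            out.append(b2)
    return out

rng = random.Random(0)
biased = [1 if rng.random() < 0.8 else 0 for _ in range(100_000)]  # 80% ones
uniform = von_neumann_extract(biased)
print(round(sum(uniform) / len(uniform), 2))  # close to 0.50 despite the bias
```

The price is that most input bits are discarded, and the method breaks entirely if flips are correlated — hinting at why extracting from general weak sources (let alone several independent ones) is so much harder.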

**Speaker Biography**: Xin Li is a postdoctoral researcher in Computer Science and Engineering at the University of Washington. He received his B.S. and M.S. in computer science from Tsinghua University, China and his Ph.D. in computer science from the University of Texas at Austin, where he was advised by David Zuckerman. His current work focuses on randomness extractors and their applications in leakage-resilient cryptography and information-theoretic security. His general research interests include pseudorandomness, cryptography, complexity theory and fault-tolerant distributed computing. He is the recipient of a Simons postdoctoral fellowship.

April 4, 2013

Recent advances in 3D depth cameras such as Microsoft Kinect sensors have created many opportunities for a more natural way of interacting with computers and with people across distances. A key enabling technology is human body-language understanding by computer. Only after the computer understands what a user is doing can it respond or act in a natural way back to the user, or capture the essential information and relay it to remote users. This has always been an active research field in computer vision but has proven formidably difficult with video cameras. 3D depth cameras such as Microsoft Kinect sensors allow the computer to directly sense the third dimension (depth) of the users and the environment, alleviating the burden of human body-language understanding by computer. In this talk, I will describe our recent work in using such commodity depth cameras for hand gesture recognition, human action recognition, facial expression tracking, engagement detection, human body modeling, and immersive teleconferencing.

**Speaker Biography**: Zhengyou Zhang received the B.S. degree in electronic engineering from Zhejiang University, Hangzhou, China, in 1985, the M.S. degree in computer science from the University of Nancy, Nancy, France, in 1987, the Ph.D. degree in computer science from the University of Paris XI, Paris, France, in 1990, and the Doctorate of Science (Habilitation à diriger des recherches) from the University of Paris XI, Paris, France, in 1994. He is a Principal Researcher with Microsoft Research, Redmond, WA, USA, and the Research Manager of the “Multimedia, Interaction, and Communication” group. Before joining Microsoft Research in March 1998, he was with INRIA (French National Institute for Research in Computer Science and Control), France, for 11 years and was a Senior Research Scientist from 1991. In 1996-1997, he spent a one-year sabbatical as an Invited Researcher with the Advanced Telecommunications Research Institute International (ATR), Kyoto, Japan. He is also an Affiliate Professor with the University of Washington, Seattle, WA, USA, and an Adjunct Chair Professor with Zhejiang University, Hangzhou, China. He has published over 200 papers in refereed international journals and conferences, and has coauthored the following books: 3-D Dynamic Scene Analysis: A Stereo Based Approach (Springer-Verlag, 1992); Epipolar Geometry in Stereo, Motion and Object Recognition (Kluwer, 1996); Computer Vision (Chinese Academy of Sciences, 1998, 2003, in Chinese); Face Detection and Adaptation (Morgan and Claypool, 2010); and Face Geometry and Appearance Modeling (Cambridge University Press, 2011). He has given a number of keynotes in international conferences and invited talks in universities.

Dr. Zhang is a Fellow of the Institute of Electrical and Electronic Engineers (IEEE), the Founding Editor-in-Chief of the IEEE Transactions on Autonomous Mental Development, an Associate Editor of the International Journal of Computer Vision, an Associate Editor of Machine Vision and Applications, and an Area Editor of the Journal of Computer Science and Technology. He served as Associate Editor of the IEEE Transactions on Pattern Analysis and Machine Intelligence from 2000 to 2004, an Associate Editor of the IEEE Transactions on Multimedia from 2004 to 2009, an Associate Editor of the International Journal of Pattern Recognition and Artificial Intelligence from 1997 to 2009, among others. He served as Area Chair, Program Chair, or General Chair of a number of international conferences, including recently a Program Co-Chair of the International Conference on Multimedia and Expo (ICME), July 2010, a Program Co-Chair of the ACM International Conference on Multimedia (ACM MM), October 2010, a Program Co-Chair of the ACM International Conference on Multimodal Interfaces (ICMI), November 2010, and a General Co-Chair of the IEEE International Workshop on Multimedia Signal Processing (MMSP), October 2011. He recently served as a founding Chair of a new track “Technical Briefs” of the ACM SIGGRAPH Asia Conference, Nov. 28 – Dec. 1st, 2012. More information is available at http://research.microsoft.com/~zhang/

*Distinguished Lecturer*

April 9, 2013

Computation, sensing, mechanisms, algorithmic frameworks and advanced manufacturing technologies are advancing at a rapid pace. Their application to medical therapies holds significant promise for improving outcomes and quality of life for patients suffering from serious disease. The current medical, economic and regulatory environment poses significant challenges to bringing innovative computational and robotic products into medical practice. This talk will discuss the early development experience of the da Vinci surgical system, changes in the medical, regulatory and economic environment, and point to some areas of fruitful technology research.

**Speaker Biography**: Gary S. Guthart, Ph.D. joined Intuitive Surgical in April 1996. Effective January 2010, Dr. Guthart was appointed as Chief Executive Officer. In July 2007, he was promoted to President. Prior to that, in February 2006, Dr. Guthart assumed the role of Chief Operating Officer. Prior to joining Intuitive, Dr. Guthart was part of the core team developing foundation technology for computer enhanced-surgery at SRI International (formerly Stanford Research Institute). Dr. Guthart is currently a member of the Board of Directors of Affymetrix, Inc. He received a B.S. in Engineering from the University of California, Berkeley and an M.S. and a Ph.D. in Engineering Science from the California Institute of Technology.

Dr. Guthart brings to the Board business, operating, financial and scientific experience. His service as the CEO of Intuitive enables the Board to perform its oversight function with the benefits of management’s perspectives on the business.

April 16, 2013

Heterogeneous, large volumes of Electronic Health Record (EHR) data are becoming available in many healthcare institutes. Such EHR data from millions of patients serve as a huge collective memory of doctors and patients over time. How can we leverage that EHR data to help caregivers and patients make better decisions? How can we efficiently use these data to help clinical and pharmaceutical research?

My research focuses on developing large-scale algorithms and systems for healthcare analytics. First, I will describe our healthcare analytics research framework, which provides an intuitive collaboration mechanism for interdisciplinary teams and an efficient computation framework for handling heterogeneous patient data. Second, I will present a core component of this framework, patient similarity learning, which answers the following questions:

- How can we leverage physician feedback in the similarity computation?
- How can we integrate multiple patient similarity measures into a single consistent similarity measure?
- How can we present similarity results and obtain user feedback in an intuitive, interactive way?
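
The first two questions above can be illustrated with a minimal sketch. Assume each patient pair is scored by several similarity measures; a convex combination integrates them, and a hypothetical multiplicative-weights step (not the speaker's actual algorithm) shifts weight toward measures that agree with physician feedback:

```python
import math

def combined_similarity(weights, sims):
    # Integrate several patient-similarity measures (each in [0, 1])
    # into a single consistent score via a convex combination.
    return sum(w * s for w, s in zip(weights, sims))

def feedback_update(weights, sims, label, eta=0.5):
    # Hypothetical feedback step: a physician marks a patient pair as
    # similar (label=1.0) or dissimilar (label=0.0); measures whose
    # score disagrees with the label lose weight (multiplicative
    # weights with squared-error loss), then weights are renormalized.
    losses = [(s - label) ** 2 for s in sims]
    weights = [w * math.exp(-eta * l) for w, l in zip(weights, losses)]
    z = sum(weights)
    return [w / z for w in weights]

w = [1 / 3, 1 / 3, 1 / 3]
# Measure 0 keeps agreeing with the physician; its weight grows.
for sims, label in [([0.9, 0.2, 0.5], 1.0), ([0.1, 0.8, 0.6], 0.0)]:
    w = feedback_update(w, sims, label)
assert w[0] == max(w)
```

The names and the loss function here are illustrative assumptions; the talk's own similarity-learning formulation is not specified in the abstract.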

I will illustrate the effectiveness of our proposed algorithms for patient similarity learning in several different healthcare scenarios. I will demonstrate an interactive visual analytic system that allows users to cluster patients and to refine the underlying patient similarity metric. Finally, I will highlight future work that I am pursuing.

**Speaker Biography**: Jimeng Sun is a research staff member in the Healthcare Analytics Department of the IBM T.J. Watson Research Center. He leads research projects in medical informatics, especially the development of large-scale predictive and similarity analytics for healthcare applications. He has an extensive track record in data mining research, specializing in healthcare analytics, big-data analytics, similarity metric learning, social network analysis, predictive modeling, and visual analytics. He has published over 70 papers and filed over 20 patents (4 granted). He received the ICDM best research paper award in 2007, the SDM best research paper award in 2007, and the KDD dissertation runner-up award in 2008.

Sun received his B.S. and M.Phil. in Computer Science from the Hong Kong University of Science and Technology in 2002 and 2003, and his Ph.D. in Computer Science from Carnegie Mellon University in 2007, specializing in data mining on streams, graphs, and tensor data.

April 18, 2013

High-throughput sequencing has transformed biomedical research; however, making sense of this resource requires sophisticated computational tools. We have developed the Galaxy platform, which makes these tools available to a wide audience of researchers while ensuring that analyses are reproducible and can be communicated transparently. In this talk I will introduce the Galaxy platform and describe challenges and solutions for scalability and sustainability, including the Galaxy Cloudman project, which makes this accessible analysis environment available on cloud resources, enabling researchers without dedicated compute resources to perform large-scale data analysis cost-effectively. Additionally, I will introduce the new Galaxy visual analytics framework, which provides an extensible solution for tightly integrated data analysis and visualization.

**Speaker Biography**: James Taylor is an Assistant Professor in the departments of Biology and Mathematics & Computer Science at Emory University. He is one of the original developers of the Galaxy platform for data analysis, and his group continues to work on extending the Galaxy platform. His group also works on understanding genomic and epigenomic regulation of gene transcription through integrated analysis of functional genomic data. James received a PhD in Computer Science from Penn State University, where he was involved in several vertebrate genome projects and the ENCODE project.

April 19, 2013

Consider a client who wishes to outsource database storage to a cloud service. As her privacy and the service’s liability are key concerns (e.g., for medical data), she ought to first encrypt her data under a key unknown to the server. However, standard encryption schemes hide *all information* about the data and would not allow the service to provide any useful search functionality to the client. To address this, we propose a new framework of “efficiently searchable encryption” that balances two seemingly contradictory goals: (1) allowing efficient search (nearly the same efficiency as for unencrypted data), and (2) meeting strong and rigorous security models. My talk will specifically focus on applying our framework to support range queries over encrypted data using *order-preserving encryption* (OPE), a concept introduced in the database community (Agrawal et al., SIGMOD ’04). We propose the first rigorous security model that an OPE scheme should achieve. We then show how to construct a highly practical OPE scheme achieving it by using a sampling algorithm for the hypergeometric distribution. We then explain why, curiously, we still do not obtain a good understanding of the data privacy our “secure” OPE scheme provides, and present some additional security models and results to address this. In the end, we get an essentially complete characterization of the data privacy our scheme provides. We conclude by discussing some future directions.
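
The core idea can be seen in a toy sketch: an OPE scheme is, ideally, a random order-preserving injection from the plaintext domain into a larger ciphertext range, and monotonicity is exactly what lets the server answer range queries on ciphertexts alone. The materialized table below is only feasible for toy domains; the actual scheme samples the mapping lazily via the hypergeometric distribution.

```python
import random

def toy_ope_keygen(domain_size, range_size, seed):
    # A random order-preserving injection from [0, domain_size) into
    # [0, range_size): pick domain_size distinct range points and sort.
    # Sorting distinct samples yields a strictly increasing mapping.
    rng = random.Random(seed)
    return sorted(rng.sample(range(range_size), domain_size))

table = toy_ope_keygen(100, 10_000, seed=42)
enc = lambda m: table[m]  # ciphertext of plaintext m

# Order is preserved, so plaintexts in [20, 40] correspond exactly to
# ciphertexts in [enc(20), enc(40)] -- a server-side range query.
assert all(enc(m) < enc(m + 1) for m in range(99))
hits = [m for m in range(100) if enc(20) <= enc(m) <= enc(40)]
assert hits == list(range(20, 41))
```

This illustrates only the functionality; the talk's contribution is the rigorous security modeling of what such a mapping does and does not hide.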

**Speaker Biography**: Adam O’Neill received his Ph.D. in Computer Science from the Georgia Institute of Technology in 2010 (with Alexandra Boldyreva). Since then, he has held postdoctoral research appointments at the University of Texas at Austin (with Brent Waters) and at Boston University (with Ran Canetti and Leonid Reyzin), where he is currently employed. His research is in cryptography, with a focus on resolving the tension between theory and practice.

*Distinguished Lecturer*

April 25, 2013

Software-Defined Networking (SDN) has all the signs of a fad: massive hype in the trade rags, an increasing number of academic papers, and widespread confusion about what SDN really means (Isn’t it just OpenFlow? Isn’t it all about centralization? Isn’t it all just hot air?). This talk will try to dispel some of this confusion by discussing how SDN arises from a few natural abstractions for the network control plane. The talk will end with some speculations about the broader implications of SDN.

**Speaker Biography**: Scott Shenker spent his academic youth studying theoretical physics but soon gave up chaos theory for computer science. Continuing to display a remarkably short attention span, his research over the years has wandered from performance modeling and networking to game theory and economics. Unable to focus on any single topic, his current research projects include cluster programming models, genomic sequence aligners, software-defined networking, and Internet architecture. Unable to hold a steady job, he currently splits his time between the U.C. Berkeley Computer Science Department and the International Computer Science Institute.

May 7, 2013

How, as computer scientists, can we help Marlin find Nemo in a very short time? The answer may lie in the Probabilistic Method. Pioneered by Paul Erdős more than seven decades ago, the Probabilistic Method has provided a number of widely used tools in combinatorics. However, many of these tools are non-constructive: while they show the existence of certain combinatorial structures, they do not tell us how to find them. One such powerful technique is László Lovász’s famous Local Lemma (LLL). The LLL has a diverse range of applications, including breakthroughs in packet routing, a variety of theorems in graph coloring, and a host of other surprising applications where probability appears to have no role. While the original LLL was non-constructive, a series of works has sought to devise algorithmic versions of it. In the first half of my talk, I will describe our work in this regime, which covers many compelling applications that were out of reach of previous methods. Using our technique, we resolve a fascinating open question in the area of resource allocation and scheduling while giving the first Monte Carlo approximation algorithms for a large variety of combinatorial problems.
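
The algorithmic-LLL line of work mentioned above is exemplified by the Moser-Tardos resampling algorithm; the sketch below shows it for SAT. This is background for the talk, not the speaker's own (more general) technique: start from a uniformly random assignment and, while any clause is violated, resample the variables of one violated clause. Under the LLL condition (each k-clause shares variables with fewer than roughly 2^k/e others), this terminates in expected polynomial time.

```python
import random

def moser_tardos_sat(n_vars, clauses, seed=0):
    # Moser-Tardos style constructive Lovász Local Lemma for SAT.
    # A clause is a tuple of signed literals: +3 means x3, -3 means NOT x3.
    rng = random.Random(seed)
    assign = [rng.random() < 0.5 for _ in range(n_vars)]

    def violated(clause):
        # A clause is violated when no literal in it is satisfied.
        return not any(assign[abs(l) - 1] == (l > 0) for l in clause)

    while True:
        bad = [c for c in clauses if violated(c)]
        if not bad:
            return assign
        for l in bad[0]:  # resample the variables of one violated clause
            assign[abs(l) - 1] = rng.random() < 0.5

clauses = [(1, 2, 3), (-1, 2, -4), (3, -2, 4), (-3, -4, 1)]
sol = moser_tardos_sat(4, clauses)
assert all(any(sol[abs(l) - 1] == (l > 0) for l in c) for c in clauses)
```

The striking feature, mirrored in the talk's theme, is that a non-constructive existence proof becomes a simple Las Vegas algorithm.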

Probability appears as a surprising element in many of the above problems; in other applications, however, probability is inherent in the data. Applications such as information retrieval, data integration and cleaning, text analytics, and social network analysis deal with voluminous amounts of uncertain data. Probability theory can play a critical role in designing scalable algorithms under such uncertainty. In the second part of my presentation, I will discuss the basic problem of ranking, and how, using probability generating functions, we can give a unified approach to finding top-k results over probabilistic databases.
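
The generating-function idea can be sketched in miniature. Assume tuples exist independently with known probabilities and are sorted by score; then for a tuple preceded by tuples with probabilities p_i, the probability it ranks in the top k is its own existence probability times the probability that fewer than k predecessors exist, read off from the polynomial product over predecessors. (This is a simplified illustration; the unified framework in the talk covers more general ranking semantics.)

```python
def rank_le_k_prob(p_self, preceding_probs, k):
    # Pr[tuple exists and has rank <= k]
    #   = p_self * sum_{j < k} [x^j] prod_i (1 - p_i + p_i * x),
    # where the coefficient of x^j is the probability that exactly j
    # of the higher-scoring tuples exist. Build the product iteratively.
    poly = [1.0]
    for p in preceding_probs:
        nxt = [0.0] * (len(poly) + 1)
        for j, c in enumerate(poly):
            nxt[j] += c * (1 - p)   # predecessor absent
            nxt[j + 1] += c * p     # predecessor present
        poly = nxt
    return p_self * sum(poly[:k])

# Tuple with p=0.9, two higher-scoring tuples with p=0.5 each, k=2:
# it ranks in the top 2 unless both predecessors exist,
# so the answer is 0.9 * (1 - 0.25) = 0.675.
assert abs(rank_le_k_prob(0.9, [0.5, 0.5], 2) - 0.675) < 1e-9
```

Each update is O(1) per coefficient, so the whole computation is polynomial rather than the naive exponential sum over possible worlds, which is the point of the technique.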

**Speaker Biography**: Barna Saha is a Senior Member of Research at AT&T Shannon Laboratories. She obtained her Ph.D. in Computer Science from the University of Maryland, College Park in 2011. Her research interests span theoretical computer science and database management systems, including the design and analysis of algorithms, probabilistic methods, graph theory, and big-data analysis. She is the co-winner of the best paper award at VLDB 2009, one of the premier conferences in databases, for her work on probabilistic ranking algorithms. She is also the recipient of the Dean's Fellowship Award (University of Maryland, 2010) for her dissertation research.

May 8, 2013

We live in the era of “Big Data,” where we are surrounded by a dauntingly vast amount of information. How can we help people quickly navigate the data and acquire useful knowledge from it? Probabilistic models provide a general framework for analyzing, predicting and understanding the underlying patterns in the large-scale and complex data.

Using a new recommender system as an example, I will show how we can develop principled approaches to advance two important directions in probabilistic modeling—exploratory analysis and scalable inference. First, I will describe a new model for document recommendation. This model not only gives better recommendation performance, but also provides new exploratory tools that help users navigate the data. For example, a user can adjust her preferences and the system can adaptively change the recommendations. Second, building a recommender system like this requires learning the probabilistic model from large-scale empirical data. I will describe a scalable approach for learning a wide class of probabilistic models, a class that includes our recommendation model, from massive data.
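
The scalable-learning idea described above can be caricatured in a few lines. The essence of stochastic approaches to learning from massive data is to replace full passes over the dataset with cheap noisy updates on small minibatches, with a decaying step size guaranteeing convergence. The example below is a deliberately simple stand-in (estimating a mean), not the talk's actual inference algorithm, which applies the same subsampling idea to the parameters of a probabilistic model:

```python
import random

def stochastic_mean_estimate(data, steps=2000, batch=10, seed=0):
    # Robbins-Monro style stochastic update: each step draws a small
    # minibatch, forms a noisy estimate, and blends it in with a
    # decaying step size rho = 1/t. Never touches the full dataset
    # in any single update, which is what makes it scale.
    rng = random.Random(seed)
    est = 0.0
    for t in range(1, steps + 1):
        noisy = sum(rng.sample(data, batch)) / batch
        rho = 1.0 / t
        est = (1 - rho) * est + rho * noisy
    return est
```

With rho = 1/t this reduces to a running average of the minibatch estimates, so it converges to the true mean while each step costs only O(batch) work.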

**Speaker Biography**: Chong Wang is a project scientist in the Machine Learning Department at Carnegie Mellon University. He received his Ph.D. from Princeton University in 2012, advised by David Blei. His research lies in probabilistic graphical models and their applications to real-world problems. He has won several awards, including a best student paper award at KDD 2011, a notable paper award at AISTATS 2011, and a best student paper award honorable mention at NIPS 2009. He received the Google Ph.D. Fellowship in machine learning and the Siebel Scholar Fellowship. His thesis was nominated by Princeton University for the ACM Doctoral Dissertation Award in 2012.

May 9, 2013

In this talk, I present a system for the automatic synthesis of efficient algorithms specialized for a particular memory hierarchy and a set of storage devices. The developer provides two independent inputs: 1) an algorithm that ignores memory-hierarchy and external-storage aspects; and 2) a description of the target memory hierarchy, including its topology and parameters. Our system automatically synthesizes memory-hierarchy- and storage-device-aware algorithms from those specifications, for tasks such as joins and sorting. The framework is extensible and allows developers to quickly synthesize custom out-of-core algorithms as new storage technologies become available. This research draws on work from a number of areas, such as systems (data locality principles, cost-based optimization, secondary-storage algorithms), programming languages, and compilers (domain-specific languages, program synthesis).

This is joint work with Yannis Klonatos, Andres Noetzli, Andrej Spielmann, and Viktor Kuncak.
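
The two-input idea above can be sketched as a toy planner: given only the parameters of the hierarchy (here, just available memory), choose between an in-memory sort and an external merge sort and estimate the number of passes over the data. This is an illustrative stand-in, not the actual system, which synthesizes code from much richer hierarchy descriptions:

```python
import math

def choose_sort(n_records, record_bytes, mem_bytes):
    # Toy cost-based specialization from a memory-hierarchy parameter:
    # if the data fits in memory, sort in place (one pass); otherwise
    # plan an external merge sort and estimate its passes.
    data_bytes = n_records * record_bytes
    if data_bytes <= mem_bytes:
        return ("in-memory sort", 1)
    runs = math.ceil(data_bytes / mem_bytes)          # initial sorted runs
    fan_in = max(2, mem_bytes // record_bytes // 2)   # crude merge fan-in
    passes = 1 + math.ceil(math.log(runs, fan_in))    # run creation + merges
    return ("external merge sort", passes)

# 100 KB of data in a 1 MB budget fits in memory; 1 GB does not.
assert choose_sort(1_000, 100, 1_000_000) == ("in-memory sort", 1)
assert choose_sort(10_000_000, 100, 1_000_000)[0] == "external merge sort"
```

The cost model here is deliberately crude; the point is only that the algorithm choice falls out of the device description rather than being hard-coded.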

**Speaker Biography**: Christoph Koch is a professor of Computer Science at EPFL, specializing in data management. Until 2010, he was an Associate Professor in the Department of Computer Science at Cornell University. Prior to that, from 2005 to 2007, he was an Associate Professor of Computer Science at Saarland University. Earlier, he obtained his PhD in Artificial Intelligence from TU Vienna and CERN (2001), was a postdoctoral researcher at TU Vienna and the University of Edinburgh (2001-2003), and an assistant professor at TU Vienna (2003-2005). He obtained his Habilitation degree in 2004. He has won Best Paper Awards at PODS 2002, ICALP 2005, and SIGMOD 2011, an Outrageous Ideas and Vision Paper Award at CIDR 2013, a Google Research Award (in 2009), and an ERC Grant (in 2011). He is a PI of the billion-euro EU FET Flagship Human Brain Project. He (co-)chaired the program committees of DBPL 2005, WebDB 2008, and ICDE 2011, and was PC vice-chair of ICDE 2008 and ICDE 2009. He has served on the editorial board of ACM Transactions on Internet Technology as well as on numerous program committees. He currently serves as PC co-chair of VLDB 2013 and Editor-in-Chief of PVLDB.