Spring 2012

Student Seminar

January 21, 2012

Data gathered from multi-month to multi-year battery-powered environmental monitoring sensor networks present numerous challenges. This thesis explores three problems. First, design issues related to loading, storing, and data integrity are studied in detail. An end-to-end system addressing these tasks by decoupling deployment-specific and deployment-independent phases is presented. This solution places a strong emphasis on the ability to trace the origins of every collected measurement for provenance and scientific reproducibility. Second, we explore the problem of assigning accurate global timestamps to the measurements collected using the motes’ local clocks. In deployments lacking a persistent gateway, a data-driven approach is employed to assign timestamps to within 10 parts per million. Building on these experiences, we developed a novel low-power approach to accurately timestamp measurements in the presence of random, frequent mote reboots. The system is tested in simulation and on a real deployment in a Brazilian rain forest. It achieves an accuracy on the order of seconds for more than 99% of measurements, even when a global clock source is missing for days to months. Lastly, this thesis explores a generic data-driven approach to reducing communication costs in order to increase network lifetime. In particular, spatial correlation among sampling stations is leveraged to adaptively retrieve data from a small subset of informative sensors rather than from all instrumented locations. Soil temperature data collected every half hour for four months from 50 locations are used to evaluate this system. The method achieves a factor-of-two reduction in collected data with a median error of 0.06 °C and a 95th percentile error of 0.325 °C.
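To make the adaptive-sampling idea concrete, the sketch below is illustrative only (hypothetical shapes and function names, not the thesis's actual method): it fits per-station least-squares predictors so that readings from a small active subset of stations can reconstruct the rest.

```python
import numpy as np

def fit_reconstructor(train: np.ndarray, active: list[int]):
    """train: (timesteps x stations) matrix of past readings.
    Fits one least-squares model per station that predicts its reading
    from the readings of the 'active' subset, exploiting the spatial
    correlation among nearby soil-temperature probes."""
    A = np.c_[train[:, active], np.ones(len(train))]   # active readings + bias
    coef, *_ = np.linalg.lstsq(A, train, rcond=None)   # (k+1) x stations

    def reconstruct(active_readings: np.ndarray) -> np.ndarray:
        # Estimate every station's reading from the sampled subset.
        return np.r_[active_readings, 1.0] @ coef

    return reconstruct
```

In each collection round, only the active stations would transmit; the remaining readings are imputed, trading a bounded reconstruction error for roughly proportional communication savings.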

This work is part of the Life Under Your Feet project developed at the Hopkins interNetworking Research Group (HiNRG) and eScience Labs at the Johns Hopkins University.

Speaker Biography: Jayant Gupchup received his Bachelor’s degree in Computer Engineering from Mumbai University in 2003. From September 2003 to July 2005, he worked at the Inter-University Centre for Astronomy and Astrophysics (IUCAA). In the fall of 2005, he began his Ph.D. in the Department of Computer Science at the Johns Hopkins University. His research focuses on data management in long-term environmental monitoring networks, and he is jointly advised by Dr. Andreas Terzis and Prof. Alex Szalay. In 2007, he worked at the Microsoft Bay Area Research Center as a summer intern. He received a master’s degree in Applied Mathematics and Statistics in May 2010 under the supervision of Prof. Carey Priebe. After his Ph.D., he will join the parallel data warehousing team at Microsoft.

Student Seminar

January 27, 2012

Scientists are increasingly finding themselves in a paradoxical situation: on a never-ending quest to collect data, they are collecting more data than they can handle. This growth in data comes from three main areas: better instrumentation, improved simulations, and increased data sharing between scientists. The work in this talk describes a variety of techniques to support the exploration and analysis of large scientific datasets, and is drawn from experiences working with two different domain sciences: computational fluid dynamics and estuarine science.

First, we discuss the JHU Turbulence Database Cluster, an environment for the exploration of turbulent flows. We provide a web-service interface for accessing the complete space-time history of a Direct Numerical Simulation. This service gives researchers from around the world the tools needed for spatial and temporal exploration of the simulation. In this talk, we will discuss the overall system design and prototypical applications. We will also discuss the details of implementation, including hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, and localizing computation through data partitioning.
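As an illustration of the kind of hierarchical spatial indexing involved, a Z-order (Morton) key interleaves coordinate bits so that a standard B-tree over the keys keeps spatially adjacent voxels near each other in the index and on disk. This is a generic sketch of the technique; the production system's scheme may differ in detail.

```python
def morton3d(x: int, y: int, z: int) -> int:
    """Interleave the bits of three 21-bit coordinates into a single
    63-bit Z-order key; points close in 3-D space tend to receive
    nearby keys, clustering their data together on disk."""
    key = 0
    for bit in range(21):
        key |= ((x >> bit) & 1) << (3 * bit)
        key |= ((y >> bit) & 1) << (3 * bit + 1)
        key |= ((z >> bit) & 1) << (3 * bit + 2)
    return key
```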

We will also discuss work to improve queries among multiple scientific data sets from the Chesapeake Bay as part of the CBEO project. We developed new data indexing and query processing tools that improve the efficiency of comparing, correlating, and joining data in non-convex regions. We use computational geometry techniques to automatically characterize the space from which data are drawn, partition the region based on that characterization, and then create an index from the partitions. In the case of the Chesapeake Bay, our technique ensures that all data from a given tributary (e.g., the Potomac River) will occupy contiguous regions of the index, which makes the data from these regions contiguous on disk.

Speaker Biography: Eric Perlman received a B.S. in Computer Engineering in 2002 from the University of California, Santa Cruz. He enrolled in the Computer Science Ph.D. program at Johns Hopkins University in 2003. He has worked on large distributed file systems during internships at both IBM Almaden Research Center in 2003 and Google in 2004.

At Johns Hopkins, Eric’s work primarily focused on improving access to large scientific data. He helped build the infrastructure for three interdisciplinary research projects: the JHU Turbulence Database Cluster, the Chesapeake Bay Environmental Observatory Testbed (CBEO:T), and the Open Connectome Project (OCP).

As of December 2012, Eric is working as a Bioinformatics Specialist at the Howard Hughes Medical Institute’s Janelia Farm Research Campus in Ashburn, VA. He is working with Dr. Davi Bock to build a processing pipeline for data captured using high-throughput electron microscopy.

January 31, 2012

Latent variable models are widely used in applications to automatically recover simple underlying signals from noisy high-dimensional data. The challenge in estimating such models stems from the presence of hidden (unobserved) variables, and typical local search methods used for this task (e.g., EM) generally lack basic performance guarantees such as statistical consistency and computational efficiency. In this talk, I will discuss recent developments in linear algebraic methods for learning certain classes of latent variable models, including parameter estimation for hidden Markov models, and structure learning of latent variable tree models. Unlike the local search heuristics, the proposed linear algebraic methods come with statistical and computational efficiency guarantees under mild conditions on the data distribution. Central to the new techniques is the characterization of the models in terms of low-order moments (e.g., averages, correlations) of observable variables, which are readily estimated from data.
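As one concrete instance of this moment-based characterization (a standard identity from the spectral-HMM literature, sketched here rather than quoted from the talk), the distribution of a hidden Markov model with transition matrix $$T$$, observation matrix $$O$$, and initial distribution $$\pi$$ factors into observable operators:

$$\Pr[x_1, \ldots, x_t] = \mathbf{1}^\top A_{x_t} \cdots A_{x_1} \pi, \qquad A_x = T \, \mathrm{diag}(O_{x,\cdot}).$$

Because the operators $$A_x$$ can be recovered, up to a change of basis, from pairwise and triple-wise co-occurrence statistics of observations, parameter estimation reduces to linear algebra (SVD and matrix inversion) on empirical moments rather than non-convex local search.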

This talk is based on various joint works with Anima Anandkumar, Kamalika Chaudhuri, Sham Kakade, Le Song, and Tong Zhang.

Speaker Biography: Daniel Hsu is a postdoctoral researcher at Microsoft Research New England. Previously, he was a postdoc with the Department of Statistics at Rutgers University and the Department of Statistics at the University of Pennsylvania from 2010 to 2011, supervised by Tong Zhang and Sham M. Kakade. He received his Ph.D. in Computer Science in 2010 from UC San Diego, where he was advised by Sanjoy Dasgupta, and his B.S. in Computer Science and Engineering in 2004 from UC Berkeley. His research interests are in algorithmic statistics and machine learning.

February 2, 2012

The completion of the Human Genome Project provided a stepping stone for building catalogs of common human genetic variation. These catalogs, in turn, enabled the search for associations between common variants and complex human traits and diseases by performing Genome-Wide Association Studies (GWAS). GWAS have been successful in discovering thousands of statistically significant, reproducible genotype-phenotype associations. However, the discovered variants (genotypes) explain only a small fraction of the phenotypic variance in the population for most human traits. In contrast, the heritability, defined as the proportion of phenotypic variance explained by all genetic factors, was estimated to be much larger for those same traits using indirect population-based estimators. This gap is referred to as ‘missing heritability’.

Mathematically, heritability is defined by considering a function $$F$$ mapping a set of (Boolean) variables $$(x_1, \ldots, x_n)$$, representing genotypes, and additional environmental or ‘noise’ variables $$\epsilon$$, to a single (real or discrete) variable $$z$$, representing phenotype. We use the variance decomposition of $$F$$, separating the linear term, corresponding to additive (narrow-sense) heritability, from higher-order terms, representing genetic interactions (epistasis), to explore several explanations for the ‘missing heritability’ mystery. We show that genetic interactions can significantly bias upwards current population-based heritability estimators, creating a false impression of ‘missing heritability’. We offer a solution to this problem by providing a novel consistent estimator based on unrelated individuals. We also use the Wright-Fisher process from population genetics theory to develop and apply a novel power correction method for inferring the relative contributions of rare and common variants to heritability. Finally, we propose a novel algorithm for estimating the different variance components (beyond additive) of heritability from GWAS data.
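In the standard quantitative-genetics notation implicit here (a sketch of the usual definitions, not necessarily the talk's exact formulation), the variance decomposition reads

$$V_P = \mathrm{Var}(z) = V_A + V_I + V_E, \qquad h^2 = \frac{V_A}{V_P}, \qquad H^2 = \frac{V_A + V_I}{V_P},$$

where $$V_A$$ collects the linear (additive) terms of $$F$$, $$V_I$$ the higher-order interaction (epistatic) terms, and $$V_E$$ the environmental noise; $$h^2$$ is the narrow-sense and $$H^2$$ the broad-sense heritability. The ‘missing heritability’ gap is then the difference between population-based estimates of $$h^2$$ and the variance explained by the variants actually discovered in GWAS.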

Speaker Biography: Or Zuk is a postdoctoral researcher at the Broad Institute of MIT and Harvard, in Eric Lander’s group. Previously, he completed a Ph.D. in Computer Science and Applied Mathematics at the Weizmann Institute of Science under the supervision of Eytan Domany. His main research interests are in computational and statistical problems arising from applications in genomics and genetics.

February 7, 2012

The stereotypical view of computing, and hence computer security, is a landscape filled with laptops, desktops, smartphones and servers; general purpose computers in the proper sense. However, this is but the visible tip of the iceberg. In fact, most computing today is invisibly embedded into systems and environments that few of us would ever think of as computers. Indeed, applications in virtually all walks of modern life, from automobiles to medical devices, power grids to voting machines, have evolved to rely on the same substrate of general purpose microprocessors and (frequently) network connectivity that underlie our personal computers. Yet along with the power of these capabilities come the same potential risks as well. My research has focused on understanding the scope of such problems by exploring vulnerabilities in the embedded environment, how they arise, and the shape of the attack surfaces they expose. In this talk, I will particularly discuss recent work on two large-scale platforms: modern automobiles and electronic voting machines. In each case, I will explain how implicit or explicit assumptions in the design of the systems have opened them to attack. I will demonstrate these problems, concretely and completely, including arbitrary control over election results and remote tracking and control of an unmodified automobile. I will explain the nature of these problems, how they have come to arise, and the challenges in hardening such systems going forward.

Speaker Biography: Stephen Checkoway is a Ph.D. candidate in Computer Science and Engineering at UC San Diego and before that he received his B.S. from the University of Washington. He is also a member of the Center for Automotive Embedded Systems Security, a collaboration between UC San Diego and the University of Washington. Checkoway’s research spans a range of applied security problems including the security of embedded and cyber-physical systems, electronic voting, and memory safety vulnerabilities.

Distinguished Lecturer

February 9, 2012

Like it or not, dynamically typed scripting languages are far more popular than programming languages that originate in academic laboratories. Most scripting languages are untyped and have a flexible semantics for primitive operations. A lot of programmers find these attributes appealing and use scripting languages for just these reasons. Reflective programmers are also beginning to notice, however, that when small untyped scripts grow old (and/or large), maintaining them becomes difficult. A lack of types means that programmers must (re)discover critical pieces of design information every time they wish to change or improve a program.

My team and I have therefore embarked on a research program that aims to solve this large and growing engineering problem with a combination of experience and tools from academic programming language research. In this talk I will present the first milestone: Typed Racket, an explicitly typed extension of Racket, a Scheme-like scripting language. Using Typed Racket, programmers can add types to a code base on a module by module basis. Two factors make such transformations straightforward and useful. First, the type system is highly flexible, combining numerous academic results with the novel notion of occurrence typing. Second, the type system uses Findler’s higher-order contracts to make the integration of typed and untyped modules safe.
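The flavor of occurrence typing carries over to other gradually typed languages. The sketch below uses Python's type annotations rather than Typed Racket syntax, purely to illustrate the idea that a predicate test refines a variable's type in each branch:

```python
from typing import Union

def halve(v: Union[int, str]) -> float:
    # A type checker narrows v based on the isinstance test:
    # v is an int in the then-branch and a str in the else-branch.
    # This branch-sensitive refinement is the essence of occurrence typing.
    if isinstance(v, int):
        return v / 2
    else:
        return len(v) / 2
```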

Collaborators: Sam Tobin-Hochstadt, Ryan Culpepper

Speaker Biography: Matthias Felleisen, a Trustee Professor at Northeastern University, explores all aspects of programming languages and program design. His current research involves work on behavioral software contracts, gradual typing of scripting languages, theories and practice of language interoperability, module systems, and language extensibility. In addition to his research, Felleisen engages in educational outreach work. For the past 15 years, he and his team have worked with middle schools, high schools, after-school programs, and college faculty on a radical reform of the introductory programming curriculum. A lot of his educational work inspires his research, and many research efforts end up improving his educational work. The ACM inducted Felleisen as a Fellow in 2007. In 2009, he received the ACM Karl V. Karlstrom Award.

Student Seminar

February 9, 2012

As some scientific projects begin to collect data well into the petabyte range, the technology used to analyze and retrieve such data must evolve appropriately. When presented with such a large volume, prior data delivery techniques, such as the time-honored system of simply copying all data to local storage prior to analysis, become impractical. This thesis studies the use of databases to host large-scale scientific data. We describe early systems, such as SkyServer and CASJobs, and discuss an implementation of automatic provenance within them. Additionally, we discuss various observations regarding query patterns collected from such systems over time. The thesis concludes with a discussion of TileDB, a novel distributed computing framework using independent shared-nothing databases. In addition to features common to many distributed systems, such as automatic parallelization and incremental fault tolerance, TileDB combines several features that are fairly novel within the field, such as dynamic allocation of both data and work; long-term, adaptive data curation; and transparent integration with existing database deployments.

Speaker Biography: Nolan Li first attended Johns Hopkins University as an undergraduate in 1998. After some deliberation, he chose to pursue Computer Science as a course of study. He has been working with the Sloan Digital Sky Survey, as an employee or student, since 2001. He is the author of several tools currently in use by the project, including CASJobs and Open SkyQuery. His research focuses on large-scale data science, and his papers have been presented at conferences such as ADASS, Microsoft eScience, and Supercomputing.

February 14, 2012

The web is our primary gateway to many critical services and offers a powerful platform for emerging applications. However, security of the web platform is a major concern. Web vulnerabilities are pervasive, yet techniques for finding and defending against them are at a nascent stage. How can we build a secure web platform for the future? Towards this goal, my work investigates a broad range of solutions, including new browser abstractions, languages, verification techniques, and analyses. I present two examples of my work which systematically introduce formal foundations into today’s web platform. First, I present a system for automatically finding vulnerabilities in JavaScript applications using dynamic symbolic execution. This system has discovered several previously unknown flaws in widely used web applications without raising any false positives. A key feature that enables this precision is our new decision procedure for handling the complex string operations found extensively in real-world code. In the second example, I present a new type system for ensuring integrity of web application code by construction. The type system is designed to be bolted onto existing web frameworks and has been adopted in the compilation infrastructure for commercial web applications such as Google+. These examples demonstrate that by developing new formal techniques which can be realized in practical systems, we can build a web platform that is secure by construction.

Speaker Biography: Prateek Saxena is a PhD student in Computer Science at UC Berkeley. His research interests are in computer security and its areas of intersection with programming languages, formal methods, compilers and operating systems. His current work focuses on software and web security. He is the recipient of several awards, including the Symantec Research Fellowship Award in 2011 and the AT&T Best Applied Security Paper Award in 2010.

Don P. Giddens Lecture

February 16, 2012

The Internet presents compelling reach, cost and capacity properties that drive more and more forms of communication to use it as their network of choice. These properties stem from a few core design principles underlying the Internet, such as best effort packet switching and routing, and end-to-end reliability, congestion control and addressing: in essence, keep it simple in the middle and smart at the edge.

New applications bring new demands that clash with the core principles: high performance reliability for large file transfers; low latency interactivity for VoIP calls; point-to-multipoint reliable transport and delivery for live TV; “perfect” reliability and timeliness for remote surgery and remote manipulation; and intrusion resiliency required by clouds, SCADA, and other critical infrastructure systems.

This talk surveys a decade-long journey developing the “right” network paradigm to address the new demands. We present the overlay architecture and associated protocols that were invented along the way, moving more of the intelligence to the middle of the network. The architecture is implemented by a flexible, software-based overlay router that is augmented with protocols tailored to new demands, while using the Internet as is. We discuss a reliable protocol that dramatically reduces average latency, a second protocol tailored to VoIP, and a third protocol tailored to live TV.

A new cloud networking service actualizes this network paradigm at scale. Having been deployed globally over the last two years, this service is already changing the world of live TV transport and delivery.

Speaker Biography: Yair Amir, who joined the Johns Hopkins Computer Science faculty in 1995, seeks to invent high performance, survivable and secure distributed systems that make a difference. He holds a B.Sc. and M.Sc. from the Technion, Israel Institute of Technology, and a Ph.D. from the Hebrew University of Jerusalem, Israel.

February 21, 2012

Much of machine-learning research is about discovering patterns—building intelligent agents that learn to predict the future accurately from historical data. While this paradigm has been extremely successful in numerous applications, complex real-world problems such as content recommendation on the Internet often require the agents to learn to act optimally through autonomous interaction with the world they live in, a problem known as reinforcement learning.

Using a news recommendation module on Yahoo!’s front page as a running example, the majority of the talk focuses on the special case of contextual bandits, which have gained substantial interest recently due to their broad applications. We will highlight a fundamental challenge known as the exploration/exploitation tradeoff, present a few newly developed algorithms with strong theoretical guarantees, and demonstrate their empirical effectiveness for personalizing content recommendation at Yahoo!. At the end of the talk, we will also briefly summarize our earlier work on provably data-efficient algorithms for more general reinforcement-learning problems modeled as Markov decision processes.
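A representative algorithm from this line of work is LinUCB, which scores each arm (e.g., a candidate article) by a ridge-regression reward estimate plus an exploration bonus that shrinks as the arm is tried in similar contexts. The sketch below is a minimal, illustrative implementation of the disjoint variant, not production code:

```python
import numpy as np

class LinUCB:
    def __init__(self, n_arms: int, dim: int, alpha: float = 1.0):
        self.alpha = alpha                                # exploration strength
        self.A = [np.eye(dim) for _ in range(n_arms)]     # per-arm design matrix
        self.b = [np.zeros(dim) for _ in range(n_arms)]   # per-arm reward vector

    def choose(self, x: np.ndarray) -> int:
        """Pick the arm with the highest upper confidence bound for context x."""
        scores = []
        for A, b in zip(self.A, self.b):
            A_inv = np.linalg.inv(A)
            theta = A_inv @ b                             # ridge estimate
            scores.append(theta @ x + self.alpha * np.sqrt(x @ A_inv @ x))
        return int(np.argmax(scores))

    def update(self, arm: int, x: np.ndarray, reward: float) -> None:
        """Fold the observed (context, reward) pair into the chosen arm's model."""
        self.A[arm] += np.outer(x, x)
        self.b[arm] += reward * x
```

Exploitation alone would lock onto whichever article looked best early on; the bonus term keeps every arm being probed in contexts where its estimate is still uncertain.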

Speaker Biography: Lihong Li is a Research Scientist in the Machine Learning group at Yahoo! Research. He obtained a PhD degree in Computer Science from Rutgers University, advised by Michael Littman. Before that, he obtained an MSc degree from the University of Alberta, advised by Vadim Bulitko and Russell Greiner, and a BE from Tsinghua University. In the summers of 2006-2008, he enjoyed interning at Google, Yahoo! Research, and AT&T Shannon Labs, respectively. His main research interests are in machine learning with interaction, including reinforcement learning, multi-armed bandits, online learning, active learning, and their numerous applications on the Internet. He is the winner of an ICML’08 Best Student Paper Award, a WSDM’11 Best Paper Award, and an AISTATS’11 Notable Paper Award.

February 23, 2012

Today’s computing landscape relies critically on high-performance, resilient, and secure networked systems. As application workloads, security requirements, and organizational policy constraints change over time, the network infrastructure needs to evolve in order to meet the increasing need for performance, reliability, and security.

Most of this evolution in today’s networks happens via the deployment of specialized network appliances or “middleboxes”. Unfortunately, middleboxes today are expensive and closed systems, with few or no hooks for extensibility. Furthermore, they are acquired from independent vendors and deployed as standalone devices with little cohesiveness in how the ensemble of middleboxes is managed. As network requirements continue to grow in both scale and variety, this bottom-up approach puts the network infrastructure on a trajectory of growing device sprawl with corresponding escalation in capital and management costs.

To address this challenge, this talk describes the design and implementation of a new architecture for middlebox deployments that systematically explores opportunities for consolidation both at the level of building individual middleboxes and in managing a network of middleboxes. I will show that such consolidation introduces new opportunities for innovation and performance optimization that do not exist in current middlebox deployments.

Speaker Biography: Vyas Sekar is a Research Scientist at Intel Labs. He is currently a member of the Intel Science and Technology Center for Secure Computing located at the University of California, Berkeley. His research interests lie at the intersection of networking, security, and systems. He received his Ph.D. from the Computer Science Department at Carnegie Mellon University in 2010, where he was co-advised by Michael Reiter and Hui Zhang. Before that, he obtained his B. Tech from the Computer Science Department at the Indian Institute of Technology Madras where he was awarded the President of India Gold Medal.

February 28, 2012

Over the last 10 years, probabilistic graphical models have become one of the cornerstones of modern machine learning. As science and engineering increasingly turn to them to solve difficult learning and data analysis problems, it is becoming more and more important to provide tools that make advanced statistical inference accessible to a broad community.

In this talk I will discuss my work on probabilistic programming, a recent generalization of graphical models: rather than marry statistics with graph theory, probabilistic programming marries Bayesian probability with computer science. It allows modelers to specify a complex generative process using syntax that resembles a modern programming language, easily defining distributions using language features such as data structures, recursion, or native libraries.

Scalable inference is the key challenge. I will discuss how we are leveraging concepts from programming language theory (such as monads, anonymous functions, and memoization), as well as compiler design (program analysis, source code transformations, nonstandard interpretations, and code factorization) to both define and implement universal inference algorithms for probabilistic programming languages. I will illustrate the results on a variety of tasks, with emphasis on inversion problems from geophysics.
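To illustrate the paradigm itself (a toy example in Python, not any particular probabilistic language from the talk), a model is just an executable generative program, and a universal, if naive, inference method such as rejection sampling conditions it on observations by re-running it:

```python
import random

def flip(p: float) -> bool:
    return random.random() < p

def model() -> tuple[bool, bool]:
    # A two-line generative program: cloudy weather makes rain likelier.
    cloudy = flip(0.5)
    rain = flip(0.8) if cloudy else flip(0.1)
    return cloudy, rain

def prob_cloudy_given_rain(n: int = 100_000) -> float:
    # Rejection sampling: run the program repeatedly, keep the executions
    # consistent with the observation rain=True, and average.
    kept = [cloudy for cloudy, rain in (model() for _ in range(n)) if rain]
    return sum(kept) / len(kept)

print(prob_cloudy_given_rain())   # about 0.89 = 0.4 / (0.4 + 0.05)
```

The research challenge is replacing this brute-force conditioning with efficient, general-purpose inference, which is where the program-analysis and compilation techniques above come in.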

Speaker Biography: David Wingate is a research scientist at MIT with a joint appointment in the Laboratory for Information Decision Systems and the Computational Cognitive Science group. He obtained a B.S. and M.S. in Computer Science from Brigham Young University, and a Ph.D. in Computer Science from the University of Michigan.

His research focuses on the intersection of probabilistic modeling (with an emphasis on Bayesian nonparametrics), machine learning, dynamical systems modeling, reinforcement learning and probabilistic programming.

March 1, 2012

Second-generation DNA sequencing instruments are improving rapidly and are now capable of sequencing hundreds of billions of nucleotides of data, enough to cover the human genome hundreds of times over, in about a week for a few thousand dollars. Consequently, sequencing is now a common tool in the study of molecular biology, genetics, and human disease. But with these developments comes a problem: growth in per-sequencer throughput (currently about 4-fold per year) is drastically outpacing growth in computer speed. As the throughput gap widens over time, the crucial research bottlenecks are increasingly computational: computing, storage, labor, power.

I will describe two methods and four open source software tools (Bowtie, Bowtie 2, Crossbow and Myrna) that tackle this throughput gap using novel algorithms and approaches from data-intensive computing. These tools build primarily on two insights. First, that the Burrows-Wheeler Transform and the FM Index, previously used for data compression and exact string matching, can be extended to facilitate fast and memory-efficient alignment of DNA sequences to long reference genomes such as the human genome. Second, that those methods can be combined with MapReduce and cloud computing to solve common comparative genomics problems in a manner that addresses “big data” desiderata, including scalability, fault tolerance, and economy.
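The first insight can be made concrete in a few lines. The transform below is the textbook rotation-sorting construction; real aligners such as Bowtie derive it from a suffix array and pair it with an FM index for backward search, so this sketch shows only the transform itself:

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler Transform: the last column of the sorted matrix
    of all rotations of text + '$' (a unique, smallest sentinel)."""
    s = text + "$"
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("ACAACG"))   # -> 'GC$AAAC'
```

The payoff is that all occurrences of a query string correspond to a contiguous band of the sorted rotations, which an FM index can locate in time proportional to the query length using only a compressed representation of this single column.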

Speaker Biography: Ben Langmead is a Research Associate at the Johns Hopkins Bloomberg School of Public Health, Department of Biostatistics. He is also completing his Ph.D. in Computer Science this semester at the University of Maryland, advised by Steven L. Salzberg. He received an M.Sc. in Computer Science in 2009 from the University of Maryland, advised by Steven L. Salzberg and Mihai Pop. Before graduate school, Ben was employed at Reservoir Labs, Inc., where he developed compiler software and network intrusion detection software for parallel network processor architectures. Ben received a B.A. in Computer Science from Columbia University in 2003.

Ben’s research tackles problems at the intersection of Computer Science and Genomics. At Johns Hopkins, Ben collaborates with Biostatisticians, Biologists, and other Computer Scientists to develop methods for analyzing second-generation DNA sequencing data.

March 2, 2012

What do translating English into Chinese and compiling C++ into machine code have in common? And what are the differences that make the former so much harder for computers? How can computers learn from human translators?

This talk sketches an efficient (linear-time) “understanding + rewriting” paradigm for machine translation inspired by both human translators and compilers. In this paradigm, a source language sentence is first parsed into a syntactic tree, which is then recursively converted into a target language sentence via tree-to-string rewriting rules. In both the “understanding” and “rewriting” stages, this paradigm closely resembles the efficiency and incrementality of human processing and compiling. We will discuss these two stages in turn.

First, for the “understanding” part, we present a linear-time approximate dynamic programming algorithm for incremental parsing that is as accurate as much slower (cubic-time) chart parsers while being as fast as greedy but lossy parsers, thus getting the advantages of both worlds for the first time and achieving state-of-the-art speed and accuracy. But how do we efficiently learn such a parsing model with approximate inference from huge amounts of data? We propose a general framework for structured prediction based on the structured perceptron that is guaranteed to succeed with inexact search and works well in practice.

Next, the “rewriting” stage translates these source-language parse trees into the target language. But parsing errors from the previous stage adversely affect translation quality. An obvious solution is to use the top-k parses rather than the 1-best tree, but this helps only marginally due to the limited scope of the k-best list. We instead propose a “forest-based approach”, which translates a packed forest encoding exponentially many parses in polynomial space by sharing common subtrees. Large-scale experiments showed very significant improvements in translation quality, outperforming the leading systems in the literature. Like the “understanding” stage, the translation algorithm here is also linear-time and incremental, and thus resembles human translation.

We conclude by outlining a few future directions.

Speaker Biography: Liang Huang is a Research Assistant Professor at the University of Southern California (USC), and a Research Scientist at USC’s Information Sciences Institute (ISI). He received his PhD from the University of Pennsylvania in 2008, and worked as a Research Scientist at Google before moving to USC/ISI. His research focuses on efficient search algorithms for natural language processing, especially parsing and machine translation, as well as related structured learning problems. His work received a Best Paper Award at ACL 2008, and three Best Paper Nominations at ACL 2007, EMNLP 2008, and ACL 2010.

March 6, 2012

Data are being generated at an unprecedented pace in various scientific and engineering areas, including biomedical engineering, materials science, and social science. These data provide us with precious opportunities to reveal hidden relationships in natural or synthetic systems and to predict their functions and properties. With growing complexity, however, the data impose new computational challenges—for example, how to handle the high dimensionality, nonlinear interactions, and massive volume of the data.

To address these new challenges, I have been developing advanced Bayesian models to capture the data complexity and designing scalable algorithms to learn the models efficiently from data. In this talk, I will describe two of my recent works along this line: 1) efficient learning of novel nonparametric models on tensors to discover communities in social networks and parse noisy email networks, and 2) parallel inference for new hierarchical Bayesian models to identify rare cancer stem cells from massive single-cell tensor-valued data. I will present experimental results on real world data — demonstrating the superior predictive performance of the proposed approaches — and discuss other applications of these approaches such as in patient drug response analysis and neuroscience.

Speaker Biography: Alan Qi obtained his PhD from MIT in 2005 and worked as a postdoctoral researcher at MIT from 2005 to 2007. In 2007, he joined Purdue University as an assistant professor of Computer Science and Statistics (and Biology by courtesy). He received the A. Richard Newton Breakthrough Research Award from Microsoft Research in 2008, the Interdisciplinary Award from Purdue University in 2010, and the NSF CAREER award in 2011. His research interest lies in scalable Bayesian learning and its applications. His group not only develops sparse, nonparametric, and dynamic learning algorithms, but also collaborates with domain experts—such as biomedical researchers at Harvard and Purdue, materials scientists at MIT, and psychologists at the University of Toronto—for a wide range of scientific and engineering applications.

March 13, 2012

The great successes of the Internet and modern computing devices in making business, travel, communication, and much of life better in many ways are moderated by the increasing evidence that the foundations of the technology are vulnerable to many kinds of attacks. The entire concept of privacy is in a state of flux. While the precise meaning of “cyberwar” is debated, the military prepares for attack and defense in the medium of cyberspace. What we now call cybersecurity research has its roots in the late 1960s and early 1970s. This talk will survey the state of things, consider why the results of past research seem to have had so little uptake in modern products, and suggest a few directions for future research and policy that could lead us to a less dangerous place.

Speaker Biography: Following a 23-year career conducting cybersecurity research at the U.S. Naval Research Laboratory, Carl Landwehr spent the past twelve years funding, managing, and guiding cybersecurity research programs for the National Science Foundation (NSF), the Intelligence Advanced Research Projects Activity (IARPA) and its predecessor organizations, and the Defense Advanced Research Projects Agency (DARPA). Dr. Landwehr recently completed a four-year term as Editor-in-Chief of IEEE Security & Privacy Magazine. He has received awards for research achievement and community service from ACM SIGSAC, from the IEEE Computer Society, from the IEEE-CS TC on Security and Privacy, and from IFIP. Dr. Landwehr’s research interests include all areas of trustworthy computing. His degrees are from Yale and the University of Michigan, and he has taught classes at Purdue, the University of Maryland, Georgetown, and Virginia Tech. Since November 2011, he has worked as an independent consultant.

March 15, 2012

Security has become one of the major concerns for today’s Internet. End users, however, are slow in adopting new security technologies. Many users cannot manage security well by themselves. Ideally, security mechanisms should be as transparent as possible to the users. On the other hand, IT managers desire efficient and scalable protection mechanisms.

Towards addressing these issues, in this talk, I would like to introduce two of my efforts. First, I will present the design of NetShield, a new vulnerability signature based NIDS/NIPS, which achieves high throughput comparable to that of the state-of-the-art regular expression based systems while offering much better accuracy. In particular, we propose a candidate selection algorithm which efficiently matches thousands of vulnerability signatures simultaneously, and design a parsing transition state machine that achieves fast protocol parsing.

Second, I will talk about WebShield, a secure web proxy design that protects clients from web-based exploits by processing potentially malicious JavaScript in a sandboxed environment (shadow browser) on a middlebox. With shadow browsers, WebShield also aims to deploy client-based defenses against various classes of web attacks without client modifications.

Speaker Biography: Zhichun “ZC” Li is a research staff member at NEC Research Labs in Princeton, NJ. At NEC Labs, he currently leads the effort of designing a scalable Android app analysis framework. Before joining NEC, he received his Ph.D. in December 2009 from Northwestern University. He earned both M.S. and B.S. degrees from Tsinghua University in China. His research interests span the areas of security, networking and distributed systems with an emphasis on smartphone security, web security, network security, social network security, cloud security, network measurement and distributed system diagnosis. Previously, he has conducted research at Microsoft Research Redmond and the International Computer Science Institute (ICSI) at UC Berkeley.

Student Seminar

March 16, 2012

Human annotators are critical for creating the necessary datasets to train statistical learning algorithms. However, there exist several limiting factors to creating large annotated datasets, such as annotation cost and limited access to qualified annotators. In recent years, researchers have investigated overcoming this data bottleneck by resorting to crowdsourcing, which is the delegation of a particular task to a large group of individuals rather than a single person, usually via an online marketplace.

This thesis is concerned with crowdsourcing annotation tasks that aid either the training, tuning, or evaluation of statistical learners, across a variety of tasks in natural language processing. The tasks reflect a spectrum of annotation complexity, from simple class label selection, through selecting textual segments from a document, to composing sentences from scratch. The annotation setups were novel as they involved new types of annotators, new types of tasks, new types of data, and new types of algorithms that can handle such data.

The thesis is divided into two main parts: the first part deals with text classification, and the second part deals with machine translation (MT).

The first part deals with two examples of the text classification task. The first is the identification of dialectal Arabic sentences and distinguishing them from standard Arabic sentences. We utilize crowdsourcing to create a large annotated dataset of Arabic sentences, which is used to train and evaluate language models for each Arabic variety. The second task is a sentiment analysis task, that of distinguishing positive movie reviews from negative ones. We introduce a new type of annotations called rationales, which complement the traditional class labels, and aid learning system parameters that generalize better to unseen data.

In the second part, we examine how crowdsourcing can be beneficial to machine translation. We start with the evaluation of MT systems, and show the potential of crowdsourcing to edit MT output. We also present a new MT evaluation metric, RYPT, that is based on human judgment, and well-suited for a crowdsourced setting. Finally, we demonstrate that crowdsourcing can be helpful in collecting translations to create a parallel dataset. We discuss a set of features that can help distinguish well-formed translations from those that are not, and we show that crowdsourcing translation yields results of near-professional quality at a fraction of the cost.

Throughout the thesis, we will be concerned with how we can ensure that collected data is of high quality, and we will employ a set of quality control measures for that purpose. Those methods will be helpful not only in detecting spammers and unfaithful annotators, but also in detecting those who are simply unable to perform the task properly, which is a more subtle form of undesired behavior.

Speaker Biography: Omar Zaidan completed his undergraduate studies at St. Lawrence University, graduating summa cum laude in 2004 with a B.Sc. in Computer Science (with Honors) and Mathematics (with Honors), and a minor in Chemistry. He then joined the Computer Science Ph.D. program at Johns Hopkins University, initially working under the supervision of Christian Scheideler, and later became affiliated with the Center for Language and Speech Processing, working first with Jason Eisner and then with Chris Callison-Burch, his thesis advisor.

At Hopkins, he was head T.A. for several courses, including Natural Language Processing and Artificial Intelligence; taught the department’s Introduction to Java summer offering in 2007; developed Z-MERT, an open source package for MT parameter tuning; created the Arabic Online Commentary dataset, consisting of over 50M words of Arabic reader commentary; and gained considerable experience with Amazon’s Mechanical Turk. He was also a member of the organizing committees for the Workshops for Machine Translation (WMT) in 2010 and 2011, and the North East Student Colloquium on Artificial Intelligence (NESCAI) in 2010.

Omar has joined Microsoft Research as a senior software engineer.

March 27, 2012

We explore online learning approaches for detecting malicious Web sites (those involved in criminal scams) using lexical and host-based features of the associated URLs. We show that this application is particularly appropriate for online algorithms, both because the training data are too large to process efficiently in batch and because the distribution of features that typify malicious URLs changes continuously. Using a real-time system we developed for gathering URL features, combined with a real-time source of labeled URLs from a large Web mail provider, we demonstrate that recently developed online algorithms can be as accurate as batch techniques, achieving daily classification accuracies up to 99% over a balanced data set.
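A minimal sketch of the flavor of online learner involved (illustrative feature names and plain logistic regression, not the exact algorithms evaluated in the work):

```python
import math
from collections import defaultdict

class OnlineLogisticRegression:
    """Streaming classifier over sparse URL features (hostname tokens,
    path tokens, host-based properties, and so on)."""
    def __init__(self, lr: float = 0.1):
        self.lr = lr
        self.w = defaultdict(float)          # weights grow with the feature space

    def predict(self, feats: dict[str, float]) -> float:
        z = sum(self.w[k] * v for k, v in feats.items())
        return 1.0 / (1.0 + math.exp(-z))    # P(malicious | URL)

    def update(self, feats: dict[str, float], label: int) -> None:
        # One stochastic-gradient step on the log-loss per labeled URL,
        # so the model keeps tracking drift in the feature distribution.
        err = label - self.predict(feats)
        for k, v in feats.items():
            self.w[k] += self.lr * err * v

model = OnlineLogisticRegression()
model.update({"tok:login": 1.0, "tld:example": 1.0}, label=1)  # hypothetical features
```

Because each labeled URL is folded in as it arrives, there is no batch retraining step, which is what makes the approach viable at the data rates described above.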

Speaker Biography: Justin Ma is a postdoc at the UC Berkeley AMPLab. Previously he was a PhD student at UC San Diego advised by Stefan Savage, Geoff Voelker, and Lawrence Saul. His primary research is in systems security. His interests include applications of machine learning to systems problems, systems for large-scale machine learning, and the impact of energy availability on computing.

March 29, 2012

Although cloud computing promises numerous benefits, including lower costs, rapid scaling, easier maintenance, and ubiquitous availability, a key challenge is how to protect users’ data in the cloud. Today, users effectively lose control of their data in the cloud, and if either the cloud infrastructure or applications are compromised, users’ privacy will be at risk. The ubiquitous concern over cloud data privacy demands a paradigm shift, such that users can retain control of their data in the cloud, and verify that the cloud providers have correctly enforced their privacy policies.

In this talk, I will describe several enabling technologies towards this vision. Specifically, I will talk about 1) how to safeguard users’ data against potentially compromised applications; 2) how to safeguard users’ data against a potentially compromised computation provider; and 3) how to safeguard users’ data against a potentially compromised storage provider. I will also talk about our ongoing effort at integrating these technologies to provide a cloud infrastructure which offers data protection at the platform level. In this way, users can benefit from the rich cloud applications without worrying about the privacy of their data; and application developers can focus on developing functionality while offloading the burden of providing security and privacy to the cloud platform.

Speaker Biography: Elaine Shi is a research scientist at University of California, Berkeley. She obtained her Ph.D. and Masters in Computer Science from Carnegie Mellon University, and her B.E. from Tsinghua University. Previously, she was also a Member of Research Staff at Palo Alto Research Center (PARC).

Elaine is broadly interested in the general area of security, privacy, and applied cryptography. In her research, she takes a unique approach that combines theoretical innovations with practical system design and implementation. Her research spans a wide range of topics, including computation on encrypted data, privacy-preserving data mining, system security, sensor network and vehicular network security, usable authentication, secure storage systems, and so on. She has published more than 35 scholarly publications, and her work has received more than 2000 citations. Aside from security and privacy, Elaine is also interested in data mining. In particular, she and her team won the IJCNN/Kaggle Social Network Challenge in 2011.

Student Seminar

March 30, 2012

Over the past decade, many wireless sensor network systems have been deployed in various real-life environments with the goal of remotely collecting data from the physical world. Such low-power wireless systems, when combined with low-power medical sensors, have the potential to alleviate many of the manual routine tasks in the healthcare domain as well. In turn, such automation of the patient monitoring process can help increase the quality of patient care in busy clinical environments and at disaster scenes.

In this dissertation, we present MEDiSN, a wireless sensor network system for monitoring the physiological data of patients in clinical environments or victims at disaster sites. MEDiSN consists of mobile patient monitoring devices that collect physiological data and connect to a self-configuring wireless backbone network to relay their data to a gateway. As a preparation step for our deployments, we experimentally analyze the wireless channel in our target clinical environment and, finally, evaluate the performance of MEDiSN in a real clinical environment through multiple pilot studies. Furthermore, based on the development and deployment experiences of MEDiSN, this dissertation identifies several factors that can help wireless sensing systems expand their horizons to new applications.

As an effort to address these factors and, moreover, to expand the application coverage of the MEDiSN system, this dissertation introduces three major improvements made to MEDiSN. First, it introduces Egs, a flexible ARM Cortex-M3-based mote platform that provides application developers with a powerful yet energy-efficient sensing platform for a variety of applications. Second, it discusses the design and implementation of Internet-standards-compliant networking stacks for sensor networks. Specifically, we present an evaluation of the IETF RPL routing protocol in various configurations and also present MoMoRo, a supporting layer designed to assist data collection protocols, such as RPL, in providing mobile devices in MEDiSN with high packet reception performance while minimizing packet overhead. The final part of the dissertation presents two adaptive transmission power control techniques for mobile nodes in low-power wireless networks, designed to minimize energy consumption and maximize bandwidth utilization.

Speaker Biography: JeongGil Ko received his Bachelor of Engineering degree in computer science and engineering from Korea University in 2007, where he also worked as an undergraduate research student with Dr. Hyogon Kim at the Wireless Data Communications Laboratory from 2005 to 2007. He received his Master of Science in Engineering degree from the Department of Computer Science at Johns Hopkins University in 2009. At Johns Hopkins, he has been a member of the Hopkins interNetworking Research Group (HiNRG) led by Dr. Andreas Terzis. In 2010, he was a visiting researcher at the Stanford Information Networking Group (SING) with Dr. Philip Levis at Stanford University. He is a recipient of the Abel Wolman Fellowship, awarded by the Whiting School of Engineering at the Johns Hopkins University in 2007. His research interests include wireless healthcare sensing systems, low-power embedded network system and protocol design, and the deployment of such embedded sensing systems in real environments.

Upon completing his Ph.D. in 2012 at Johns Hopkins, JeongGil Ko will continue his research at the Electronics and Telecommunications Research Institute (ETRI) in Daejeon, Korea as a research engineer.

March 30, 2012

The traditional supervised machine learning paradigm is inadequate for a wide array of potential machine learning applications in which the learning algorithm decides on an action in the real world and gets feedback about that action. This inadequacy results in kludgy systems (such as ad targeting at internet companies) or in deep systemic mistrust and skepticism (as in personalized medicine or adaptive clinical trials). I will discuss a new formal basis, algorithms, and practical tricks for doing machine learning in this setting.

Speaker Biography: John Langford is a computer scientist, working as a senior researcher at Yahoo! Research. He studied Physics and Computer Science at the California Institute of Technology, earning a double bachelor’s degree in 1997, and received his Ph.D. from Carnegie Mellon University in 2002. Previously, he was affiliated with the Toyota Technological Institute and IBM’s Watson Research Center. He is also the author of the popular machine learning weblog hunch.net and the principal developer of Vowpal Wabbit.

April 3, 2012

From the activities of the US Patent Office or the National Institutes of Health to communications between scientists or political legislators, complex social processes—groups of people interacting with each other in order to achieve specific and sometimes contradictory goals—underlie almost all human endeavor. In order to draw thorough, data-driven conclusions about complex social processes, researchers and decision-makers need new quantitative tools for exploring, explaining, and making predictions using massive collections of interaction data. In this talk, I will discuss the development of novel machine learning methods for modeling interaction data, focusing on the interplay between theory and practice. I will concentrate on a class of models known as statistical topic models, which automatically infer groups of semantically-related words (topics) from word co-occurrence patterns in documents. These topics can be used to detect emergent areas of innovation, identify communities, and track trends across languages. Until recently, most statistical topic models relied on two unchallenged prior beliefs. I will explain how challenging these beliefs increases the robustness of topic models to the skewed word frequency distributions common in document collections. I will also talk about a) the creation of a publicly-available search tool for National Institutes of Health (NIH) grants, intended to facilitate navigation and discovery of NIH-funded research, and b) a new statistical model of network structure and content for modeling interaction patterns in intra-governmental communication networks. Finally, I will briefly provide an overview of some of my ongoing and future research directions.
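For readers unfamiliar with topic models, the following scikit-learn snippet shows the vanilla machinery on a hypothetical toy corpus (plain LDA with default priors; the talk's contribution concerns, among other things, moving beyond such default prior choices):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "nih grant funds cancer genomics study",          # hypothetical documents
    "senate committee debates appropriations bill",
    "tumor sequencing reveals somatic mutations",
    "legislators negotiate budget amendments",
]

vec = CountVectorizer()
counts = vec.fit_transform(docs)                      # word co-occurrence input
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vec.get_feature_names_out()
for k, weights in enumerate(lda.components_):         # one row per topic
    top = [terms[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {k}: {top}")
```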

Speaker Biography: Hanna Wallach is an assistant professor in the Department of Computer Science at the University of Massachusetts Amherst. She is one of five core faculty members involved in UMass’s newly-formed computational social science research initiative. Previously, Hanna was a postdoctoral researcher, also at UMass, where she developed Bayesian latent variable models for analyzing complex data regarding communication and collaboration within scientific and technological communities. Her recent work (with Ryan Adams and Zoubin Ghahramani) on infinite belief networks won the best paper award at AISTATS 2010. Hanna has co-organized multiple workshops on Bayesian latent variable modeling and computational social science. Her tutorial on conditional random fields is widely referenced and used in machine learning courses around the world. As well as her research, Hanna works to promote and support women’s involvement in computing. In 2006, she co-founded the annual workshop for women in machine learning, in order to give female faculty, research scientists, postdoctoral researchers, and graduate students an opportunity to meet, exchange research ideas, and build mentoring and networking relationships. In her not-so-spare time, Hanna is a member of Pioneer Valley Roller Derby, where she is better known as Logistic Aggression.

April 10, 2012

The mapping of expression quantitative trait loci (eQTLs) has emerged as an important tool for linking genetic variation to changes in gene regulation. However, it remains difficult to identify the causal variants underlying eQTLs, and little is known about the regulatory mechanisms by which they act. We used DNase I sequencing to measure chromatin accessibility in 70 Yoruba lymphoblastoid cell lines, for which genome-wide genotypes and estimates of gene expression levels are also available. We obtained a total of 2.7 billion uniquely mapped DNase I-sequencing (DNase-seq) reads, which allowed us to infer transcription factor binding by exploiting the specific DNase I cleavage footprint left on 827,000 sites corresponding to more than 100 factors. Across individuals, we identified 8,902 locations at which the DNase-seq read depth correlated significantly with genotype at a nearby locus (FDR = 10%). We call such genetic variants ‘DNase I sensitivity quantitative trait loci’ (dsQTLs). We found that dsQTLs are strongly enriched within inferred transcription factor binding sites and are frequently associated with allele-specific changes in transcription factor binding. A substantial number of dsQTLs are also associated with variation in the expression levels of nearby genes. Our observations indicate that dsQTLs are highly abundant in the human genome and are likely to be important contributors to phenotypic variation.
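Schematically, the dsQTL scan amounts to a per-site association test followed by false-discovery-rate control. The sketch below (hypothetical array shapes and a plain Pearson test with Benjamini-Hochberg correction, not the paper's exact pipeline) conveys the structure:

```python
import numpy as np
from scipy import stats

def dsqtl_scan(read_depth: np.ndarray, genotype: np.ndarray, fdr: float = 0.10):
    """read_depth, genotype: (sites x individuals) arrays, where genotype
    holds allele dosages (0/1/2) at each site's best nearby variant.
    Returns indices of sites significant at the given FDR."""
    pvals = np.array([stats.pearsonr(read_depth[i], genotype[i])[1]
                      for i in range(read_depth.shape[0])])
    order = np.argsort(pvals)
    m = len(pvals)
    passed = pvals[order] <= fdr * np.arange(1, m + 1) / m   # Benjamini-Hochberg
    k = int(passed.nonzero()[0].max()) + 1 if passed.any() else 0
    return order[:k]
```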

Speaker Biography: Roger Pique-Regi received his telecommunications engineering degree from the Universitat Politecnica de Catalunya, BarcelonaTech, in 2002 and his Ph.D. degree in electrical engineering from the University of Southern California in 2009. His Ph.D. work was supported by the La Caixa Fellowship program. Since 2009, he has been a postdoctoral fellow in the Department of Human Genetics at the University of Chicago, where his research is supported by the Chicago Fellows program. He is a Member of the IEEE, the American Association for the Advancement of Science, and the International Society for Computational Biology. His research interests are in the area of genome signal processing, statistical machine learning, human genetics, and gene regulation.

April 17, 2012

Internet and personal photo collections now add up to over a trillion photos, and people appear in most of them. The availability of so many photos presents a unique opportunity to model virtually the whole human population. Modeling humans is key to understanding how people interact with the environment, and to future human-machine interfaces. In this talk, I will describe our work on 3D shape and motion estimation of a human face from large photo collections, as well as novel techniques for browsing those collections. This represents an exciting breakthrough towards modeling and visualizing any person just from their available photos. Part of this work is now included in Google’s Picasa.

Speaker Biography: Ira Kemelmacher-Shlizerman is a postdoctoral researcher in the Department of Computer Science and Engineering at the University of Washington. She received her Ph.D. in computer science and applied mathematics at the Weizmann Institute of Science in 2009. Dr. Kemelmacher-Shlizerman works in computer vision and graphics, with a particular interest in developing computational tools for modeling people from Internet and personal photo collections. Her recent research has been covered by stories in New Scientist, CBS, Discovery News, and others. She has also consulted for Google.

April 26, 2012

Single living cells can chase targets, change shape in response to functional cues, and self-replicate. How can a simple group of molecules inside a cell orchestrate such complex behaviors? While individual molecules may have limited function, a network of chemical reactions can simulate a Turing machine, and it is widely believed that the complex control algorithms in cells are the result of networks of interactions involving many molecules. I’ll describe how we can investigate the power of such networks of molecular interactions by trying to reimplement computations, construction processes, and control algorithms as molecular interactions involving synthetic DNA molecules. The chemistry and structure of DNA is well understood, and we can engineer specific interactions between DNA molecules by designing their sequences. We can therefore focus on the dynamics of systems of interactions rather than the chemistry of individual interactions. I’ll show how we can use DNA to build a self-replicator whose alphabet is a series of DNA blocks, and how to program a set of molecules to execute a “search and capture” process that can form a tether between two points of unknown location. From these examples we learn that molecular reaction networks are surprisingly powerful: a relatively small set of molecules can both compute and learn arbitrarily complex patterns, and even though molecular interactions are stochastic and unreliable, we can design systems of molecules whose behavior is robust. I’ll close by describing some new work designing computational “morphogenesis” processes that could allow us to produce complex standing chemical patterns in three dimensions.

Speaker Biography: Rebecca Schulman studied computer science and mathematics at MIT, where she worked with Gerald Sussman’s group as part of the amorphous computing project. After a several-year hiatus in Silicon Valley, during which she helped start a company focused on natural language access to databases and wrote Linux software at Eazel, Dr. Schulman returned to graduate school to study molecular computation. She received her PhD in computation and neural systems from Caltech in 2007, where she studied under Erik Winfree. From 2008 to 2011 she was a Miller research fellow in the physics department at the University of California, Berkeley. Dr. Schulman is currently an assistant professor in chemical and biomolecular engineering at Johns Hopkins; her group focuses on the design and characterization of complex self-assembly processes and the construction of computational materials and structures.

June 25, 2012

Remote attestation is the process of securely verifying internal state of a remote hardware platform. It can be achieved either statically (at boot time) or dynamically, at run-time. Prior software-based techniques lack concrete security guarantees, while hardware-based approaches involve security co-processors that are too costly for low-end devices. In this work, we develop a new primitive (called SMART) based on bottom-up hardware/software co-design. SMART represents a simple, efficient and secure approach for establishing a dynamic root of trust in a remote embedded device. It is aimed at low-end micro-controller units (MCUs) that lack specialized memory management or protection features. SMART requires minimal changes to existing MCUs and assumes few restrictions on adversarial capabilities. We demonstrate both practicality and feasibility of SMART on two common MCU platforms.

Speaker Biography: Gene Tsudik is a professor of Computer Science at the University of California, Irvine (UCI). He received his PhD in Computer Science from USC in 1991. Before coming to UCI in 2000, he was at IBM Zurich Research Laboratory (1991-1996) and USC/ISI (1996-2000). Over the years, his research interests included many topics in security and applied cryptography. He currently serves as Director of Secure Computing and Networking Center (SCONCE) and Director of the Networked Systems (NetSys) Graduate Program at UCI. Since 2009, he is the Editor-in-Chief of ACM Transactions on Information and Systems Security (TISSEC).

July 20, 2012

Message authentication is well studied in the literature, and various efficient solutions have been suggested and analyzed. This is, however, not the case for authentication of data streams, in which the message is very long, possibly infinite, and not known in advance to the sender. Trivial solutions for authenticating data streams either suffer from a long delay at the receiver’s end or cannot perform well when the communication channel is noisy.

We suggest an efficient, constant rate, authentication scheme for data streams over a noisy channel in the shared-randomness model. We show that for every given noise rate c < 1, there exists a scheme that correctly decodes and authenticates a (1−c)-fraction of the stream sent so far, with high probability. We also show that no constant-rate authentication scheme recovers more than a (1−c)-fraction of the stream, which implies that our scheme is optimal.

Joint work with Matthew Franklin, Rafail Ostrovsky, and Leonard Schulman.

Speaker Biography: Ran Gelles is a PhD student in the Computer Science Department at UCLA. He works in the Theory and Cryptography lab with Professor Rafail Ostrovsky and Professor Amit Sahai.

He received a B.Sc. in Computer Engineering in 2003 and an M.Sc. in Computer Science in 2009, both from the Technion, Israel.