Spring 2015

Video Recording >>

January 29, 2015

Security and performance are critical goals for distributed systems. The increased design complexity, incomplete expertise of developers, and limited functionality of existing testing tools often result in bugs and vulnerabilities that prevent implementations from achieving their design goals in practice. Many of these bugs, vulnerabilities, and misconfigurations manifest after the code has already been deployed making the debugging process difficult and costly. In order to understand the limitations and increase the robustness of distributed services and network protocols there is benefit in performing adversarial testing. Adversarial testing subjects implementations to testing beyond their basic functionality by stressing the system and ultimately performing destructive testing.

We introduce Turret a platform that provides support for automated adversarial testing for message-passing distributed systems and network protocols. The platform uses a network emulator to create reproducible network conditions and virtualization to run unmodified binaries of the target system. The platform requires the user to provide a description of the protocol messages and corresponding performance metrics. Turret supports distributed services such as intrusion-tolerant replication, application-layer multicast, and routing protocols, running in both wired and wireless networks. We applied Turret to 5 distributed systems and 5 wireless routing protocols and found a total of 70 attacks and bugs. We also discuss how we used Turret as an automated Red Team for our own protocols, and as a testing and grading environment for the distributed systems class at Purdue University.

Speaker Biography: Cristina Nita-Rotaru is an Associate Professor in the Department of Computer Science at Purdue University where she established the Dependable and Secure Distributed Systems Laboratory (DS2), and is a member of the Center for Education and Research in Information Assurance and Security (CERIAS). Her research lies at the intersection of information security, distributed systems, and computer networks. The overarching goal of her work is designing and building practical distributed systems and network protocols that are robust to failures and attacks while coping with the resource constraints existent in computing systems and networks.

Cristina Nita-Rotaru is a recipient of the NSF Career Award in 2006. She is also a recipient of the Purdue Teaching for Tomorrow Award in 2007, Purdue Excellence in Research Award, Seeds for Success in 2012, Purdue College of Science Research Award in 2013. She has served on the Technical Program Committee of numerous conferences in security, networking, and distributed systems. She served as an Assistant Director for CERIAS (2011 – 2013). She was an Associate Editor for Elsevier Computer Communications (2008 – 2011), Elsevier Computer Networks (2012 – 2014), IEEE Transactions on Computers (2011 – 2014), and ACM Transactions on Information Systems Security (2009 – 2013). She is currently an Associate Editor for IEEE Transactions on Mobile Computing and IEEE Transactions on Dependable and Secure Systems.

Video Recording >>

February 5, 2015

The emergence of social networks offers us an unprecedented opportunity to understand how our cities are constructed by showing us where we are connected and where we are divided. Since 2007 Dave Troy has been building maps to try to help us see cities differently: not as streets and buildings, but as a set of relationships and interactions between people. There are many implications to this work in fields ranging from urban planning to policing, economic development, and social justice. Currently this work is being done with fairly modest datasets, but with expanded coverage (Dave’s recent TED talk; 815K+ views) , additional, larger datasets are becoming available which require additional processing capacity. Dave will discuss the current state of the work, including some of the more interesting problems he faces, including scaling to map very large cities, presenting changes over time, and further automating the visualization process. He will discuss some possible avenues for exploration involving large scale map-reduce approaches to performing calculations such as community detection and graph layout. He is actively seeking collaborators who are be interested in working on these difficult challenges.

Speaker Biography: Dave Troy (JHU ’96) is a serial entrepreneur and community activist in Baltimore, Maryland. He is currently CEO, co-founder, and product architect at 410 Labs, maker of the popular e-mail management tool Mailstrom.co. He has been acknowledged by the founding team at Twitter as the first developer to utilize the Twitter API, with his project “Twittervision,” which was featured in the 2008 MoMA exhibition “Design and the Elastic Mind,” curated by Paola Antonelli. His new crowdsourced project Peoplemaps.org uses social network data to map cities. He is also organizer of TEDxMidAtlantic and is passionate about data, cities, and entrepreneurship. He lives in Baltimore with his wife and two children.

February 6, 2015

One of the most important problems in biology and medicine is determining the relationship between the sequence of a genome and the traits that it has, as this can provide insight into the risks for disease, the nature of development, and the forces of evolution. One of the most powerful methods we have for addressing this question is large-scale computational analysis of genome sequences within different tissues, populations, and species.

These questions have motivated us to develop several new, highly scalable algorithms and high performance systems for constructing and analyzing large collections of strings, trees, and graphs of biological sequence data. One of our main strengths is in the problem of de novo genome assembly, where the genome of a species is reconstructed from billions of short DNA sequencing reads through the construction and transformations of large sequence graphs. Recently we have been very focused on solving this problem using new single molecule sequencing technologies from PacBio and Oxford Nanopore that produce much longer reads, approaching 100,000 bp instead of mere hundreds, but suffer from very high error rates (15% to 40% error). Despite their low fidelity, we have developed the algorithms to overcome most errors and have used the data to assemble several very high quality microbial, plant, and animal genomes.

We have applied these and other techniques to gain new insights into the genetics of autism, the progression of cancer, and the evolution of plant and animal genomes. Looking forward, we have begun developing the computational theory and scalable systems to construct ‘the graph of life’: a graph encoding how a set of genomes relate to each other. In our preliminary work, we developed a new algorithm SplitMEM to analyze dozens of microbial genomes at once, but our ultimate goal is to scale these ideas to assemble a graph of all sequence variations in the human population.

Speaker Biography: Michael Schatz is an Associate Professor in the Simons Center for Quantitative Biology at Cold Spring Harbor Laboratory. His research focuses on developing scalable algorithms and systems for biological sequence analysis. Schatz received his Ph.D. in Computer Science from the University of Maryland in 2010, and his B.S. in Computer Science from Carnegie Mellon University in 2000, with 4 years at the Institute for Genomic Research (TIGR) in between. For more information, please visit his website at: http://schatzlab.cshl.edu

Video Recording >>

February 10, 2015

With the prevalence of mobile and wearable cameras and video-recorders, and global deployment of surveillance systems in space, air, sea, ground, and social network, the amount of unprocessed images and videos is massive, calling for a need for effective computational means for automatic analysis, understanding, summarization, and organization of visual data. In this talk, I will present some of our recent work on scalable machine learning approaches to image and video understanding. Specifically, I will focus on structured multitask methods for large-scale image classification, and online latent space methods for video analysis and summarization.

This is joint work with Bin Zhao.

Speaker Biography: Dr. Eric Xing is a Professor of Machine Learning in the School of Computer Science at Carnegie Mellon University. His principal research interests lie in the development of machine learning and statistical methodology; especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional, multimodal, and dynamic possible worlds in social and biological systems. Professor Xing received his Ph.D. in Computer Science from UC Berkeley. He is an associate editor of the Annals of Applied Statistics (AOAS), the Journal of American Statistical Association (JASA), the IEEE Transaction of Pattern Analysis and Machine Intelligence (PAMI), the PLoS Journal of Computational Biology, and an Action Editor of the Machine Learning Journal (MLJ), the Journal of Machine Learning Research (JMLR). He is a member of the DARPA Information Science and Technology (ISAT) Advisory Group, and a Program Chair of ICML 2014.

Video Recording >>

Distinguished Lecturer

February 12, 2015

Is the universe inherently deterministic or probabilistic? Perhaps more importantly – can we tell the difference between the two?

Humanity has pondered the meaning and utility of randomness for millennia. There is a remarkable variety of ways in which we utilize perfect coin tosses to our advantage: in statistics, cryptography, game theory, algorithms, gambling… Indeed, randomness seems indispensable! Which of these applications survive if the universe had no randomness in it at all? Which of them survive if only poor quality randomness is available, e.g. that arises from “unpredictable” phenomena like the weather or the stock market?

A computational theory of randomness, developed in the past three decades, reveals (perhaps counter-intuitively) that very little is lost in such deterministic or weakly random worlds. In the talk I’ll explain the main ideas and results of this theory. In particular we will discuss the basic notion(s) of “pseudo-randomness”, which turns out to capture fundamental problems in both Math and CS, and underlies a rapidly growing interaction area between them.

Speaker Biography: Avi Wigderson is Professor of Mathematics at the Institute for Advanced Study (Princeton), where he leads the Institute’s Computer Science and Discrete Math Program. He received his Ph.D. in computer science in 1983 from Princeton University. During 1986-2001 he held a permanent position at the Hebrew University Computer Science Institute and served as chair from 1992-95. Wigderson has held visiting positions at UC Berkeley, IBM Research, the Mathematical Sciences Research Institute, and the Institute for Advanced Study. He was an invited speaker at the International Congress of Mathematicians on two occasions, and was awarded the Nevanlinna Prize for outstanding contributions in mathematical aspects of information sciences in 1994. He gave the AMS Gibbs Lectures and received the AMS Conant Prize for mathematical exposition in 2008, and received the Gödel Prize, which recognizes outstanding papers in theoretical computer science, in 2009. Wigderson was elected to the American Academy of Arts and Sciences in 2011, and to the National Academy of Sciences in 2013.

Avi Wigderson works in the Theory of Computation, a field which studies the mathematical foundations of computer science. He is interested in the broad aspects of this exciting and rapidly developing field, including algorithms, Boolean and arithmetic circuit complexity, communication and proof complexity, cryptography, randomness, as well as the interactions of the field with other sciences including mathematics, physics, biology and economics.

Video Recording >>

February 17, 2015

Computing is undergoing a major shift. Third-party applications hosted in online software markets have become ubiquitous on all kinds of platforms: mobile phones, Web browsers, gaming devices, even household robots. These applications often include yet more third-party code for advertising, analytics, etc. These trends have dramatically increased the amount of bad code throughout the software stack – buggy code, malicious code, code that overcollects private information intentionally or by accident, overprivileged code vulnerable to abuse – as well as the amount of sensitive data processed by bad code.

In this talk, I will demonstrate that existing application platforms are ill-suited to dealing with bad code, thus causing security and privacy problems. I will then show how to isolate bad code without affecting its useful functionality, by redesigning the interfaces across the software stack and controlling the information released to the applications by the platform. I will also show how automated testing can identify bad code and help developers improve their applications.

Speaker Biography: Suman Jana is a postdoctoral researcher at Stanford University. He earned his PhD in 2014 from the University of Texas, where he was supported by the Google PhD Fellowship. He is broadly interested in identifying fundamental flaws in existing systems and building new systems with strong security and privacy guarantees. Suman received the 2014 PET Award for Outstanding Research in Privacy-Enhancing Technologies, Best Practical Paper Award from the 2014 IEEE Symposium on Security and Privacy (Oakland), Best Student Paper Award from the 2012 IEEE Symposium on Security and Privacy, and the 2012 Best Applied Security Paper Award. Suman’s research has been widely covered in popular media, and his code has been deployed at Google, Mozilla, and Apache.

Video Recording >>

February 19, 2015

Insecure computer systems in the wild can enable consequences ranging from crime to mass surveillance to (in the case of cyberphysical systems) physical destruction or even death. But how can anyone know if a particular computer system is insecure? One can rely on the representations of the system designers or manufacturers; however, the history of computers is replete with examples of claims that products are secure which are subsequently proven false. This is, in part, because computer systems tend to exhibit unanticipated, unintended, or poorly-understood behaviors that have complex interactions. As a result, the best way to learn about the security of a system is to take a detailed look at the hardware and software that comprise the system, and their interactions. In the common case where hardware designs and software source code are not available, reverse engineering the system is often the best way to derive ground-truth data on how the system functions.

In this talk, I’ll describe some of my recent research where reverse engineering played a key role, covering TLS implementations with backdoors as well as cyberphysical systems. I’ll also describe the scientific nature of reverse engineering as well as the positive, real-world impact reverse engineering can have on security and safety.

Speaker Biography: Stephen Checkoway is an Assistant Research Professor in the Department of Computer Science at Johns Hopkins University and a member of the Johns Hopkins University Information Security Institute. Checkoway’s research focuses on the security of embedded and cyberphysical systems. He has demonstrated exploitable vulnerabilities in such embedded systems as electronic voting machines, laptop webcams, automobiles, and airport scanners. He received his Ph.D. in Computer Science from the University of California, San Diego in 2012.

February 24, 2015

Large Internet services, “big science”, and increasingly, industrial R&D, are all powered by warehouse scale computing — tens of thousands of servers connected over a network. Increasing parallelism and big data analytics require that the network provide high throughput connectivity. In this talk, I will describe Jellyfish, our proposed design for such a network. Jellyfish uses a random graph topology, allowing construction at arbitrary sizes, easier network growth over time, and integration of newer and more powerful network equipment as it becomes available — practical problems that rigidly structured traditional networks fail to address. Surprisingly, Jellyfish also beats state-of-the-art real-world designs on throughput, by as much as 40%. In fact, we show that Jellyfish networks achieve near optimal throughput, i.e., one cannot build, using the same networking equipment, any network that provides much higher throughput.

Speaker Biography: Ankit Singla is a PhD Candidate in Computer Science at the University of Illinois at Urbana-Champaign. He received his Bachelors in Technology (Computer Science) at IIT Bombay, India, in 2008. He is a winner of the 2012 Google PhD Fellowship. These days, he is refining a plan for building a speed-of-light Internet, which he loses no opportunity to talk about.

Video Recording >>

February 26, 2015

Bitcoin is the first digital currency to see widespread adoption. While payments are conducted between pseudonyms, Bitcoin cannot offer strong privacy guarantees: payment transactions are recorded in a public decentralized ledger, from which much information can be deduced. In this talk I will discuss two new proposals to add privacy to Bitcoin. The first, called Zerocoin, tackles these problems by unlinking transactions from the payment’s origin. Yet, it still reveals payments’ destinations and amounts, and is limited in functionality. In the second approach Idescribe a full-fledged ledger-based digital currency with strong privacy guarantees. These results leverage recent advances in zero-knowledge Succinct Non-interactive ARguments of Knowledge (zk-SNARKs). Finally, I describe how these results can be used to construct advanced primitives such as decentralized anonymous credentials, which permit new privacy applications for decentralized systems.

Speaker Biography: Prof. Matthew Green is a Research Professor at the Johns Hopkins University Information Security Institute. His research focus is on cryptographic techniques for maintaining users’ privacy, and on technologies that enable the deployment of privacy-preserving protocols. From 2004-2011, Green served as CTO of Independent Security Evaluators, a custom security evaluation firm with a global client base. Along with a team at Johns Hopkins and RSA Laboratories, he discovered flaws in the Texas Instruments Digital Signature Transponder, a cryptographically-enabled RFID device used in the Exxon Speedpass payment system and in millions of vehicle immobilizers.

Video Recording >>

March 3, 2015

Over the past several decades, DRAM and hard disks have been the dominating computer memory and storage technologies. In recent years, emerging non-volatile memory (NVM) technologies such as flash are orders of magnitude faster than hard disks. Next-generation NVMs such as phase change memory and the memristor are byte-addressable and promise to have close-to-DRAM performance. These NVM technologies are poised to radically alter the performance landscape for storage systems, blurring the line between memory and storage. To fully utilize them and deliver their good performance to applications, system designers need to rethink how to efficiently manage them as reliable storage at different layers.

In this talk, I will discuss new challenges NVM technologies present to system designers and the systems that I have built to efficiently manage them. My approach is to vertically rethink different layers in a system and optimize them in a coherent way to deliver all the capabilities of the new NVM technologies. Specifically, I will present Mojim, a system that provides reliable and highly-available non-volatile main memory in data center environments. Mojim uses a flexible architecture and efficient software and networking stacks to achieve good data replication performance that is close to or even better than un-replicated single-node system. I will also present another system that I built for flash-based solid state drives (SSDs), where I remove excess indirection and its overhead in current SSDs. Finally, I will discuss my future research plans in building systems for “little” big data.

Speaker Biography: Yiying Zhang is a postdoctoral scholar in the Department of Computer Science and Engineering at the University of California, San Diego, where she works with Professor Steven Swanson in the Non-Volatile Systems Lab. Her research interests span operating systems, distributed systems, computer architecture, networking, and data analytics, with a focus on building fast, reliable, and flexible systems for emerging hardware and applications. She is currently interested in system implications of non-volatile main memory. In the past, she has worked in various aspects of storage systems, including removing excess indirection in storage systems, storage-level caching, and redundant arrays. Yiying received her Ph.D. from the Department of Computer Sciences at the University of Wisconsin-Madison in 2013, under the supervision of Professors Remzi Arpaci-Dusseau and Andrea Arpaci-Dusseau.

Video Recording >>

March 10, 2015

As data from far-reaching sources is collected, aggregated, and re-packaged to enable new and smarter applications, confidentiality and data security are at greater risk than ever before. Some of the most surprising and invasive threats to materialize in recent years are brought about by so-called inference attacks: successful attempts to learn sensitive information by leveraging public data such as social network updates, published research articles, and web APIs.

In this talk, I will focus on two of my research efforts to better understand and defend against these attacks. First I will discuss work that examines the privacy risks that arise when machine learning models are used in a popular medical application, and illustrate the consequences of applying differential privacy as a defense. This work uncovered a new type of inference attack on machine learning models, and shows via an in-depth case study how to understand privacy “in situ” by balancing the attacker’s chance of success against the likelihood of harmful medical outcomes. The second part of the talk will detail work that helps developers correctly write privacy-aware applications using verification tools. I will illustrate how a wide range of confidentiality guarantees can be framed in terms of a new logical primitive called Satisfiability Modulo Counting, and describe a tool that I have developed around this primitive that automatically finds privacy bugs in software (or proves that the software is bug-free). Through a better understanding of how proposed defenses impact real applications, and by providing tools that help developers implement the correct defense for their task, we can begin to proactively identify potential threats to privacy and take steps to ensure that they will not surface in practice.

Speaker Biography: Matt Fredrikson is a Ph.D. candidate in the department of Computer Sciences at the University of Wisconsin-Madison. His research interests lie at the intersection of security, privacy, and formal methods, covering topics in software security, privacy issues associated with machine-learning models, and applied cryptography. His work has been profiled by Reuters, Technology Review, and New Scientist, and received the best paper award at USENIX Security 2014. He is a recipient of the Microsoft Research Graduate Fellowship Award.

Video Recording >>

March 12, 2015

Today’s operating systems are large, complex, and plagued with vulnerabilities that allow perpetrators to exploit them for profit. The constant rise in the number of software weaknesses, coupled with the sophistication of modern adversaries, make the need for effective and adaptive defenses more critical than ever. In this talk, I will present my work on developing novel protection mechanisms and exploit prevention techniques that improve the security posture of commodity operating systems. In particular, I will discuss kGuard and XPFO, two projects whose goal is to harden contemporary OSes against attacks that exploit vulnerabilities in kernel code, without entailing extra software (e.g., hypervisor or VMM) or special hardware. In addition, I will talk about ret2dir, a new kernel exploitation technique that I developed, which uncovered how fundamental OS design practices and implementation decisions can significantly weaken the effectiveness of state of-the-art kernel protection mechanisms.

Speaker Biography: Vasileios (Vasilis) Kemerlis is a PhD candidate in the Department of Computer Science at Columbia University. His research interests are in the areas of systems and software security, with a focus on OS kernel protection, automated software hardening, and information-flow tracking. His work on kernel exploitation has been profiled by press and social media outlets, including Dark Reading, Hacker News, and Reddit, won the first prize in the Applied Security Research Paper competition, at the Cyber Security Awareness Week (CSAW) 2014, and led to the adoption of kernel hardening techniques from OpenBSD and Qualcomm’s MSM Android. Vasilis holds a MPhil (2013) and MS (2010) in Computer Science from Columbia University, and a BS (2006) in Computer Science from Athens University of Economics and Business.

Video Recording >>

March 24, 2015

Empirical scientists seek not just surface descriptions of the observed data, but deeper explanations of why things happened the way they did, and how the world would be like had things happened differently. With the unprecedented accumulation of data (or, “big data”), researchers are becoming increasingly aware of the fact that traditional statistical techniques, including those based on artificial intelligence and machine learning, must be enriched with two additional ingredients in order to construct such explanations:

  1. the ability to integrate data from multiple, heterogeneous sources, and
  2. the ability to distinguish causal from associational relationships.

In this talk, I will present a theory of causal generalization that provides a principled way for fusing pieces of empirical evidence coming from multiple, heterogeneous sources. I will first introduce a formal language capable of encoding the assumptions necessary to express each problem instance. I will then present conditions and algorithms for deciding whether a given problem instance admits a consistent estimate for the target effects and, if feasible, fuse information from various sources to synthesize such an estimate. These results subsume the analyses conducted in various fields in the empirical sciences, including “external validity,” “meta-analysis,” “heterogeneity,” “quasi-experiments,” “transportability,” and “sampling selection bias.” I will conclude by presenting new challenges and opportunities opened by this research.

Speaker Biography: Elias Bareinboim is a postdoctoral scholar (and was a Ph.D. student) in the Computer Science Department at the University of California, Los Angeles, working with Judea Pearl. His interests are in causal and counterfactual inferences and their applications. He is also broadly interested in artificial intelligence, machine learning, robotics, and philosophy of science. His doctoral thesis provides the first general framework for solving the generalizability problems in causal inference — which has applications across all the empirical sciences. Bareinboim’s recognitions include the Dan David Prize Scholarship, the Yahoo! Key Scientific Challenges Award, the Outstanding Paper Award at the 2014 Annual Conference of the American Association for Artificial Intelligence (AAAI), and the Edward K. Rice Outstanding Graduate Student.


March 27, 2015

Medical ultrasound (US) imaging is a popular and convenient medical imaging modality thanks to its mobility, non-ionizing radiation, ease-of-use, and real-time data acquisition. Conventional US brightness mode (B-Mode) is one type of diagnostic medical imaging modality that represents tissue morphology by collecting and displaying the intensity information of a reflected acoustic wave. Moreover, US B-Mode imaging is frequently integrated with tracking systems and robotic systems in image-guided therapy (IGT) systems. Recently, these systems have also begun to incorporate advanced US imaging such as US elasticity imaging, photoacoustic imaging, and thermal imaging. Several software frameworks and toolkits have been developed for US imaging research and the integration of US data acquisition, processing and display with existing IGT systems. However, there is no software framework or toolkit that supports advanced US imaging research and advanced US IGT systems by providing low-level US data (channel data or radio-frequency (RF) data) essential for advanced US imaging. In this dissertation, we propose a new medical US imaging and interventional component framework for advanced US image-guided therapy based on network-distributed modularity, real-time computation and communication, and open-interface design specifications. Consequently, the framework can provide a modular research environment by supporting communication interfaces between heterogeneous systems to allow for flexible interventional US imaging research, and easy reconfiguration of an entire interventional US imaging system by adding or removing devices or equipment specific to each therapy. In addition, our proposed framework offers real-time synchronization between data from multiple data acquisition devices for advanced interventional US imaging research and integration of the US imaging system with other IGT systems. Moreover, we can easily implement and test new advanced ultrasound imaging techniques inside the proposed framework in real-time because our software framework is designed and optimized for advanced ultrasound research. The system’s flexibility, real-time performance, and open-interface are demonstrated and evaluated through performing experimental tests for several applications.

Speaker Biography: Hyun Jae Kang received his B.S.E degree in Mechanical engineering at Kyungpook National University (Daegu, Korea), his M.S.E degree in Mechanical Design & Production Engineering at Seoul National University (Seoul, Korea), and his M.S.E. degree in Computer Science at the Johns Hopkins University. Before coming to JHU, he worked as a R&D team manager and researcher in the image-guided surgical navigation team at CyberMed Inc. (Seoul, Korea) and was involved in the development of surgical navigation systems for neurosurgery, image-free total knee replacement and dental implants, as well as 16 patents. He is currently a Ph.D. candidate in Computer Science at JHU advised by Dr. Emad M. Boctor. During his Ph.D., his research has been dedicated to a real-time medical ultrasound imaging system and an interventional component-based framework for advanced ultrasound guided surgical systems. His research interests include modular software frameworks, network-distributed systems, and medical ultrasound image reconstruction.

Video Recording >>

April 7, 2015

Storage services form the platform on which widely-used cloud services, mobile applications, data analytics engines, and transactional databases are built. Such services are trusted with irreplaceable personal and commercial information by users, companies, and even governments.

The designers of storage services often have to choose between performance and reliability. If the developer makes the system reliable, performance is often significantly reduced. If the developer instead maximizes performance, a crash could lead to data loss and corruption.

In this talk, I describe how to build systems that achieve both strong reliability and high performance. In many systems, reliability is maintained by carefully ordering updates to storage. The key insight is that the low-level mechanism used to enforce ordering is overloaded: it provides durability as well as ordering. I introduce a new primitive, osync(), that decouples ordering from durability of writes. I present Optimistic Crash Consistency, a new crash-recovery protocol that builds on osync() to provide strong reliability guarantees and high performance. I implement these techniques in the Optimistic File System (OptFS) and show that it provides 10X increased performance for some workloads. With researchers in Microsoft, I employ the principles of Optimistic Crash Consistency in a distributed storage system, resulting in 2-5X performance improvements.

Speaker Biography: Vijay Chidambaram is a Ph.D candidate in the Department of Computer Sciences at the University of Wisconsin-Madison. His current research focus is to ensure the reliability of applications in the rapidly changing landscape of storage and cloud computing. Specifically, he has contributed new reliability techniques in (local and distributed) storage systems, and built frameworks for finding reliability bugs in applications. His work has resulted in patent applications by Samsung and Microsoft. He was awarded the Microsoft Research Fellowship in 2014, and the University of Wisconsin-Madison Alumni Scholarship in 2009.


April 9, 2015

The field of robotics is rapidly approaching a paradigm shift in how human beings collaborate, program, monitor, correct and generally interact with robotic systems. Several factors are driving this paradigm shift. First, robotic systems are becoming more capable: The capability of a robot has expanded from limited movement and grasping to complex sensing using force and vision and advanced manipulation. Many new systems are also safe for physical interaction. Along with improvements in hardware have come drastic advances in algorithms for motion planning, object detection and other perception, manipulation and human tracking.

This increase in capability has allowed robots to be used for more and more complex tasks, and in a wider range of use domains. Initially almost all robots were used for large scale industrial manufacturing, where they predominantly saw use for structured tasks such as material handling, welding and painting. However as robotic systems become more capable and expand out of the structured, interaction-free world of large scale manufacturing, and into the unstructured human-centric environments of small scale manufacturing and in-home assistance, we must re-think how humans interact with these systems.

We present several methods for improving human instruction, collaboration and general interaction with robotic systems by following the model of human apprenticeship. We show that such interaction between novice human users and robots is possible by formal modeling of the capabilities of the robot, providing specific information to these capabilities to make them task relevant, and providing representation for building a task from these capabilities in a manner that enables the generalization and reuse of task information. We further use advances in Virtual Reality to provide a dynamic and intuitive robot interaction and instruction environment.

Speaker Biography: Kelleher Guerin received his Bachelors of Science in Optical Engineering from the University of Arizona, and a Masters of Science and Engineering in Electrical Engineering from Carnegie Mellon University. His work has included the design of optical systems for space applications, robot visions for lunar excursion, robotic mining applications, large-scale wall displays, and robotic systems for small manufacturers. He is focused primarily on problems in Human-Robot Interaction, Human-Machine Interaction and Robotics for Manufacturing.

Video Recording >>

April 16, 2015

Many datasets are plagued by unobserved confounders: hidden but relevant variables. The presence of hidden variables obscures many conditional independence constraints in the underlying model, and greatly complicates data analysis. In this talk I consider a type of equality constraint which generalizes conditional independence, and which is a “natural” equality constraint for data generated from the marginal distribution of a DAG graphical model.

I discuss applications of such constraints to recovering causal structure from data, and statistical inference in hidden variable models. To this end, I introduce a new kind of graphical model, called the nested Markov model, which is defined via these constraints just as Bayesian networks and Markov random fields are defined via conditional independence constraints.

I describe parameterizations for nested Markov models with discrete state spaces, together with parameter and structure learning algorithms. I show cases where a single generalized equality constraint is sufficient to completely recover a nested Markov model, and thus the corresponding family of hidden variable DAGs.

Part of this material is based on joint work with Thomas S. Richardson, James M. Robins, and Robin J. Evans.

Part of this material is based on joint work with Judea Pearl.

Speaker Biography: Ilya Shpitser is a Lecturer in Statistics at the University of Southampton. His interests include inference in complex multivariate settings, causal inference from observational data, particularly in longitudinal settings, and foundational issues in causality. Previously, Ilya was a research fellow at the causal inference group headed by James M. Robins at the Harvard School of Public Health. He received his PhD under the supervision of Judea Pearl at UCLA.


April 16, 2015

Since the 1990s, scripting languages such as Python and JavaScript have grown rapidly in popularity. This is due in large part to their flexibility and expressiveness: scripts use a terse syntax, provide highly abstract operations, and allow ad-hoc manipulation of data. Unfortunately, this comes at the cost of performance and scalability. All widely-used scripting languages today are dynamically typed, which severely limits the effectiveness of compile-time optimizations and prevents valuable static analysis tools from being used in debugging. Attempts to retrofit type systems and other static analyses onto scripting languages have had limited success; existing scripting languages were designed without these tools in mind and so contain fundamentally dynamic operations that defy analysis.

This talk gives an overview of the BigBang project, which takes a different approach: we design a scripting language from scratch. We observe that the fundamentally dynamic operations in scripting languages are not necessary to give them their expressive and productive feel. Instead, we select an unusual set of core operations, careful to ensure that they are statically typable, and use them to encode common scripting language functionality. Traditional type theories are unable to capture these operations, so we discuss how techniques from abstract interpretation can be adopted into type theory to produce a sound and decidable type system.

Speaker Biography: Zachary Palmer is a Ph.D. candidate in Computer Science at Johns Hopkins University, advised by Dr. Scott F. Smith. His life as a graduate student started following a carefully-planned and lucky escape from an industry job as a software engineer. He plans to continue pursuing his interests in typed scripting languages, compile-time metaprogramming, and teaching in his position as a visiting professor at Swarthmore College starting in the Fall semester.

Video Recording >>

Distinguished Lecturer

April 23, 2015

The lion’s share of bacteria in various environments cannot be cloned in the laboratory and thus cannot be sequenced using existing technologies. A goal of single-cell genomics (SCG) is to complement gene-centric metagenomic data with whole-genome assemblies of uncultivated organisms. Assembly of SCG data is challenging because of highly non-uniform read coverage and highly elevated levels of chimeric reads/read-pairs. We describe SPAdes, an assembler for both SCG and standard (multicell) assembly that incorporates a number of new algorithmic ideas. We demonstrate that recently developed single-cell assemblers not only enable single-cell sequencing, but also improve on conventional assemblers on their own turf. We further describe (i) TrueSPAdes that assembles accurate and long (10Kb) reads generated by the recently released Illumina TrueSeq technology, (ii) transSPAdes for transcriptome assembly, and (iii) dipSPAdes for assembling highly polymorphic diploid genomes. Finally, we show that the de Bruijn graph assembly approach is well suited to assembling long and highly inaccurate SMRT reads generated by Pacific Biosciences.

Speaker Biography: Dr. Pevzner is Ronald R. Taylor Distinguished Professor of Computer Science and Director of the National Technology Center for Computational Mass Spectrometry at University of California, San Diego. He holds Ph.D. (1988) from Moscow Institute of Physics and Technology, Russia. He was named Howard Hughes Medical Institute Professor in 2006. He was elected the Association for Computing Machinery Fellow (2010) for “contribution to algorithms for genome rearrangements, DNA sequencing, and proteomics” and the International Society for Computational Biology Fellow (2012). He was awarded a Honoris Causa (2011) from Simon Fraser University in Vancouver. Dr. Pevzner has authored textbooks “Computational Molecular Biology: An Algorithmic Approach” in 2000, “Introduction to Bioinformatics Algorithms” in 2004 (with Neal Jones), and “Bioinformatics Algorithms: An Active Learning Approach” in 2014 (with Phillip Compeau).

April 28, 2015

We give algorithms for regression for a wide class of M-Estimator loss functions. These generalize lp-regression to fitness measures used in practice such as the Huber measure, which enjoys the robustness properties of l1 as well as the smoothness properties of l_2. For such estimators we give the first input sparsity time algorithms. Our techniques are based on the sketch and solve paradigm. The same sketch works for any M-Estimator, so the loss function can be chosen after compressing the data.

Joint work with Ken Clarkson.

Speaker Biography: David Woodruff received his Ph.D. from MIT in 2007 and has been a research scientist at IBM Almaden since then. His research interests are in big data, including communication complexity, compressed sensing, data streams, machine learning, and numerical linear algebra. He is the author of the book “Sketching as a Tool for Numerical Linear Algebra”. He received best paper awards in STOC and PODS, and the Presburger award.

Video Recording >>

May 5, 2015

Mobile and ubiquitous computing research has led to new techniques for cheaply, accurately, and continuously collecting data on human behavior that include detailed measurements of physical activities, social interactions and conversations, as well sleep quality and duration. Continuous and unobtrusive sensing of behaviors has tremendous potential to support the lifelong management of mental illnesses by: (1) acting as an early warning system to detect changes in mental well-being, (2) delivering context-aware, personalized micro-interventions to patients when and where they need them, and (3) by significantly accelerating patient understanding of their illness. In this presentation, I will give an overview of our work on turning sensor-enabled mobile devices into well-being monitors and instruments for administering real-time/real-place interventions.

Speaker Biography: Tanzeem Choudhury is an associate professor in Computing and Information Sciences at Cornell University. She directs the People-Aware Computing group, which works on inventing the future of technology-assisted wellbeing. Tanzeem received her PhD from the Media Laboratory at MIT and BS in Electrical Engineering from University of Rochester. Tanzeem was awarded the MIT Technology Review TR35 award, NSF CAREER award and a TED Fellowship. For more information visit: http://pac.cs.cornell.edu and follow the group’s work on twitter @pac_cornell

Video Recording >>

May 19, 2015

Speaker Biography: Benjamin Van Durme is an Assistant Research Professor in Computer Science, and the head of the Natural Language Understanding effort at the Human Language Technology Center of Excellence. His research focuses on methods for extracting implicit and explicit knowledge from human communication. His work ranges from computational natural language semantics to applications of streaming and randomized algorithms for large scale data mining. He is concerned with building large, composable systems, lately focused around the development and use of Concrete, a Thrift-backed schema for constructing HLT system pipelines, with supporting libraries in various languages (http://hltcoe.github.io). This platform grew out of the JHU HLTCOE SCALE 9-10 week Summer workshops he co-led in 2012 (25+ participants) and led in 2013 (40+ participants). He is the co-developer of the largest collection of paraphrases in the world (http://paraphrase.org), the largest released resource of text processed for extraction (decades of articles from newspapers such as the NYTimes), and the fastest system in the world for searching large collections of audio for matching keywords. He has degrees from the University of Rochester (BS, BA ’01, MS ’06, PhD ’09) and Carnegie Mellon University (MS ’04) in Computer Science, Cognitive Science, Linguistics and Language Technologies. He has performed internships at NIST, BAE Systems, Lockheed Martin’s Advanced Technology Laboratory AI Divison (1yr full-time), and Google Research (2 Summers). His research is supported by the NSF, DARPA, Vulcan, and recently by a Google Faculty Award.


May 27, 2015

Topic models, statistical models that describe low-dimensional representations of data, can uncover interesting latent structure in large text datasets and are popular tools for automatically identifying prominent themes in text. This talk will introduce topic models that can encode additional structures such as factorizations, hierarchies, and correlations of topics, and can incorporate supervision and domain knowledge. This is achieved by formulating the Bayesian priors over parameters as functions of underlying components, which can be constrained in various ways to induce different structures. This approach is first introduced through a topic model called factorial LDA, which models a factorized structure in which topics are conceptually arranged in multiple dimensions. Factorial LDA can be used to model multiple types of information, for example topic and sentiment in reviews. We then introduce a family of structured-prior topic models called SPRITE, which creates a unifying representation that generalizes factorial LDA as well as other existing topic models, and creates a powerful framework for building new models. I will also show how these topic models can be used in various scientific applications, such as extracting medical information from forums, measuring healthcare quality from patient reviews, and monitoring public opinion in social media.

Speaker Biography: Michael Paul is a PhD candidate in Computer Science at Johns Hopkins University. Beginning in August 2015, he will be an Assistant Professor of Information Science and Computer Science at the University of Colorado, Boulder. He earned a B.S. in CS from the University of Illinois at Urbana-Champaign in 2009 and an M.S.E. in CS from Johns Hopkins University in 2012. He has received PhD fellowships from Microsoft Research, the National Science Foundation, and the Johns Hopkins University Whiting School of Engineering. His research focuses on exploratory machine learning and natural language processing for the web and social media, with applications to computational epidemiology and public health informatics.


June 1, 2015

Better instruments, faster and bigger supercomputers and easier collaboration and sharing of data in the sciences have introduced the need to manage increasingly large datasets. Advances in high-performance computing (HPC) have empowered many science disciplines’ computational branches. However, many scientists lack access to HPC facilities or the necessary sophistication to develop and run HPC codes. The benefits of testing new theories and experimenting with large numerical simulations have thus been restricted to a few top users. In this dissertation, I describe the “remote immersive analysis” approach to computational science and present new techniques and methods for the efficient evaluation of scientific analysis tasks in analysis cluster environments.

I will discuss several techniques developed for the efficient evaluation of data-intensive batch-queries in large numerical simulation databases. An I/O streaming method for the evaluation of decomposable kernel computations utilizes partial-sums to evaluate a batch-query by performing a single sequential pass over the data. Spatial filtering computations, which use a box filter, share not only data, but also computation and can be evaluated over an intermediate summed volumes dataset derived from the original data. This is more efficient for certain workloads even when the intermediate dataset is computed dynamically. Threshold queries have immense data requirements and potentially operate over entire time-steps of the simulation. An efficient and scalable data-parallel approach evaluates threshold queries of fields derived from the raw simulation data and stores their results in an application-aware semantic cache for fast subsequent retrieval. Finally, synchronization at a mediator, task-parallel and data-parallel approaches for the evaluation of particle tracking queries are compared and examined.

These techniques are developed, deployed and evaluated in the Johns Hopkins Turbulence Databases (JHTDB), an open simulation laboratory for turbulence research. The JHTDB stores the output of world-class numerical simulations of turbulence and provides public access to and means to explore their complete space-time history. The techniques discussed implement core scientific analysis routines and significantly increase the utility of the service. Additionally, they improve the performance of these routines by up-to an order of magnitude or more when compared with direct implementations or implementations adapted from the simulation code.

Speaker Biography: Kalin Kanov was born in Dimitrovgrad, Bulgaria on October 11th, 1982. He graduated with distinction from the University of Virginia in 2006 with a B.A. degree in Astronomy and Physics and a minor in Computer Science. He was awarded the Limber award, given to the most outstanding Astronomy graduate of the class of 2006. After graduating from UVA he interned at NASA’s Goddard Space Flight Center and worked at Perrin Quarles Associates on the development of the Emissions Collection and Monitoring Plan System for the U.S. EPA Clean Air Markets Division.

Kalin enrolled in the Computer Science Ph.D. program at Johns Hopkins University in 2008. His research has focused on the development of methods for the efficient evaluation of batch-queries for large numerical simulation datasets. During internships at Los Alamos National Laboratory and Google he worked on large scientific database systems and evaluation techniques for complex arithmetic expressions over dataset features partitioned into columns.


June 5, 2015

The need to compute an accurate spatial alignment between multiple representations of patient anatomy is a problem that is fundamental to many applications in computer assisted interventional medicine. One class of methods for computing such alignments is feature-based registration, which aligns geometric information of the shapes being registered, such as point-based landmarks or models of shape surfaces. A popular algorithm for feature-based registration is the Iterative Closest Point (ICP) algorithm, which treats one shape as a cloud of points that is registered to a second shape by iterating between point-correspondence and point-registration phases until convergence. In this presentation, a class of “most likely point” variants on the ICP algorithm are presented that offer several advantages over prior ICP-based methods, such as high registration accuracy and the ability to confidently assess the quality of a registration outcome. The proposed methods are based on a probabilistic interpretation of the registration problem, wherein the shape alignment is optimized based on uncertainty models rather than minimizing the Euclidean distance between the shapes, as in ICP. This probabilistic framework is used to model anisotropic uncertainties in the measured shape features and to provide a natural context for incorporating additional features into the registration problem, such as orientations representing shape surface normals. The proposed algorithms are evaluated through simulation- and phantom-based studies, which demonstrate significant improvement in registration outcomes relative to ICP and other state-of-the-art methods.

Speaker Biography: Seth Billings is a PhD candidate in Computer Science at Johns Hopkins University, being advised by Dr. Russell Taylor and co-advised by Dr. Emad Boctor. Seth received his M.S.E degree in Computer Science from Johns Hopkins University and a B.S. degree in Electrical Engineering from Kettering University. Seth’s research at JHU has focused on systems and algorithms for computer assisted surgery. His primary research focus has been the development of feature-based registration algorithms that incorporate probabilistic frameworks for registering oriented feature data and for modeling anisotropic uncertainty in the features being registered. His research experience also includes development of an embedded multispectral light source to limit phototoxicity during retinal surgery, development of control software for the Laparoscopic Assistant Robot System (LARS), and development of a semi-autonomous surgeon-assistance mode for performing robot-assisted, laparoscopic, ultrasound-based elastography imaging with the da Vinci Surgical System. Seth also has prior experience as a full-time product engineer at the Research & Development division of Dematic, where he developed embedded systems for industrial automation following his graduation from Kettering University.


June 15, 2015

Confinement is a security policy that restricts the outward communication of a subsystem to authorized channels. It stands at the border of mandatory and discretionary policies and can be used to implement either. In contrast to most security policies, confinement is composable. In capability-based systems, confinement is validated by a simple decision procedure on newly minted subsystems. However, there is a long-standing debate in the literature as to whether confinement is enforceable in capability-based systems. All previous attempts to demonstrate confinement have arrived at negative results, either due to flawed system models or to proof errors that have not survived inspection.

This presentation describes SDM: a formal, general, and extensible system model for a broad class of capability-based systems. SDM includes: 1) a mechanical formalization for reasoning about capability-based systems that produces a machine-checked proof of the safety problem, 2) the construction of a system-lifetime upper-bound on potential information flow based on the safety property, and 3) an embedding of the confinement test for capability-based systems and the first mechanically verified proof that such systems support confinement. All proofs in SDM are constructed using the Coq proof assistant without using or declaring axioms that are not part of the core logic. In consequence, there is no portion of the specification which relies on an uninstantiable assertions. SDM further distinguishes itself from other efforts by enabling the formal specifications of security-enforcing applications to be embedded without being injected into the system semantics.

Speaker Biography: M. Scott Doerrie is a Ph.D. candidate in Computer Science at Johns Hopkins University advised by Jonathan Shapiro. His cross-domain research draws from operating systems, access control mechanisms, machine-assisted verification, safe language run-times, and advanced type systems. He is specifically interested in building mechanisms to produce secure systems.