Fall 2017


September 11, 2017

Data streams have emerged as a natural computational model for numerous applications of big data processing. In this model, algorithms are assumed to have access to a limited amount of memory and can make only a single pass (or a few passes) over the data, but must produce sufficiently accurate answers for some objective functions on the dataset. This model captures various real-world applications and has stimulated new scalable tools for solving important problems in the big data era. In my PhD study, I focus on the following three aspects of the streaming model.

  1. Understanding the capability of the streaming model. For a vector aggregation stream, i.e., when the stream is a sequence of updates to an underlying n-dimensional vector v (for very large n), we establish nearly tight space bounds for streaming algorithms approximating functions of the form \sum_{i=1}^n g(v_i) for nearly all one-variable functions g, and l(v) for all symmetric norms l. These results provide a deeper understanding of the streaming computation model.

  2. Tighter upper bounds. We provide better streaming k-median clustering algorithms for dynamic point streams, i.e., streams of insertions and deletions of points in a discrete Euclidean space ([\Delta]^d for sufficiently large \Delta and d). Our algorithms use k poly(d \log \Delta) space/update time and maintain, with high probability, an approximate k-median solution to the streaming dataset. All previous algorithms for computing an approximation to the k-median problem over dynamic data streams required space and update time exponential in d.

  3. Applications in Machine Learning. We develop an efficient algorithm for stochastic data streams, i.e., streams whose data points are drawn from a distribution. Stochastic streaming algorithms have broad applications in machine learning tasks. We present a novel framework that dynamically obtains the spectral clusters of a stochastic stream generated by a random walk over a network, and we show that our algorithm achieves tight space/sample bounds.
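The single-pass update model in item 1 can be made concrete with a toy (non-sketching) Python example. This sketch stores every nonzero coordinate explicitly, whereas a genuine streaming algorithm would approximate the sum in space polylogarithmic in n; the function names and the example stream are made up for illustration:

```python
from collections import defaultdict

def stream_aggregate(updates, g):
    """One pass over (index, delta) updates to an implicit vector v,
    then evaluate sum_i g(v_i). This toy keeps the nonzero coordinates
    explicitly; actual streaming algorithms use space polylogarithmic
    in the dimension n and only approximate the sum."""
    v = defaultdict(int)
    for i, delta in updates:          # single pass: insertions and deletions
        v[i] += delta
    return sum(g(vi) for vi in v.values() if vi != 0)

# Example: the second frequency moment F2 = sum_i v_i^2
updates = [(3, 1), (7, 2), (3, 1), (7, -1)]   # yields v_3 = 2, v_7 = 1
print(stream_aggregate(updates, lambda x: x * x))  # -> 5
```

Taking g(x) = x^2 gives the classic frequency-moment example; the symmetric-norm case l(v) is analogous but cannot in general be written coordinate-wise.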

Speaker Biography: Lin Yang is currently a PhD student in the Computer Science Department of Johns Hopkins University. He defended his Ph.D. in Physics this May. He obtained his M.S.E. degree in Computer Science in 2015 and received his Bachelor’s degree from Tsinghua University, China. In computer science, he works in the area of sublinear time/space algorithms, in particular algorithms for streaming data analysis. He studies the fundamental complexity of algorithms on data streams and designs new and better algorithms for important problems. Lin’s research papers have been published in top conferences such as STOC, ICML, and PODS. He will be joining the Department of Operations Research and Financial Engineering of Princeton University as a postdoctoral researcher. Lin Yang’s advisors are Professor Vladimir Braverman and Professor Alex Szalay.

Advisor: Vladimir Braverman


September 18, 2017

Among the many medical imaging modalities, including X-ray computed tomography (CT) and magnetic resonance imaging (MRI), medical ultrasound offers the unique advantages of being non-ionizing, real-time, and non-invasive. With its safety, ease of use, and cost-effectiveness, ultrasound imaging has been used in a wide variety of diagnostic applications. Further, with the emergence of photoacoustic imaging, ultrasound and photoacoustic imaging together can depict not only anatomical but also functional information of biological tissue. Photoacoustic imaging is a hybrid imaging modality that merges light and ultrasound, revealing tissue metabolism and molecular distributions through the use of endogenous and exogenous contrast agents.

To broaden the impact of translational ultrasound and photoacoustic imaging, I have devoted my effort to developing enabling technologies and exploring the associated applications. In particular, this dissertation focuses on: (1) Enabling technologies for translational photoacoustic imaging: we investigated how to maximize access to translational photoacoustic imaging by replacing the widely used customized data acquisition systems and expensive high-power lasers with a clinical ultrasound scanner and a low-cost light source. (2) Co-robotic ultrasound and photoacoustic imaging: we introduced a co-robotic ultrasound paradigm to make ultrasound and photoacoustic imaging more comprehensive and capable of imaging deep tissue at higher resolution and over a wider field of view. (3) Advancements in translational photoacoustic imaging: we explored new uses of translational photoacoustic imaging for molecular-based cancer detection and for sensing neurotransmitter activity in the brain.

Speaker Biography: Haichong “Kai” Zhang was born in China and raised in Japan. He received a B.S. and an M.S. in Human Health Sciences (Laboratory Science) from Kyoto University, Japan, in 2011 and 2013, respectively. After completing his master’s degree, he joined the Ph.D. program at The Johns Hopkins University, and received an M.S. in Computer Science in 2015. His research interests include medical imaging related to medical ultrasound, robotics, and photoacoustics. He has authored or presented more than 9 journal articles, 28 conference proceedings, and 12 other abstracts, and holds 4 pending/issued patents.

Video Recording >>

September 21, 2017

In this talk I will summarize a few recent developments in the design and analysis of sequence models. Starting with simple parametric models for sequences, such as HMMs, we look at nonparametric extensions in terms of their ability to model more fine-grained types of state and transition behavior. In particular, we consider spectral embeddings and nonparametric Bayesian models such as the nested Chinese Restaurant Franchise and the Dirichlet-Hawkes process. We conclude with a discussion of deep sequence models for user return-time modeling, time-dependent collaborative filtering, and large-vocabulary user profiling.
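As a reference point for the parametric end of this spectrum, the HMM forward algorithm computes the likelihood of an observation sequence in O(TK^2) time for K hidden states. A minimal NumPy sketch, with toy parameters invented purely for illustration:

```python
import numpy as np

def hmm_forward(pi, A, B, obs):
    """Forward algorithm: likelihood of an observation sequence under an HMM.
    pi:  (K,)   initial state distribution
    A:   (K, K) transition matrix, A[i, j] = P(next state j | state i)
    B:   (K, M) emission matrix,   B[i, o] = P(observation o | state i)
    obs: sequence of observation indices
    """
    alpha = pi * B[:, obs[0]]            # initialize with the first observation
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]    # propagate state mass, absorb next obs
    return alpha.sum()                   # total likelihood P(obs)

# Two hidden states, two observation symbols (toy parameters)
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.9, 0.1], [0.2, 0.8]])
print(hmm_forward(pi, A, B, [0, 1, 0]))
```

A sanity check on any such implementation: the likelihoods of all possible observation sequences of a fixed length must sum to exactly 1.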

September 29, 2017

The objective of the seminar is to impart lessons learned and some thinking about the future of computer science and information technology, as they bear on my computer science career and yours. Beginning with the very early days of programming in BASIC and assembly language, best practices and success concepts have appeared repeatedly, including: “Be So Good They Can’t Ignore You”, “Be prepared”, and “Embrace and Manage Change”. Examples and stories are used to highlight these and other useful skill sets.

Speaker Biography: Jon Weston, BS Business Administration, JHU 1968, is currently the Vice President of Solution Architecture for CACI. Over the last 50 years he has been a computer programmer, instructor, system engineer, director of operations, enterprise architect, product planner/marketing manager, project manager and author/lecturer. During the last 25 years, he has been the system engineer/solution architect at CACI, a 19,000-person computer consulting company. He has designed and built operating system capabilities; designed, developed and delivered computer systems (infrastructure and applications) for business systems, intelligence systems, logistics, process control, networks and products for industry and government. Along the way he has developed and implemented successful software development and systems/services delivery lifecycle methodologies and procedures used across a wide array of customer environments and disciplines including planning, development, integration, operations and maintenance, cyber security, benchmarking, predictive analytics and full program modeling/estimating. He also has been a lacrosse player and coach, including his current position as assistant lacrosse coach at Southwestern University, Georgetown, Texas.

Video Recording >>

Slides >>

Distinguished Lecturer

October 5, 2017

Most information is now published in complex, structured, evolving datasets or databases. As such, there is increasing demand that this digital information should be treated in the same way as conventional publications and cited appropriately. While principles and standards have been developed for data citation, they are unlikely to be used unless we can couple the process of extracting information with that of providing a citation for it. I will discuss the problem of automatically generating citations for data in a database given how the data was obtained (the query) as well as the content (the data), and show how the problem of generating a citation is related to two well-studied problems in databases: query rewriting using views and provenance.

Speaker Biography: Susan B. Davidson received the B.A. degree in Mathematics from Cornell University in 1978, and the M.A. and Ph.D. degrees in Electrical Engineering and Computer Science from Princeton University in 1980 and 1982. Dr. Davidson is the Weiss Professor of Computer and Information Science (CIS) at the University of Pennsylvania, where she has been since 1982, and currently serves as Chair of the board of the Computing Research Association.

Dr. Davidson’s research interests include database and web-based systems, scientific data management, provenance, crowdsourcing, and data citation.

Dr. Davidson was the founding co-director of the Penn Center for Bioinformatics from 1997-2003, and the founding co-director of the Greater Philadelphia Bioinformatics Alliance. She served as Deputy Dean of the School of Engineering and Applied Science from 2005-2007 and Chair of CIS from 2008-2013. She is an ACM Fellow, Corresponding Fellow of the Royal Society of Edinburgh, and received a Fulbright Scholarship and Hitachi Chair in 2004. Her awards include the 2017 IEEE TCDE Impact Award for “expanding the reach of data engineering within scientific disciplines”, and the 2015 Trustees’ Council of Penn Women/Provost Award for her work on advancing women in engineering.

Distinguished Lecturer

October 12, 2017

Probabilistic algorithms for both decision and search problems can offer significant complexity improvements over deterministic algorithms. One major difference, however, is that they may output different solutions for different choices of randomness. This makes correctness amplification impossible for search algorithms and is less than desirable in settings where uniqueness of output is important, such as the generation of system-wide cryptographic parameters, or distributed settings where different sources of randomness are used. Pseudo-deterministic algorithms are a class of randomized search algorithms that output a unique answer with high probability. Intuitively, they are indistinguishable from deterministic algorithms to a polynomial-time observer of their input/output behavior. In this talk I will describe what is known about pseudo-deterministic algorithms in the sequential, sublinear, and parallel settings. We will also briefly describe an extension of pseudo-deterministic algorithms to interactive proofs for search problems, where the verifier is guaranteed with high probability to produce the same output on different executions, regardless of the prover’s strategies. Based on joint work with Gat, Goldreich, Ron, Grossman, and Holden.
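A standard illustration of the concept (my example, not taken from the talk): searching for the smallest prime above a threshold using a randomized primality test. The algorithm flips coins internally, yet with overwhelming probability it outputs the same canonical answer on every run:

```python
import random

def miller_rabin(n, rounds=40):
    """Randomized primality test; error probability at most 4**(-rounds)."""
    if n < 2:
        return False
    for p in (2, 3, 5, 7, 11, 13):
        if n % p == 0:
            return n == p
    d, r = n - 1, 0
    while d % 2 == 0:                 # write n - 1 = d * 2**r with d odd
        d, r = d // 2, r + 1
    for _ in range(rounds):
        a = random.randrange(2, n - 1)
        x = pow(a, d, n)
        if x in (1, n - 1):
            continue
        for _ in range(r - 1):
            x = pow(x, 2, n)
            if x == n - 1:
                break
        else:
            return False              # a witnesses compositeness
    return True

def first_prime_at_least(n):
    """Pseudo-deterministic search: the algorithm is randomized (Miller-Rabin)
    but outputs the canonical answer -- the smallest prime >= n -- with
    overwhelming probability on every run."""
    while not miller_rabin(n):
        n += 1
    return n

print(first_prime_at_least(10**6))    # -> 1000003
```

Two independent runs agree on the output despite using different randomness, which is exactly the input/output behavior a polynomial-time observer cannot distinguish from a deterministic algorithm's.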

Speaker Biography: Shafi Goldwasser is the RSA Professor of Electrical Engineering and Computer Science at MIT. She is also a professor of computer science and applied mathematics at the Weizmann Institute of Science in Israel. Goldwasser’s pioneering contributions include the introduction of interactive proofs, zero-knowledge protocols, hardness-of-approximation proofs for combinatorial problems, and multi-party secure protocols. She was the recipient of the ACM Turing Award for 2012, the Gödel Prize in 1993 and again in 2001, the ACM Grace Murray Hopper award, the RSA award in mathematics, the ACM Athena award for women in computer science, the Benjamin Franklin Medal, and the IEEE Emanuel R. Piore award. She is a member of the AAAS, NAS and NAE.

Video Recording >>

October 31, 2017

We introduce a concept that generalizes several different notions of a “centerpoint” in the literature. We develop an oracle-based algorithm for convex mixed-integer optimization based on centerpoints. Further, we show that algorithms based on centerpoints are “best possible” in a certain sense. Motivated by this, we establish several structural results about this concept and provide efficient algorithms for computing these points. Our main motivation is to understand the complexity of oracle based convex mixed-integer optimization.
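For intuition (a one-dimensional illustration of the classical definition, not the talk’s algorithm): a centerpoint of n points in d dimensions is a point such that every closed halfspace containing it contains at least n/(d+1) of the points. For d = 1 that threshold is n/2, and the median qualifies:

```python
def is_centerpoint(c, points, frac=0.5):
    """In 1-D, c is a centerpoint if every closed ray from c covers at
    least frac of the points (frac = 1/(d+1) = 1/2 for d = 1)."""
    n = len(points)
    left = sum(1 for p in points if p <= c)    # points in the ray (-inf, c]
    right = sum(1 for p in points if p >= c)   # points in the ray [c, +inf)
    return min(left, right) >= frac * n

points = [1, 3, 4, 8, 12, 20, 21]
median = sorted(points)[len(points) // 2]      # the median, 8
print(is_centerpoint(median, points))          # -> True
```

In higher dimensions the guarantee weakens to n/(d+1) rather than n/2, which is part of what makes computing and exploiting centerpoints in oracle-based optimization nontrivial.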

Speaker Biography: Amitabh Basu is an assistant professor in the Dept. of Applied Mathematics and Statistics at Johns Hopkins University since 2013. He was a visiting assistant professor in the Dept. of Mathematics at University of California, Davis, from 2010-2013. He obtained his Ph.D. in 2010 from Carnegie Mellon University with a thesis on mixed-integer optimization. He received an M.S. in Computer Science from Stony Brook University in 2006, and a bachelor’s in Computer Science from Indian Institute of Technology, Delhi in 2004.

Amitabh Basu is the recipient of the NSF CAREER award in 2015. He was one of the three finalists for the A.W. Tucker Prize 2012 awarded by the Mathematical Optimization Society. Basu serves as an Associate Editor for the journals Mathematics of Operations Research and Discrete Optimization. He also currently serves as Vice Chair for the Integer and Discrete Optimization cluster within the INFORMS Optimization Society.

Video Recording >>

November 9, 2017

Lifelong Learning encompasses computational methods that allow systems to learn at runtime and to apply previous learning to new, unanticipated situations. As this sort of computation is found almost exclusively in nature, Lifelong Learning looks to nature for its underlying principles and mechanisms.

Biological-style learning has not been demonstrated in extant computational systems. My DARPA L2M program seeks to construct an operational Lifelong Learning system. We will discuss the requirements for such a system and different computational concepts found in nature, including Super-Turing computation, stochastic and asynchronous communication, continual adaptivity, and interactive computation. While seemingly different, these varied computational attributes are in fact computationally equivalent, implying a shared underlying basis for biological learning and computational Lifelong Learning.

My BINDS lab has been studying neuroscience features and their translations to technology, including memory reconsolidation, oscillatory rhythms, and cognitively transparent interfaces for the last decade. We will discuss one of our recent findings: a property of the human brain connectome that leads to the capability of cognitive abstraction. We will describe our geometric method for massive data analysis, how we used it to parse tens of thousands of records of fMRI experiments, and the import of our results.

Speaker Biography: Dr. Siegelmann is a program manager at the MTO of DARPA, developing programs to advance the fields of Neural Networks and Machine Learning. She is on leave from the University of Massachusetts where she serves as the director of the Biologically Inspired Neural and Dynamical Systems (BINDS) Laboratory. A Professor of Computer Science and Core Member of the Neuroscience and Behavior Program, Siegelmann conducts cutting edge, interdisciplinary research in neural networks, machine learning, computational studies of the brain, intelligence and cognition, big data and industrial/biomedical applications.

Her research into neural processes has led to theoretical modeling and original algorithms capable of superior computation, and to more realistic, human-like intelligent systems. Siegelmann was named the 2016 Donald O. Hebb Award winner from the International Neural Network Society. Her work on an energy constrained brain activation paradigm with relation to performance and diet was awarded one of 16 Whitehouse sponsored 2015 BRAIN initiatives.

Siegelmann’s Super-Turing theory introduced a major alternative computational method. It became a sub-field of computation and the foundation of lifelong machine learning. Super-Turing opens a new way to interpret cognitive processes, as well as disease processes and their reversal. Her modeling of geometric neural clusters resulted in the widely used Support Vector Clustering algorithm, developed with Vladimir Vapnik and colleagues, which specializes in the analysis of high-dimensional, big, complex data. Her neuroinformatics methods are used to identify overarching concepts of brain organization and function. A unifying theme underlying her research is the study of time- and space-dependent dynamical and complex systems. Her work is often interdisciplinary, combining complexity science, computational simulations, biological sciences and healthcare, focusing on modeling human intelligence better and more completely, and spanning medical, military and energy applications. Recent contributions include advanced human-machine interfaces that empower beyond human capabilities, dynamical studies of biological rhythm, and the study of brain structure that leads to abstract thought.

Dr. Siegelmann acts as a consultant internationally with industry, and in education. She remains very active in supporting young researchers and encouraging minorities and women to enter and advance in STEM.

Video Recording >>

November 16, 2017

Deep learning methods have received much attention in research on 3D object recognition. Due to the lack of training data, many researchers use pre-trained convolutional neural networks (CNNs) and either extract the output of one of the last layers as features or fine-tune the networks on their data. Because of the strong features that can be obtained from a CNN, we use them for 3D object recognition in robotics, together with active learning and the Mondrian forest classifier. We achieve superior results with a method that fine-tunes a CNN before feature extraction for RGB data. Combined with features extracted from depth data and a reduction of the features’ dimensionality, we improve the state-of-the-art accuracy on the University of Washington RGB-D Object dataset in the standard offline case, using a support vector machine (SVM). In place of the SVM, we then use active learning and the Mondrian forest, an online classifier that can be updated over time as more data becomes available. Additionally, in earlier work we presented a novel combination of depth and color features to recognize different object categories in isolation. We also investigate the effect of domain change by training on the RGB-D Object dataset and testing on the DLR-RGB-D dataset. Our experiments show that a domain change can have a significant impact on the model’s accuracy, and we present results showing that accuracy improves when the variability of the objects in the training domain is increased.
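The pattern of swapping an offline SVM for an online classifier that absorbs new labeled data over time can be sketched as follows. This is not the authors’ pipeline: scikit-learn has no Mondrian forest, so `SGDClassifier` with `partial_fit` (an online linear classifier trained on the SVM hinge loss) stands in for it, and synthetic Gaussian blobs stand in for CNN features:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
N_CLASSES, DIM = 3, 64
centers = rng.normal(size=(N_CLASSES, DIM))     # fixed class prototypes

def fake_cnn_features(n):
    """Stand-in for CNN feature extraction: class-dependent Gaussian blobs."""
    y = rng.integers(0, N_CLASSES, size=n)
    X = centers[y] + 0.3 * rng.normal(size=(n, DIM))
    return X, y

# Initial fit on a first labeled batch (the "offline" starting point)
clf = SGDClassifier(loss="hinge", random_state=0)   # hinge = linear-SVM objective
X0, y0 = fake_cnn_features(200)
clf.partial_fit(X0, y0, classes=np.arange(N_CLASSES))

# Online updates as more labeled data arrives over time
for _ in range(5):
    Xb, yb = fake_cnn_features(50)
    clf.partial_fit(Xb, yb)

X_test, y_test = fake_cnn_features(100)
print("accuracy:", clf.score(X_test, y_test))
```

The design point this illustrates: unlike an SVM retrained from scratch, an online classifier only ever sees each new batch once, which matches the incremental data arrival the abstract describes.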

Speaker Biography: Dr. Haider Ali is currently serving as an Associate Research Scientist at the Center for Imaging Science (CIS), Johns Hopkins University. Before joining CIS he worked as a Senior Researcher at the Institute of Robotics and Mechatronics (RM) of the German Aerospace Center (DLR). His research is primarily focused on developing efficient 3D object and activity recognition methods for real-time robotic applications. He received his Bachelor of Science in Computer Science from Bahauddin Zakariya University, one of Pakistan’s major universities, in 1998. He then served several multinational IT companies in Pakistan as a Software Engineer and Project Consultant until 2004. Thereafter he pursued a master’s degree in software technology at Leuphana University of Lueneburg, graduating in 2006. He received his Ph.D. from the Technical University of Vienna in 2010.

Video Recording >>

November 30, 2017

With recent “omics” technology advances (genomics and otherwise), it is becoming more important to identify the most effective approaches to caring for patients by bringing population datasets together with the best research evidence. In order to decipher the meaning of data collected from various sources (e.g., electronic health records, sensory technologies), we will need to develop linkages between drug and disease mechanisms and the influence of variation in the genes involved in those mechanisms. Two challenges in leveraging current approaches to use such diverse data sources are deciphering the meaning and value of data from multiple sources, and designing effective approaches to deliver new evidence. Example ways to address these challenges, through the iterative evaluation of methods that combine data from multiple sources and by enabling the replicability and reproducibility of genomic variant interpretations, will be described.

Speaker Biography: Dr. Casey Overby Taylor is Assistant Professor in the Divisions of General Internal Medicine and Health Sciences Informatics in the Department of Medicine at the Johns Hopkins University. She is also a Fellow in the Johns Hopkins Malone Center for Engineering in Healthcare. Her research interests span a number of areas at the intersection of public health genomics and translational bioinformatics, including applications that support translation of genomic research to clinical and population-based healthcare settings and delivering health information to the public. Dr. Taylor is co-Chair of the electronic health record implementation workgroup on the NHGRI-funded Electronic Medical Records and Genomics (eMERGE) Network, and an investigator on the NCATS Biomedical Data Translator grant. She also received a grant from AHRQ to explore the usefulness of a model to implement genomic clinical decision support that engages stakeholders and uses open source decision support platforms.

Video Recording >>

Distinguished Lecturer

December 5, 2017

A growing disparity between supercomputer computation speeds and I/O rates makes it increasingly infeasible for applications to save all results for offline analysis. Instead, applications must analyze and reduce data online so as to output only those results needed to answer target scientific question(s). This change in focus complicates application and experiment design and introduces algorithmic, implementation, and programming model challenges that are unfamiliar to many scientists and that have major implications for the design of various elements of supercomputer systems. I review these challenges and describe methods and tools that various groups, including mine, are developing to enable experimental exploration of algorithmic, software, and system design alternatives.
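The kind of online reduction the talk describes can be as simple as maintaining summary statistics in a single pass instead of writing raw output to disk; a generic sketch (the simulated "output" here is an arbitrary stand-in):

```python
import math

class RunningStats:
    """Single-pass (Welford) mean/variance/max: O(1) memory per tracked
    quantity, so simulation output can be summarized online rather than
    saved in full for offline analysis."""
    def __init__(self):
        self.n, self.mean, self.m2, self.max = 0, 0.0, 0.0, -math.inf

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)   # Welford's numerically stable update
        self.max = max(self.max, x)

    @property
    def variance(self):
        return self.m2 / self.n if self.n else 0.0

stats = RunningStats()
for step in range(100_000):                 # stand-in for simulation timesteps
    stats.update(math.sin(step * 0.01))     # stand-in for a simulated quantity
print(stats.n, round(stats.mean, 4), round(stats.variance, 4))
```

The trade-off is the one the talk highlights: the scientist must decide before the run which reduced quantities will answer the target question, since the raw data is never retained.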

Speaker Biography: Ian Foster is a Professor of Computer Science at the University of Chicago and a Senior Scientist and Distinguished Fellow at Argonne National Laboratory. Originally from New Zealand, he has lived in Chicago for longer than he likes to admit. Ian has a long record of research contributions in high-performance computing, distributed systems, and data-driven discovery. He has also led US and international projects that have produced widely used software systems and scientific computing infrastructures. He has published hundreds of scientific papers and eight books on these and other topics. Ian is an elected fellow of the American Association for the Advancement of Science, the Association for Computing Machinery, and the British Computer Society. His awards include the British Computer Society’s Lovelace Medal, the IEEE Tsutomu Kanai award, and honorary doctorates from CINVESTAV, Mexico, and the University of Canterbury, New Zealand.