Summer Research Expeditions (SRE) in Computational Sciences, Systems and Engineering: Project List

Please select your top 2 projects from the list below and list the project numbers in your application.


Project 1: Machine Learning for Personal Genomics
Your personal genome has hundreds of variants (places where your DNA differs from most people's) that have been seen in only a few other individuals, or are even completely unique to you. These “rare” variants help make you who you are: while some of them have no consequences, others contribute to how you look and even to your predisposition to disease. However, current methods are not yet good at accurately predicting the impact of rare variants. Traditional association studies simply look for correlations between disease traits and common genetic variants that are present in many people, so these studies can’t tell us anything about a variant that is unique to you. We are building machine learning methods to help interpret the personal genome. The REU student will help analyze and apply machine learning methods to datasets including hundreds of individual genomes.
Faculty Advisor: Dr. Alexis Battle


Project 2: Machine Learning and Translation of 500+ World Languages
Current human language processing systems tend to focus on fewer than 50 languages. We will seek to dramatically expand our ability to analyze and translate more than 500 others via massively multilingual bridge learning from parallel translations of the Bible, Wikipedia and other unique resources.
Faculty Advisor: Dr. David Yarowsky


Project 3: Deep Multi-view Representation Learning 
Often the success of a machine learning project depends on the choice of features used. Machine learning has made great progress in training classification, regression and recognition systems when "good" representations, or features, of input data are available. However, much human effort is spent on designing good features, which are usually knowledge-based and engineered by domain experts over years of trial and error. A natural question to ask then is “Can we automate the learning of useful features from raw data?” Representation learning techniques aim at discovering better representations of inputs by learning transformations of data that disentangle factors of variation in the data while retaining most of the information. The success of such data-driven approaches to feature learning depends not only on how much data we can process but also on how well the features that we learn correlate with the underlying unknown labels (the semantic content in the data). This project will focus on multi-view representation learning techniques for automatic feature learning, where more than one view is available at the time of training, and will explore applications to various tasks in speech and language processing as well as computational healthcare.
Required Skills: Programming (C++, Python and/or Matlab), mathematical maturity 
Preferred: Machine learning and linear algebra coursework  
Faculty Advisor: Dr. Raman Arora


Project 4: Large-scale Kernel Methods 
Kernel methods provide a rich framework for nonlinear representation learning and have long been studied both theoretically and empirically. However, they have fallen out of favor recently owing to the computational bottlenecks of scaling them up to large datasets: a naive implementation would require O(n^2) storage and O(n^3) computation, where n is the size of the training set. This project will build on recent advances in scaling up kernel methods. We will investigate various scalable approaches, including greedy basis selection, the Nyström approximation and randomized projections, and will explore novel stochastic approximation algorithms. We will compare these scalable approaches with deep neural networks on various tasks in speech and language processing as well as computational healthcare.
Required Skills: Programming (C++, Python and/or Matlab), mathematical maturity, machine learning and linear algebra coursework
Preferred: Hands-on experience with deep learning and kernel methods 
Faculty Advisor: Dr. Raman Arora
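
As a rough illustration of the randomized-projection idea mentioned in Project 4, the sketch below uses random Fourier features to approximate an RBF kernel with an explicit, low-dimensional feature map, so that kernel values become ordinary dot products and the O(n^2) kernel matrix never has to be stored. The choice of random Fourier features, the RBF kernel, and all names (rbf_kernel, make_feature_map, gamma, num_features) are assumptions made for illustration; they are not taken from the project itself.

```python
import math
import random


def rbf_kernel(x, y, gamma):
    """Exact RBF kernel: exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def make_feature_map(dim, num_features, gamma, seed=0):
    """Build a random Fourier feature map z(x) with z(x).z(y) ~ rbf_kernel(x, y).

    Hypothetical helper for illustration only.
    """
    rng = random.Random(seed)
    # Frequencies are drawn from N(0, 2 * gamma) per coordinate, which is the
    # Fourier transform of the RBF kernel exp(-gamma * ||x - y||^2).
    scale = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, scale) for _ in range(dim)]
               for _ in range(num_features)]
    # Random phase offsets, uniform on [0, 2*pi].
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(num_features)]
    norm = math.sqrt(2.0 / num_features)

    def feature_map(x):
        return [
            norm * math.cos(sum(w_j * x_j for w_j, x_j in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return feature_map


def dot(u, v):
    """Plain dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


if __name__ == "__main__":
    x = [0.1, 0.2, 0.3]
    y = [0.4, -0.1, 0.0]
    feature_map = make_feature_map(dim=3, num_features=2000, gamma=0.5, seed=1)
    print("exact :", rbf_kernel(x, y, 0.5))
    print("approx:", dot(feature_map(x), feature_map(y)))
```

With D random features, each data point is stored as a length-D vector, so storage grows as O(nD) rather than O(n^2); the approximation error shrinks roughly like 1/sqrt(D).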


Project 5: Bayesian Optimization for Parameter-free Inference
In our past work [1,2], we have developed probabilistic models for personalizing the treatment and diagnosis of expensive and deadly chronic diseases. These models integrate different data sources (measurements typically collected during routine visits) to perform online inference about a patient's future disease course given their history. However, as is common in many practical applications of machine learning, these models often contain several "knobs" (hyperparameters) that govern model performance. This makes it challenging to apply these models rapidly to new diseases and populations. In recent years, Bayesian optimization has emerged as a technique for addressing this issue. In this project, you will learn about existing work in Bayesian optimization. You will then have the opportunity to implement and adapt these techniques to our setting. You will also have the opportunity to study state-of-the-art models for personalization used in healthcare engineering.

[1] P.F. Schulam, F.M. Wigley, and S. Saria. Clustering longitudinal clinical marker trajectories from electronic health data: Applications to phenotyping and endotype discovery. In Twenty-Ninth AAAI Conference on Artificial Intelligence, 2015.
[2] P.F. Schulam and S. Saria. A framework for individualizing predictions of disease trajectories by exploiting multi-resolution structure. In Advances in Neural Information Processing Systems, pages 748–756.

Required Skills: Students should have taken one or more classes in probability and statistics, and ideally an introductory course in machine learning.
Faculty Advisor: Dr. Suchi Saria


Project 6: Deep Learning for Streaming Clinical Time Series
Many deadly conditions in the hospital could be prevented if doctors were able to act on them early enough. However, doctors currently do not have access to the right set of tools to do so. We have developed a preliminary system for analyzing the streaming time series routinely collected to monitor a patient's health. In this project, our goal is to employ recently developed deep learning techniques for sequential data to learn informative feature representations from these time series. The learned feature representations will be used to build early detection algorithms. During this summer, you will get the chance to implement and adapt deep learning techniques for time series data.
Required Skills: Introductory classes in statistics and machine learning and fluency in Python or another programming language.
Faculty Advisor: Dr. Suchi Saria


Project 7: Systems for Implementing Machine Learning Algorithms in Healthcare
Healthcare spending in the US is nearing $3 trillion per year. Our mission is to develop statistical and computational tools that leverage electronic medical records (EMR) to tailor decision making in healthcare, with the ultimate goal of lowering costs and improving quality. One example is providing a real-time early warning score for deadly adverse conditions such as sepsis [1]. We are looking for students who are interested in healthcare data mining and machine learning, or who are excited to use state-of-the-art big data platforms to scale up disease prediction systems to millions of patients. Students should have a strong background in Python and SQL. Through this project, you will get experience working with detailed, large-scale clinical datasets and learn about machine learning techniques that are commonly applied to these datasets. From a data engineer's perspective, you will learn the process of developing a data-driven system (i.e., gathering and cleaning data, learning about the domain, implementing models, error analysis, and so on); meanwhile, you will be able to contribute to scaling up the system to support millions of patients.

[1] Henry, Katharine E., et al. "A targeted real-time early warning score (TREWScore) for septic shock." Science Translational Medicine 7.299 (2015): 299ra122.

Required Skills: Students should have taken one or more classes in probability, statistics, or machine learning. Having also taken computer systems courses (e.g., parallel computing) is a big plus.
Faculty Advisor: Dr. Suchi Saria


Project 8: Computational Biology Projects
NOTE: This project is only open to domestic applicants. High School students are eligible to apply to this project only. Please see the link below for more information.  
(Please refer to for information on stipend and benefits for this particular project listing. Special note: only 1 recommendation letter needs to be submitted if the applicant is applying ONLY to Project 8; otherwise, 2 recommendation letters need to be submitted.) Computational Biology internships will provide you with hands-on research experience as part of ongoing research projects with bioinformatics and genomics faculty in the Departments of Biomedical Engineering, Computer Science, Biostatistics, and Biology, and the McKusick-Nathans Institute of Genetic Medicine. Possible projects include analysis of high-throughput DNA sequence data to characterize genes and their variations, studies of the human microbiome, assembly of whole-genome shotgun data from various species, and the development of new computational and statistical methods. If accepted to the program, you will be assigned a mentor who will determine your specific project. Past projects in Computational Biology can be found on the CCB internship page at
Faculty Advisors: Dr. Steven Salzberg, Jeff Leek, Kasper Hansen, Hongkai Ji, Dan Arking, Liliana Florea, James Taylor, Mihaela Pertea, Joel Bader, Loyal Goff, Ben Langmead, Andy McCallion

Project 9: Streaming Algorithms for Big Data
Mobile computing and the internet have driven down the cost of data acquisition. Simultaneously, cloud computing promises inexpensive and flexible storage. Engineers and scientists are finding data where none existed before. Our ability to generate and record data will quickly outpace our ability to query it effectively. The streaming model of computation was created to address these challenges. Essentially, a stream is a very long list of data, for example months of internet traffic at a router, measurements from a particle collider, or a corpus of corporate email. Network monitoring, fraud detection, scientific analysis, and other applications in streaming settings can only be addressed with new algorithms. Massive data sets should be augmented with statistical surrogates that are easy to compute, require little storage, and can be queried quickly. Rather than querying the raw data, a more efficient strategy is to store sketches, which are flexible summary statistics. Over the past twenty years, researchers have created sketching algorithms for many statistics. These algorithms have even had impact beyond the streaming model, for example in algorithms for metric embeddings and the Sparse Fast Fourier Transform. However, major open questions remain. Chief among them is: How does one find the optimal sketch for a given statistic?
Faculty Advisor: Dr. Vladimir Braverman
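
To make the notion of a sketch in Project 9 concrete, the example below implements a Count-Min sketch, one well-known sketching algorithm for approximate frequency counts over a stream. It is offered only as a minimal illustration under stated assumptions: the project description does not name a specific sketch, and the class name, the width and depth defaults, and the use of salted BLAKE2b hashes as the row hash functions are all choices made here.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter using a fixed-size table of counters.

    Estimates never undercount; they may overcount when items collide.
    """

    def __init__(self, width=256, depth=4):
        self.width = width    # counters per row
        self.depth = depth    # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # One hash function per row, obtained by salting BLAKE2b with the row id.
        digest = hashlib.blake2b(
            item.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        """Record `count` occurrences of `item` in every row."""
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        """Return the smallest counter for `item` across rows."""
        return min(
            self.table[row][self._index(item, row)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    stream = ["alice", "bob", "alice", "carol", "alice", "bob"]
    sketch = CountMinSketch()
    for sender in stream:
        sketch.add(sender)
    print("estimated count for alice:", sketch.estimate("alice"))
```

The memory used is width x depth counters regardless of the stream length, which is the trade-off sketches make: a small, fixed summary in exchange for approximate answers.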


Project 10: Novel User Interfaces for Industrial Robots
The student will program an Android app to interact with the UR5 industrial robot. The app will allow users to train the robot by sending commands, telling the robot to servo to new positions, recording waypoints, and so on.
Faculty Advisor: Chris Paxton/Dr. Greg Hager

Project 11: Training Robots in Virtual Reality
The student will develop and improve a virtual reality environment for robot teaching using the Oculus Rift. They will build user interfaces for controlling and interacting with different robots, and integrate a physics simulation of the robot.
Faculty Advisor: Chris Paxton/Dr. Greg Hager


Project 12: Object Detection
Students will help implement efficient object detection to allow robots to interact with their world. In particular, students will implement a fast, parallelized version of a state-of-the-art object detection algorithm.
Faculty Advisor: Chris Paxton/Dr. Greg Hager


Computational Sensing and Medical Robotics REU Program