Summer Research Expeditions (SRE) in Computational Sciences, Systems and Engineering: Project List

Please select your top 2 projects from the list below and list the project numbers in your application.


Project 1: Computational Semantics and Question Answering

Prof. Van Durme and his lab pursue a variety of projects with one of two themes: helping people get answers to their questions, or asking people what something means so that computers can be taught to better understand language in the future. Projects are available in this space with a linguistics, systems-development, or machine learning focus.

Faculty Advisor: Dr. Ben Van Durme


Project 2: Using Machine Learning for Information Extraction from Clinical Notes

Clinical notes contain rich unstructured information about a patient's condition during their hospital stay. Critical information like patient history, qualitative observations, and diagnoses may only be recorded in the clinical notes. However, the unstructured and highly technical nature of the notes makes this information hard to extract automatically. Moreover, most existing natural language processing frameworks are not adept at handling medical text. In this project we will use machine learning techniques to develop a pipeline for automatically extracting information from clinical notes that could later be used in building disease modeling systems. A good background in programming (at least one class in object-oriented programming and/or familiarity with C++, Java, or Python) is recommended. While useful, a medical background is not required. A class in natural language processing or experience in implementing machine learning algorithms (even as a hobby project) will be seen as a plus but is not required. Most of all, we want to work with someone who is a self-starter and is eager to learn and deploy machine learning algorithms.

Faculty Advisor: Dr. Suchi Saria


Project 3: Streaming and Sketching Algorithms for Big Data  

Mobile computing and the internet have driven down the cost of data acquisition. Simultaneously, cloud computing promises inexpensive and flexible storage. Engineers and scientists are finding data where none existed before. Our ability to generate and record data will quickly outpace our ability to query it effectively. The streaming model of computation was created to address these challenges. Essentially, a stream is a very long list of data, for example months of internet traffic at a router, measurements from a particle collider, or a corpus of corporate email. There is a need for network monitoring, fraud detection, scientific analysis, and other applications in streaming settings that can only be addressed with new algorithms. Massive data sets should be augmented with statistical surrogates (aka sketches) that are easy to compute, require little storage, and can be queried quickly. Over the past twenty years researchers have created streaming and sketching algorithms for many statistics. These algorithms have even had impact beyond the streaming model, for example in algorithms for metric embeddings and the Sparse Fast Fourier Transform. Students will have a chance to work on challenging algorithmic problems in the area of Big Data. Examples of possible projects include (1) low rank approximation and regression on very large matrices, (2) matching problems on massive graphs, (3) streaming and sketching space complexity of norms and functions in high-dimensional space, and (4) applying chaining methods to frequency-based functions. Required Skills: A strong background in probability, linear algebra, analysis, and algorithms.
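
As a rough illustration of what a sketch looks like in practice, the following is a minimal Count-Min sketch, one well-known sketching algorithm for approximate item frequencies. It is offered only as an example of the general idea and is not necessarily one of the algorithms studied in this project. The class name, the table sizes, and the sample stream of addresses are all assumptions made for illustration, and the sketch assumes every update adds a non-negative count.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter backed by a fixed depth x width table."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _column(self, item, row):
        # One salted hash per row; the row index serves as the salt.
        digest = hashlib.blake2b(
            item.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        # Each item touches exactly one counter in every row.
        for row in range(self.depth):
            self.table[row][self._column(item, row)] += count

    def estimate(self, item):
        # Collisions can only inflate a counter, so the minimum over rows
        # is the tightest estimate and never falls below the true count.
        return min(
            self.table[row][self._column(item, row)] for row in range(self.depth)
        )


# Hypothetical stream: one heavy source among many light ones.
stream = ["10.0.0.1"] * 100 + ["10.0.0.2"] * 5 + [f"host-{i}" for i in range(500)]
sketch = CountMinSketch()
for address in stream:
    sketch.add(address)

heavy_estimate = sketch.estimate("10.0.0.1")
```

The table occupies the same memory no matter how long the stream grows, and a query reads only one counter per row. The price is that estimates can overshoot the true count when items collide; widening the table reduces the overshoot.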

Faculty Advisor: Dr. Vladimir Braverman


Project 4: Machine Learning in Natural Language Processing  

Recovering the structure of a sentence -- or the structure of an entire language -- involves simultaneously reasoning about many competing influences on the answer. To guide such reasoning, our lab builds probabilistic models of various linguistic phenomena. We develop methods for probabilistic inference in these models, and empirically investigate different approaches to training their parameters using various kinds of evidence. Various projects may be available depending on the student's interests and background. Desirable skills: Algorithm design (especially dynamic programming), discrete and continuous optimization, linguistics or NLP, prob/stats, ML topics such as graphical models and deep learning.  

Faculty Advisor: Dr. Jason Eisner


Project 5: Extensible Syntax for Programming Languages

In a specific software project, it is often convenient to define some concise and readable notation to express data or algorithms. Some programming languages offer a limited ability to define new notation within the language. This project will develop a powerful and flexible approach, in which the user can easily define new constructions that are added to the language. As tokenization and parsing proceed from left to right, conflicts between competing constructions will be resolved using general principles and user declarations. The ideas are a hybrid of natural language parsing and programming language parsing. Required skills: Strong software engineering skills, design instincts, familiarity with left-to-right parsing algorithms such as LR(k) and Earley's algorithm.

Faculty Advisor: Dr. Jason Eisner


Project 6: Sparse Causal Error Analysis

Data scientists often want to summarize important patterns of discrepancy: Where do a machine learning system's predictions fail to match the truth? Which demographics had different life expectancies in 2016 than in 2006? Why did Hillary Clinton fall short of 270 electoral votes? More deeply, one may wonder: What caused these discrepancies, or equivalently, what is a "simple" set of changes that would eliminate those discrepancies? Typically these questions are investigated in an ad hoc manner. We have developed an automatic, computationally intensive technique for identifying a sparse and coherent set of "interventions" that would change a causal system's output in a desired way. This summer project will test and refine the technique on a variety of real datasets, and develop software for end users to study their own datasets. Required skills: Ability to design and implement beautiful visualizations and user interfaces. Familiarity with mathematical modeling, continuous optimization, and/or machine learning. Interest in exploring large datasets.

Faculty Advisor: Dr. Jason Eisner


Project 7: Reinforcement Learning for Educational Technology

This project is building personalized educational technology for foreign language learning. Vygotsky (1934) postulated that a student learns from stimuli that are neither too easy nor too hard. We employ a model of the student's understanding and learning -- assuming that the student behaves roughly like an AI system with some unknown parameters. By challenging the student with new tasks, we figure out what they currently know, how they learn, and how we can therefore best challenge them in future to cause them to learn things that they don't yet know. This is an ongoing project with several theoretical and practical aspects, which could use various kinds of help from a summer research student. Required skills: Reinforcement learning, user interface design.

Faculty Advisors: Dr. Jason Eisner, Dr. Philipp Koehn


Project 8: Stochastic Approximation for Subspace and Multiview Representation Learning

Unsupervised learning of useful features, or representations, is one of the most basic challenges of machine learning. Unsupervised representation learning techniques capitalize on unlabeled data, which is often cheap and abundant and sometimes virtually unlimited. This project aims to develop new theory and methods for representation learning that can easily scale to large datasets. In particular, this project is concerned with methods for large-scale unsupervised feature learning, including Principal Component Analysis (PCA) and Partial Least Squares (PLS). To capitalize on massive amounts of unlabeled data, we will develop appropriate computational approaches and study them in the "data-laden" regime. Therefore, instead of viewing representation learning as a dimensionality reduction technique and focusing on an empirical objective on finite data, these methods are studied with the goal of optimizing a population objective based on samples. This view suggests using Stochastic Approximation approaches, such as Stochastic Gradient Descent (SGD) and Stochastic Mirror Descent, that are incremental in nature and process each new sample with a computationally cheap update. Furthermore, this view enables a rigorous analysis of the benefits of stochastic approximation algorithms over traditional finite-data methods. The project aims to develop stochastic approximation approaches to PCA, PLS, and related problems and extensions, including nonlinear and sparse variants, and to analyze these problems in the data-laden regime. Required skills: Programming (C++, Python and/or Matlab), mathematical maturity, machine learning and linear algebra coursework. Preferred: Hands-on experience with kernel methods.
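
To make the idea of a cheap per-sample update concrete, here is a minimal sketch of Oja's rule, a classical stochastic approximation scheme for the leading principal component. It is shown only as an illustration and is not claimed to be the method developed in this project. The function name, the constant step size, and the synthetic data stream are assumptions, and the samples are assumed to be zero-mean.

```python
import math
import random


def oja_top_component(samples, dim, step_size=0.01):
    """Estimate the leading principal direction from a stream of samples."""
    w = [1.0 / math.sqrt(dim)] * dim  # arbitrary unit-length starting point
    for x in samples:
        projection = sum(xi * wi for xi, wi in zip(x, w))
        # Stochastic gradient step on the variance captured along w.
        w = [wi + step_size * projection * xi for wi, xi in zip(w, x)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        w = [wi / norm for wi in w]  # renormalize to unit length
    return w


# Hypothetical zero-mean data: high variance along one direction, small noise elsewhere.
rng = random.Random(0)
true_direction = [0.6, 0.8, 0.0]


def synthetic_stream(n):
    for _ in range(n):
        scale = rng.gauss(0.0, 3.0)
        yield [scale * u + rng.gauss(0.0, 0.1) for u in true_direction]


estimate = oja_top_component(synthetic_stream(2000), dim=3)
# Absolute cosine between the estimate and the planted direction.
alignment = abs(sum(e * u for e, u in zip(estimate, true_direction)))
```

Each sample is seen once and then discarded, and the update costs time linear in the dimension, so no covariance matrix is ever formed or stored.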

Faculty Advisor: Dr. Raman Arora

Project 9: Analysis of DNA and RNA Sequences, from Genes to Genomes to Microbiomes

NOTES: This project is only open to domestic applicants. A very limited number of high school students will be considered for this project. Please see the link below for more information. Only 1 recommendation letter is required for this project. This internship will provide you with hands-on research experience as part of the Salzberg lab in the Center for Computational Biology, located on the School of Medicine campus. Project A examines the use of high-throughput DNA and RNA sequencing of samples collected from patients suffering from different types of infections. The goal is to develop computational methods that allow us to identify the cause of the infection, after searching through comprehensive genomic databases of bacteria, viruses, and other pathogens. Project B involves the assembly of large genomes using the latest DNA sequencing technology from Illumina, PacBio, and Oxford Nanopore. We have multiple plant and animal genome projects under way, and interns will help with the assembly and analysis of one of these genomes. Project C involves the detection and measurement of genes expressed in different tissues through direct RNA sequencing (known as RNA-seq). Interns will learn about our software for assembly and quantification of RNA-seq data and will help to improve the algorithms and test them on current data sets. Please refer to for more information about the lab, and to for information on past internships in the Center for Computational Biology. Required skills: Expertise in either the Python or Perl programming language, and experience with the Unix operating system, either Linux or Mac.

Faculty Advisors: Dr. Steven Salzberg, Dr. Mihaela Pertea, and members of their labs.



Computational Sensing and Medical Robotics REU Program