Scientists increasingly find themselves in a paradoxical situation: in a never-ending quest to collect data, they are collecting more than they can handle. This growth comes from three main sources: better instrumentation, improved simulations, and increased data sharing among scientists. This talk describes a variety of techniques that support the exploration and analysis of large scientific datasets, drawn from experience working with two domain sciences: computational fluid dynamics and estuarine science.
First, we discuss the JHU Turbulence Database Cluster, an environment for the exploration of turbulent flows. We provide a web-service interface for accessing the complete space-time history of a Direct Numerical Simulation. This service gives researchers from around the world the tools needed for spatial and temporal exploration of the simulation. In this talk, we will discuss the overall system design and prototypical applications. We will also discuss the details of implementation, including hierarchical spatial indexing, cache-sensitive spatial scheduling of batch workloads, and localizing computation through data partitioning.
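As a rough illustration of what a hierarchical spatial index can look like, the sketch below interleaves the bits of 3-D grid coordinates into a Z-order (Morton) key. The abstract does not name the indexing scheme used by the Turbulence Database Cluster, so the choice of Z-order, the function name morton3d, and the 10-bit coordinate width are all assumptions made for the example.

```python
def morton3d(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the low `bits` bits of x, y, z into one Z-order key.

    Hypothetical helper for illustration only; not taken from the
    Turbulence Database Cluster itself.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x bit goes to position 3i
        code |= ((y >> i) & 1) << (3 * i + 1)  # y bit goes to position 3i+1
        code |= ((z >> i) & 1) << (3 * i + 2)  # z bit goes to position 3i+2
    return code


# Points in the same 2x2x2 block share every key bit above the lowest three,
# so dropping 3 bits per level walks up an octree-like hierarchy.
a = morton3d(2, 3, 2)
b = morton3d(3, 2, 3)
same_block = (a >> 3) == (b >> 3)
```

Because nearby grid points tend to receive nearby keys, sorting data by such a key keeps spatial neighbors close together in storage, which is the property that batch scheduling and data partitioning can then exploit.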
We will also discuss work to improve queries among multiple scientific datasets from the Chesapeake Bay as part of the CBEO project. We developed new data indexing and query processing tools that improve the efficiency of comparing, correlating, and joining data in non-convex regions. We use computational geometry techniques to automatically characterize the space from which data are drawn, partition the region based on that characterization, and then create an index from the partitions. In the case of the Chesapeake Bay, our technique ensures that all data from a given tributary (e.g., the Potomac River) will occupy contiguous regions of the index, which makes the data from these regions contiguous on disk.
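The idea of building an index from partitions can be sketched as follows. This is a minimal illustration, not the CBEO implementation: it assumes the partitions are already given as simple polygons (here two hand-made shapes labeled "potomac" and "main_stem"), whereas the work described above derives them automatically. The names contains, partition_id, and build_index are hypothetical.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Polygon = List[Point]


def contains(poly: Polygon, p: Point) -> bool:
    """Ray-casting point-in-polygon test; works for non-convex polygons."""
    x, y = p
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Only edges that straddle the horizontal line through p can cross the ray.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def partition_id(partitions: List[Tuple[str, Polygon]], p: Point) -> int:
    """Return the position of the first partition containing p."""
    for pid, (_, poly) in enumerate(partitions):
        if contains(poly, p):
            return pid
    return len(partitions)  # catch-all bucket for points outside every partition


def build_index(partitions: List[Tuple[str, Polygon]], samples: List[Point]) -> List[Point]:
    """Order samples by partition first, so each partition's data is contiguous."""
    return sorted(samples, key=lambda p: (partition_id(partitions, p), p))


# Hypothetical partitions: an L-shaped (non-convex) region and a square beside it.
partitions = [
    ("potomac", [(0, 0), (3, 0), (3, 1), (1, 1), (1, 3), (0, 3)]),
    ("main_stem", [(1, 1), (3, 1), (3, 3), (1, 3)]),
]
samples = [(2, 2), (0.5, 2.5), (2.5, 1.5), (2, 0.5)]
index = build_index(partitions, samples)
```

Sorting on the partition identifier before any within-partition key is what keeps each region's samples adjacent in the index, and therefore adjacent on disk when the index order is used as the storage order.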
Eric Perlman received a B.S. in Computer Engineering in 2002 from the University of California, Santa Cruz. He enrolled in the Computer Science Ph.D. program at Johns Hopkins University in 2003. He has worked on large distributed file systems during internships at both IBM Almaden Research Center in 2003 and Google in 2004.
At Johns Hopkins, Eric’s work primarily focused on improving access to large scientific data. He helped build the infrastructure for three interdisciplinary research projects: the JHU Turbulence Database Cluster, the Chesapeake Bay Environmental Observatory Testbed (CBEO:T), and the Open Connectome Project (OCP).
As of December 2012, Eric is working as a Bioinformatics Specialist at the Howard Hughes Medical Institute’s Janelia Farm Research Campus in Ashburn, VA. He is working with Dr. Davi Bock to build a processing pipeline for data captured using high-throughput electron microscopy.