Supporting Data Analysis in Large Scale Scientific Databases

Xiaodan Wang, Johns Hopkins University

Continued improvements in physical instruments and data pipelines have led to exponential growth in data volumes in the sciences. In astronomy, for example, the Panoramic Survey Telescope and Rapid Response System (Pan-STARRS) produces tens of terabytes daily. In turn, the scientific community distributes data geographically on a global scale, accumulating data at multiple autonomous sources, and relies on application-driven optimizations to manage and share petabyte-scale repositories. More importantly, workloads that comb through vast amounts of data are gaining importance in the sciences. These workloads consist of “needle in a haystack” queries that are long running and data intensive, requiring non-indexed scans of multi-terabyte tables, so that query throughput limits performance. Queries also join data from geographically distributed sources, and transmitting that data places tremendous strain on the network. Query scheduling must therefore be re-examined to overcome scalability barriers and enable a large community of users to simultaneously explore the resulting massive amounts of scientific data. Toward this goal, we study algorithms that incorporate network structure into scheduling for distributed join queries. The resulting schedules exploit excess network capacity and minimize the use of network resources across multiple queries. We also put forth a data-driven, batch processing paradigm that improves throughput in highly contended databases by identifying partial overlap in the data accessed by incoming queries. Implementing our algorithms in Astronomy and Turbulence databases yields significant reductions in both network and I/O costs. Moreover, similar resource scheduling problems exist in cloud computing, and we extend our algorithms to large-scale data processing systems such as Hadoop, Cassandra, and BigTable.
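
To make the batching idea concrete, below is a minimal sketch of data-driven batch processing; it is not the system described above, and all names (Query, batch_by_overlap, the partition labels) are hypothetical. The sketch groups incoming queries by the table partitions they read, so that a single sequential scan of a contended partition serves every query waiting on it rather than one scan per query.

```python
from collections import defaultdict

class Query:
    """A hypothetical query descriptor: an id plus the set of
    table partitions its non-indexed scan must read."""
    def __init__(self, qid, partitions):
        self.qid = qid
        self.partitions = set(partitions)

def batch_by_overlap(queries):
    """Group queries that touch common partitions so each partition
    is scanned once per batch instead of once per query.

    Returns (partition, waiting query ids) pairs, ordered so the
    most contended partitions are scanned first, since they satisfy
    the most queries per unit of I/O.
    """
    waiting = defaultdict(list)
    for q in queries:
        for p in q.partitions:
            waiting[p].append(q.qid)
    return sorted(waiting.items(), key=lambda kv: len(kv[1]), reverse=True)

if __name__ == "__main__":
    incoming = [
        Query("q1", ["p1", "p2"]),
        Query("q2", ["p2", "p3"]),
        Query("q3", ["p2"]),
    ]
    for partition, qids in batch_by_overlap(incoming):
        print(f"scan {partition} once, feed results to {qids}")
    # p2 is scanned first: one scan of it serves q1, q2, and q3.
```

Under this (assumed) cost model, the I/O saved grows with the degree of overlap: three queries sharing one partition cost one scan instead of three, which is the source of the throughput gains the paradigm targets.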