Efficient evaluation of data-intensive batch-queries in open simulation laboratories

Better instruments, faster and bigger supercomputers and easier collaboration and sharing of data in the sciences have introduced the need to manage increasingly large datasets. Advances in high-performance computing (HPC) have empowered the computational branches of many scientific disciplines. However, many scientists lack access to HPC facilities or the necessary sophistication to develop and run HPC codes. The benefits of testing new theories and experimenting with large numerical simulations have thus been restricted to a few top users. In this dissertation, I describe the "remote immersive analysis" approach to computational science and present new techniques and methods for the efficient evaluation of scientific analysis tasks in analysis cluster environments.

I will discuss several techniques developed for the efficient evaluation of data-intensive batch-queries in large numerical simulation databases. An I/O streaming method for the evaluation of decomposable kernel computations uses partial sums to evaluate a batch-query in a single sequential pass over the data. Spatial filtering computations that use a box filter share not only data but also computation, and can be evaluated over an intermediate summed-volumes dataset derived from the original data. This is more efficient for certain workloads even when the intermediate dataset is computed dynamically. Threshold queries have immense data requirements and potentially operate over entire time-steps of the simulation. An efficient and scalable data-parallel approach evaluates threshold queries of fields derived from the raw simulation data and stores their results in an application-aware semantic cache for fast subsequent retrieval. Finally, three approaches to the evaluation of particle tracking queries are examined and compared: synchronization at a mediator, a task-parallel approach and a data-parallel approach.
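To illustrate why box-filter queries can share computation, here is a minimal sketch of a summed-volume table, the 3D analogue of a summed-area table. It is not the JHTDB implementation: the function names (`summed_volume`, `box_sum`, `box_filter`), the nested-list field layout and the half-open box convention are all assumptions made for illustration. Once the table is built, the sum over any axis-aligned box takes eight lookups regardless of the box size.

```python
def summed_volume(field):
    """Build a 3D prefix-sum table, padded with a zero layer on the low side of each axis.

    sv[x][y][z] holds the sum of field[i][j][k] for all i < x, j < y, k < z.
    """
    nx, ny, nz = len(field), len(field[0]), len(field[0][0])
    sv = [[[0] * (nz + 1) for _ in range(ny + 1)] for _ in range(nx + 1)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                # Inclusion-exclusion over the seven previously computed neighbours.
                sv[x + 1][y + 1][z + 1] = (
                    field[x][y][z]
                    + sv[x][y + 1][z + 1] + sv[x + 1][y][z + 1] + sv[x + 1][y + 1][z]
                    - sv[x][y][z + 1] - sv[x][y + 1][z] - sv[x + 1][y][z]
                    + sv[x][y][z]
                )
    return sv


def box_sum(sv, lo, hi):
    """Sum of the field over the half-open box [lo, hi) using eight table lookups."""
    (x0, y0, z0), (x1, y1, z1) = lo, hi
    return (
        sv[x1][y1][z1]
        - sv[x0][y1][z1] - sv[x1][y0][z1] - sv[x1][y1][z0]
        + sv[x0][y0][z1] + sv[x0][y1][z0] + sv[x1][y0][z0]
        - sv[x0][y0][z0]
    )


def box_filter(sv, lo, hi):
    """Box-filtered (mean) value of the field over the half-open box [lo, hi)."""
    volume = (hi[0] - lo[0]) * (hi[1] - lo[1]) * (hi[2] - lo[2])
    return box_sum(sv, lo, hi) / volume
```

In a batch of filter queries, the cost of building the table is paid once and then amortized across every query, which is one way to read the claim that the intermediate dataset can pay off even when it is computed dynamically.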

These techniques are developed, deployed and evaluated in the Johns Hopkins Turbulence Databases (JHTDB), an open simulation laboratory for turbulence research. The JHTDB stores the output of world-class numerical simulations of turbulence and provides public access to their complete space-time history, along with the means to explore it. The techniques discussed implement core scientific analysis routines and significantly increase the utility of the service. Additionally, they improve the performance of these routines by up to an order of magnitude or more when compared with direct implementations or implementations adapted from the simulation code.

Speaker Biography

Kalin Kanov was born in Dimitrovgrad, Bulgaria on October 11th, 1982. He graduated with distinction from the University of Virginia in 2006 with a B.A. degree in Astronomy and Physics and a minor in Computer Science. He was awarded the Limber award, given to the most outstanding Astronomy graduate of the class of 2006. After graduating from UVA he interned at NASA’s Goddard Space Flight Center and worked at Perrin Quarles Associates on the development of the Emissions Collection and Monitoring Plan System for the U.S. EPA Clean Air Markets Division.

Kalin enrolled in the Computer Science Ph.D. program at Johns Hopkins University in 2008. His research has focused on the development of methods for the efficient evaluation of batch-queries for large numerical simulation datasets. During internships at Los Alamos National Laboratory and Google he worked on large scientific database systems and evaluation techniques for complex arithmetic expressions over dataset features partitioned into columns.