Tools, Query Planning and Compilers for Manipulating and Processing Very Large Datasets

Joel Saltz, Johns Hopkins Medical Institutions

Applications that make use of very large scientific datasets have become an increasingly important subset of scientific applications. In these applications, the datasets are often multi-dimensional, i.e., data items are associated with points in a multi-dimensional attribute space. The processing is usually highly stylized, with the basic processing steps consisting of (1) retrieval of a subset of the available data in the input dataset via a range query, (2) projection of each input data item to one or more output data items, and (3) some form of aggregation of all the input data items that project to each output data item. We have developed an infrastructure, called the OR-ELSE (Object Relational Extremely Large ScalE) Database Extender, that integrates storage, retrieval and processing of multi-dimensional datasets on scalable architectures. We will address query planning and execution strategies for range queries with user-defined processing. We will also describe operating system and algorithm support, derived from our work on Active Disks architectures, that will allow us to efficiently implement a broad class of decision support databases on inexpensive, highly scalable architectures.
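The three processing steps described above (range query, projection, aggregation) can be illustrated with a minimal in-memory sketch. It is not the OR-ELSE interface: the function names (in_range, process, to_grid_cell), the grid-cell projection, the choice of max as the aggregation, and the sample data are all hypothetical and chosen only to show the pattern.

```python
def in_range(point, lo, hi):
    """Return True if the point lies inside the axis-aligned box [lo, hi]."""
    return all(l <= x <= h for x, l, h in zip(point, lo, hi))


def process(dataset, lo, hi, project, aggregate, initial):
    """Apply the range-query / projection / aggregation pattern.

    dataset   -- iterable of (point, value) pairs; point is a coordinate tuple
    lo, hi    -- corners of the range query box
    project   -- user-defined: maps an input point to a list of output keys
    aggregate -- user-defined: combines the running result with a new value
    initial   -- starting value for each output item
    """
    output = {}
    for point, value in dataset:
        # Step 1: keep only items selected by the range query.
        if not in_range(point, lo, hi):
            continue
        # Step 2: project the input item to one or more output items.
        for key in project(point):
            # Step 3: aggregate every input that maps to the same output item.
            output[key] = aggregate(output.get(key, initial), value)
    return output


def to_grid_cell(point, cell_size=10.0):
    """Hypothetical projection: map a point to the grid cell that contains it."""
    return [tuple(int(x // cell_size) for x in point)]


# Made-up 2-D sample data: ((x, y), measurement).
dataset = [
    ((1.0, 2.0), 5.0),
    ((4.0, 6.0), 9.0),
    ((3.0, 14.0), 7.0),
    ((12.0, 4.0), 1.0),
    ((95.0, 95.0), 100.0),  # falls outside the query box below
]

result = process(dataset, (0.0, 0.0), (20.0, 20.0),
                 to_grid_cell, max, float("-inf"))
print(result)
```

Here the projection and aggregation functions are passed in as arguments, standing in for the user-defined processing mentioned in the abstract. A real system would partition the dataset across disks and processors rather than scanning a single list.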