Online Physical Design in Proxy Caches
Introduction      Applications, such as proxy caches and content distribution networks, serve a continuous stream of queries from tens of thousands of users in which a representative workload may not be available. As such, current research emphasizes the need for design tools that are always on and can continuously adapt the physical design to changes in the workload.

We study an online approach to vertical partitioning that detects incremental changes in the workload and adapts the physical design automatically. Our target application is physical design of the Bypass cache, a proxy database cache for the SkyQuery federation of Astronomy databases.
Technical Report
Architectural Components
Problem Description
Source Code
Related Projects

Technical Report      Abstract
Performance of proxy caches for database federations that serve a large number of users is crucially dependent on its physical design. Current techniques, automated or otherwise, for physical design depends on the identification of a representative workload and specification of a database. In proxy caches, however, such techniques are inadequate since workload characteristics and database contents change over time. This is remarkably shown at the proxy cache of SDSS, an Astronomy database, which receives a continuously evolving workload and in which cached objects are continuously loaded and evicted. We present novel techniques for automated physical design that adapt with the workload and balance the performance benefits of physical design decisions with the cost of implementing these decisions. These include online algorithms that respond effectively to drastic changes in the workload and efficient cost estimation modules that make these algorithms practical for deployment in real world caches. Our experiments within the proxy cache of the SDSS database demonstrate significant improvement in cache performance over offline techniques.

Download Report            PDF       PostScript

Return to top

Architectural Components     
Architectural components of the AdaptPD system

The online physical design tool is collocated with the proxy cache of the SDSS database. The proxy cache receives a sequence of queries. Ideally, given a new query the caching and the configuration decision will be simultaneous. However, we implement it by first sending the query to the AdaptPD module which along with the Cost Estimation module and the Configuration Manager determines if a transition is required. The query is then presented to the cache for execution. The cache may determine that new objects be loaded for executing the current query. If data objects are loaded, the specification of the new objects is sent to the Configuration Manager. The Configuration Manager builds up the new set of possible and relevant configurations to consider for the AdaptPD module. The Configuration Manager instructs the cache to perform the lowest cost cache transition as well as provides the start configuration for AdaptPD.

The Configuration Manager employs a workload-based heuristic to determine the set of N-configurations evaluated in AdaptPD An ad hoc partitioning of tables leads to an exponential number of configurations. Fortunately, Astronomy workloads are template-based and can be summarized compactly through query prototypes. A query prototype is defined as the set of attributes a query accesses such that queries with identical prototypes make up an equivalence class in the workload.

Candidate configurations are generated as new query prototypes appear in the workload sequence. Given a table T1(a,b,c,d,e) (a is the primary key) and a prototype P1 = (a,c), we generate a candidate configuration for P1 corresponding to tables T2(a,c) and T3(a,b,d,e). Thus, each candidate is optimized for the scan cost of a specific class of queries. We also merge non-overlapping prototypes to generate new candidates. This intuition comes from enumerationbased, offline vertical partitioning algorithms in which candidates are enumerated by coalescing groups of columns in existing configurations in a pairwise manner. Resulting candidates gradually reduce the expected total query execution cost for the workload.

Return to top

Problem Description     

Sample Queries
Top two most frequent query types during each week
Workload Evolution
Evolution across Data Releases and over time within the same Data Release

Return to top

Source Code      Workload Parser
We are currently building a customized SQL parser for AdaptPD. The grammer was constructed using ANTLR 2.7.5 ( and the driver was developed under Microsoft Visual Studio .Net in C# (Note that the parser is specifically tuned for query logs from the Sloan Digital Sky Survey (SDSS) database, so it may have trouble parsing arbitrary workloads).

Download Source            Source       Readme
(An older version of the SQL parser, developed using libraries from the Open SkyQuery project, is also available. The major benefit being that it can output in XML.    Download    Source   Readme)

Sloan Digital Sky Survey (SDSS) Workload
We are currently using workloads extracted from query logs on the SDSS database. A representative query trace from the DR4 version of the database are included.

Download Trace            DR4 (1.4 million queries)

Open SkyQuery portal
A modified version of the Open SkyQuery portal with a semi-functional proxy cache that operates on SDSS at table granularity is available. The caching module is included under "CacheTools" and designed as a proof-of-concept. It demonstrates how bypass decsions are made in the SkyQuery framework but does not actually load and evict multi-gigabyte tables. The distribution is developed as a .Net Web Application and require Internet Information Services (IIS), SQL Server, and .Net framework to function.

Download Portal            Source       Readme

Return to top

Related Projects      Preliminary Works
Self Managing Database workshop (SMDB'08)
Download Paper            PDF       PostScript

New England Database Day conference (NEDBDay'08).
Download Poster            PDF       PostScript

Database Systems for Advanced Applications conference (DASFAA'07).
Download Paper            PDF       PostScript

SkyQuery is a web-based database application for the World Wide Telescope (WWT). In SkyQuery, clients submit queries through an applet-based query interface, which arrives at the Open SkyQuery portal, the mediation middle-ware of WWT. The mediator then divides requests into sub-queries for each site in the database federation. For details on SkyQuery, please refer to the following links:

SkyQuery Project:    
Open SkyQuery Portal:

Bypass Caching
The bypass-yield cache framework into the WWT for making query bypass decisions. A bypass-yield cache is altruistic by nature and designed to minimize network traffic in a database federation. For details on bypass caching, please refer to the following paper:

Download Paper            PDF       PostScript

Return to top


Return to top