------------------------------------------------------------------------
AutoParse: Workload Parser For The DynaPart Project
Johns Hopkins University

Description:

  This parser is designed to work with queries collected from the SDSS
  log for both the SkyServer V3 and DR1 workloads. Also included are
  several tools used to facilitate experiments conducted using the
  workload.

------------------------------------------------------------------------

CONTENTS

  1. File/Directory Organization
  2. Installation and Compilation
  3. Running AutoParse
  4. Specification of Output Formats
  5. Configuration Files
  6. Documentation
  7. Notes on Limitations
  8. Contact Information

------------------------------------------------------------------------
1. File/Directory Organization
------------------------------------------------------------------------

  README         - this readme
  AutoParse.sln  - Visual Studio project solution file
  AutoParse.suo  - Visual Studio project user options
  AutoParse.ndoc - NDoc project file
  bin/           - executables
  doc/           - documentation
  etc/           - configuration files and metadata
  src/           - source code
  tmp/           - deleted project files and other miscellany

------------------------------------------------------------------------
2. Installation and Compilation
------------------------------------------------------------------------

  AutoParse is written in C# as a Visual Studio .NET project. To
  compile the code, you must have Visual Studio .NET or later installed
  on your local machine. To run the program, you must also install the
  latest .NET Framework
  (http://msdn.microsoft.com/netframework/downloads/framework1_1/).

  Note that the AutoParse tool was developed on a Windows XP machine
  with Service Pack 1 and .NET Framework 1.1.

  - Compiling AutoParse

    Open the AutoParse.sln solution file as a Visual Studio .NET
    project. Select "Build Solution" to compile the executable.
    Copy the executable and supporting files (AutoParse.exe and
    antlr.runtime.dll, located in bin/) into your working directory.

  - Compiling the sql grammar

    Download the latest ANTLR executable for C# from
    http://www.antlr.org/ (AutoParse was developed with ANTLR 2.7.5).
    Add antlr.exe to your path and, from src/sql/, run:

      antlr sql.g

    where sql.g is the grammar file (run "antlr -traceParser sql.g" to
    debug the output). Make changes to sql.g and recompile as needed.

------------------------------------------------------------------------
3. Running AutoParse
------------------------------------------------------------------------

  AutoParse reads queries from standard input and writes any parse
  errors and summary statistics to standard out. Thus, unless you are
  debugging, it is best to use input redirection to supply the query
  input.

  Sample command line:

    C:\[your path]> AutoParse -bypass < query.log > console.out

  Usage:

    AutoParse -[debug | bypass | decision | diffbypass | parsable |
               ylderr | randsel | cnt | cntqry | singletbl | funcmeta |
               elaptim | elaptimpr | optrows | optio]
              [-numselect <# queries>] [-numtotal <# total>]
              [-tbl <table>] [-if1 <filename>] [-if2 <filename>] [-h]

    ( e.g. AutoParse -bypass < query.log )

  Options:

    -debug:      debug mode, dump parsed tokens for each query.
    -bypass:     write query in bypass cache usable format.
    -decision:   write query in decision tree usable format.
    -diffbypass: write differences between two query files in bypass
                 cache usable format (see -if1, -if2).
    -if1:        used with the -diffbypass option, specify input file
                 #1 in <filename>.
    -if2:        used with the -diffbypass option, specify input file
                 #2 in <filename>.
    -parsable:   write only unmodified queries that can be parsed by
                 the current parser.
    -cnt:        rewrite input queries as aggregate queries
                 (count_big).
    -cntqry:     similar to the output with -cnt, but each query is
                 followed by the unmodified version.
    -elaptim:    parse elapsed time data from the query analyzer
                 output.
    -elaptimpr:  similar to the output with -elaptim, but times are
                 summed in pairs (useful for -cntqry).
    -optrows:    parse estimated rows data from the query analyzer
                 output.
    -optio:      parse IO estimates from the query analyzer output.
    -ylderr:     write only unmodified queries with questionable yield
                 (>5000 rows and divisible by 1000).
    -randsel:    randomly select a set number of queries from the
                 workload (see -numselect, -numtotal).
    -numselect:  used with the -randsel option, specify the number of
                 queries to select in <# queries>.
    -numtotal:   used with the -randsel option, specify the total
                 queries in the workload in <# total>.
    -singletbl:  write unmodified queries that reference only a single
                 table.
    -tbl:        used with -singletbl, specifies the single table to
                 select from in <table>.
    -funcmeta:   write user function table and column dependencies to
                 the file specified in the config file.
    -h:          print usage instructions.

  Option Details:

    -debug: dump all parsed tokens to standard output (see output
            format below)
     * Ex: AutoParse -debug < query.log
     *
     * Input: select ra from photoobj
     * Output:
     *   result file - "select ra from photoobj"
     *   yield file  -
     *   count file  - "select count_big(*) from photoobj"

    -bypass: parse query, output bypass cache format (see below)
     * Ex: AutoParse -bypass < query.log
     *
     * Input: select ra from photoobj
     * Output:
     *   result file - "1,0,PHOTOOBJ:RA(S),$$"
     *   yield file  -
     *   count file  -

    -decision: parse query, output decision tree format (see below)
     * Ex: AutoParse -decision < query.log
     *
     * Input: select ra from photoobj where ra > 30
     * Output:
     *   result file - "from:from
     *                  table:photoobj:photoobj
     *                  where:where
     *                  clause:a24.331,9,c30
     *                  ************
     *                  QryCnt:1"
     *   yield file  -
     *   count file  -

    -diffbypass: write the result of comparing parsed output from two
                 files
     * Ex: AutoParse -diffbypass -if1 file1 -if2 file2 < query.log
     *
     * Input:
     *   query.log - select ra from photoobj
     *   file1     - 1,0,PHOTOOBJ:RA(S),$$
     *   file2     - 1,1,PHOTOOBJ:RA(W),$$
     * Output:
     *   result file - "Begin comparing parsed files:
     *                  #1
     *                  mismatch: select ra from photoobj
     *                  >1,0,PHOTOOBJ:RA(S),$$
     *                  <1,1,PHOTOOBJ:RA(W),$$
     *
     *                  total number mismatch: 1, aggregate mismatch: 0"
     *   yield file  -
     *   count file  -

    -parsable: output queries that can be parsed
     * Ex: AutoParse -parsable < query.log
     *
     * Input: select ra from photoobj
     * Output:
     *   result file - "select ra from photoobj"
     *   yield file  -
     *   count file  -

    -cnt: rewrite as count query
     * Ex: AutoParse -cnt < query.log
     *
     * Input: select ra from photoobj
     * Output:
     *   result file - "select ra from photoobj"
     *   yield file  -
     *   count file  - "select count_big(*) from photoobj"

    -cntqry: rewrite as count query followed by the original query
     * Ex: AutoParse -cntqry < query.log
     *
     * Input: select ra from photoobj
     * Output:
     *   result file - "select ra from photoobj"
     *   yield file  -
     *   count file  - "select count_big(*) from photoobj
     *                  select ra from photoobj"

    -elaptim: parse CPU and estimated query execution times from the
              query analyzer output (note: the first two entries are
              ignored)
     * Ex: AutoParse -elaptim < elaptime.out
     *
     * Input: "SQL Server Execution Times:
     *           CPU time = 47 ms, elapsed time = 87 ms."
     * Output:
     *   result file - "47,87"
     *   yield file  -
     *   count file  -

    -elaptimpr: same as -elaptim, but entries are grouped in pairs and
                the output is the sum of each pair of estimated times
                (note: the first two entries are ignored)
     * Ex: AutoParse -elaptimpr < elaptime.out
     *
     * Input: "SQL Server Execution Times:
     *           CPU time = 0 ms, elapsed time = 62 ms.
     *
     *         SQL Server Execution Times:
     *           CPU time = 0 ms, elapsed time = 3 ms."
     * Output:
     *   result file - "0,65"
     *   yield file  -
     *   count file  -

    -optrows: parse estimated rows from query analyzer output
              (note: $ delimited input)
     * Ex: AutoParse -optrows < optrows.out
     *
     * Input:
     * Output:
     *   result file - ""
     *   yield file  -
     *   count file  -

    -optio: parse estimated IO from query analyzer output
            (note: $ delimited input)
     * Ex: AutoParse -optio < optio.out
     *
     * Input:
     * Output:
     *   result file - ""
     *   yield file  -
     *   count file  -

    -ylderr: output queries with questionable yield
     * Ex: AutoParse -ylderr < query.log
     *
     * Input:
     * Output:
     *   result file -
     *   yield file  -
     *   count file  -

    -randsel: randomly select a subset of queries from the workload,
              where -numselect denotes the size of the subset and
              -numtotal denotes the size of the workload to select
              from.
     * Ex: AutoParse -randsel -numselect 100 -numtotal 20000 < query.log
     *
     * Input: <20000 queries>
     * Output:
     *   result file - <100 randomly selected unmodified queries>
     *   yield file  - <100 corresponding row counts>
     *   count file  -

    -singletbl: output queries that reference only a single table
     * Ex: AutoParse -singletbl -tbl photoobj < query.log
     *
     * Input: "select ra from photoobj"
     * Output:
     *   result file - "select ra from photoobj"
     *   yield file  -
     *   count file  -

    -funcmeta: output user function table and column dependencies (see
               output format below)
     * Ex: AutoParse -funcmeta
     *
     * Input:
     * Output:
     *   table dependency file
     *   column dependency file

------------------------------------------------------------------------
4. Specification of Output Formats
------------------------------------------------------------------------

  - Bypass Cache Format

    The primary output format of parsed queries, containing a list of
    attributes referenced or returned by the query.

     * Format specification:
     *
     *   <yield>,<agg>,<attr 1>,<attr 2>,...,<attr n>,$$
     *
     * <yield>: Number of rows returned by the query (read from the
     *   corresponding yield file)
     * <agg>: 1 if aggregate query, 0 if not (aggregate queries
     *   reference the aggregate functions MIN/MAX/AVG/SUM/COUNT)
     * <attr>: column or function attribute.
    Attributes may take one of the following forms:

     * <table:column(S)>: name of the column prefixed by the
     *   corresponding table, separated by ":". "(S)" indicates that
     *   the column is part of the query yield (i.e., returned in the
     *   select clause).
     * <table:column(W)>: "(W)" denotes that the column is
     *   used/referenced by the query but not returned in the result,
     *   hence not part of the yield. The column may appear in the
     *   where, order by, group by, or having clauses, or
     *   alternatively, it is a parameter to an aggregate function.
     * <table:column(F)>: "(F)" denotes that the column is used by a
     *   user-defined function or is passed as a parameter to a
     *   user-defined function.
     * <function>: user-defined function that is referenced outside of
     *   the "from" clause.
     * <function:column(S)>: similar to <table:column(S)> except it is
     *   prefixed by the user-defined function that returned the
     *   corresponding column.
     * <function:column(W)>: similar to <table:column(W)> except it is
     *   prefixed by the user-defined function.

    Rules (star [*] attributes are resolved prior to parsing):

     * ATTRIBUTES FROM USER TABLES:
     * 1) All columns in the select clause are matched against the
     *    corresponding table in the from clause and a corresponding
     *    item <table:column(S)> is appended to the parsed result.
     *    Duplicate columns are ignored.
     * 2) All columns in the where, order by, group by, and having
     *    clauses, or in the parameter list of an aggregate function,
     *    are matched against the corresponding table in the from
     *    clause and a corresponding item <table:column(W)> is
     *    appended to the parsed result. However, if a matching
     *    attribute of type "(S)" or "(W)" was already appended, then
     *    the attribute is ignored.
     *
     * ATTRIBUTES FROM USER-DEFINED FUNCTIONS:
     * 1) User-defined functions in the select, where, order by, group
     *    by, or having clauses are appended to the parsed result in
     *    the format <function>. Duplicate functions are ignored.
     * 2) All columns passed as parameters to a user-defined function
     *    are matched against the corresponding table in the from
     *    clause and a corresponding item <table:column(F)> is
     *    appended to the parsed result.
     *    If a duplicate column of type "(S)", "(W)", or "(F)" was
     *    already appended, then the column is ignored.
     * 3) All columns used by a user-defined function are matched
     *    against the corresponding user table and a corresponding
     *    item <table:column(F)> is appended to the parsed result. If
     *    a duplicate column of type "(S)", "(W)", or "(F)" was
     *    already appended, then the column is ignored.
     * 4) For all columns in the select clause that are matched
     *    against a corresponding user-defined function in the from
     *    clause, a corresponding item <function:column(S)> is
     *    appended to the parsed result. Duplicate columns are
     *    ignored.
     * 5) For all columns in the where, order by, group by, and having
     *    clauses, or in the parameter list of an aggregate function,
     *    that are matched against a corresponding user-defined
     *    function in the from clause, a corresponding item
     *    <function:column(W)> is appended to the parsed result.
     *    However, if a matching attribute of type "(S)" or "(W)" was
     *    already appended, then the attribute is ignored.
     *
     * PRECEDENCE ORDERING FOR RESOLVING DUPLICATE COLUMNS:
     * When appending to the attribute list, the following precedence
     * holds. Namely, when duplicate attributes are referenced by a
     * query, only the attribute type with the highest precedence is
     * appended to the list.
     *
     *   Type "(S)" > Type "(W)" > Type "(F)"
     *
     * - Duplicate attributes are defined as columns with a matching
     *   source table/user-defined function and attribute name.

  - Decision Tree Format

     * Format specification:
     *
     *   from:from
     *   table:<alias>:<table>
     *   ftable:<alias>:<function>
     *   fparam:<params>
     *   where:where
     *   clause:<left>,<operator>,<right>
     *   QryCnt:<# queries>
     *
     * <alias>    : table or function alias (or name if no alias)
     * <table>    : table name
     * <function> : function name
     * <params>   : function parameter list, in unique id
     * <left>     : left operand of predicate, in unique id
     * <operator> : predicate operator, in unique id
     * <right>    : right operand of predicate, in unique id
     * <# queries>: total queries in the workload

  - Debug Output Format

     * Format specification:
     *
     *   <token>:<type>
     *   ...
     *   <token>:<type>
     *   COUNT:<count query>
     *
     * <token>: Name of the parsed token
     * <type> : Token type (i.e., table, column, function, etc.)
     * <count query>: re-written count query

  - Elapsed Time Output Format

     * Format specification:
     *
     *   <cpu time>,<elapsed time>
     *
     * Note that the first two elapsed time entries from the query
     * analyzer output are ignored.

  - Estimated Rows/IO Output Format

     * Format specification:
     *
     *   <estimate>
     *
     * Note that for IO, relative IO is the sum of all non-zero
     * estimated I/O values from the query's optimizer output.

  - Yield Output Format

    Corresponding yield values for parsed queries are written to the
    yield output file:

     * Format specification:
     *
     *   <yield>
     *
     * <yield>: Number of rows for the corresponding query

  - Function Table/Column Dependency Output

     * Format specification:
     *
     *   TABLE
     *   <function>,<table>
     *
     *   COLUMN
     *   <function>,<table>,<column>

------------------------------------------------------------------------
5. Configuration Files
------------------------------------------------------------------------

  The following files can be found under the etc/ directory and are
  required to run AutoParse. The "config" document must be located in
  the same path as the AutoParse tool.

  config: Configures startup parameters for AutoParse. Specifies the
          path for all metadata files and output files.

  metadata/
    tables_bestdr1:  Table metadata for the DR1 database. Includes
                     row/column counts and byte width information.
    tables_v3:       Table metadata for the SkyServer V3 database.
    columns_bestdr1: Column metadata for the DR1 database. Includes
                     byte width information.
    columns_v3:      Column metadata for the SkyServer V3 database.
    dr2func:         Function metadata for the DR2 database. Includes
                     table, column, and function dependencies.
    trace_rows_dr1:  Yield (row counts) for the DR1 workload included
                     with the parser.
    trace_rows_v3:   Yield (row counts) for the SkyServer V3 workload
                     included with the parser.

------------------------------------------------------------------------
6. Documentation
------------------------------------------------------------------------

  Documentation for the project is generated via Microsoft Visual
  Studio's XML Documentation feature (see doc/AutoParse.xml).
  We then used the NDoc add-on for Visual Studio to convert the
  documentation into HTML (MSDN format). Refer to doc/index.html for
  the full documentation. NDoc is an open source add-on for generating
  documentation for C# (http://ndoc.sourceforge.net/).

------------------------------------------------------------------------
7. Notes on Limitations
------------------------------------------------------------------------

  The parser was not designed to be a complete and exhaustive SQL
  parser. Rather, it is designed to correctly parse the DR1 and V3
  workloads included with AutoParse. Therefore, queries using complex
  nested structures or built-in functions that were not encountered in
  our workload may fail to parse, but the parser should work
  reasonably well overall as our test set is quite large.

  Furthermore, the parser is closely linked to the metadata files. For
  example, if an attribute is referenced that cannot be linked to a
  source table based on the metadata, the query will fail to parse.

  Regarding the attribute listing output format for parsed results, we
  decided to use this format because it satisfied our experimental
  needs. Should more information be required, it should be
  straightforward to modify the code to utilize unused tokens.
  However, because binary/unary operators and keywords are discarded,
  it will not be possible to recover those tokens unless changes are
  made to the grammar.

  - Preprocessing

    Certain queries from our workload had syntax problems that
    required minor preprocessing to ready the query for parsing
    (mostly, separating keywords from surrounding literals with white
    space).

  - Queries Ignored

    Certain types of queries from the DR1 workload were ignored,
    either because there were problems with the syntax or because they
    were irrelevant to our experiments.
    - Access to the SQL query log
      * Ex: select * from sdssad2.weblog.dbo.sqllog where statement
        like '%jpeg%' and yy=2004

    - Lacking a 'from' clause
      * Ex: select ' select ed_string_rhl' as ' select ed_string_rhl',
        dbo.fphototype('star')

    - Syntax issues
      * Ex: select top 10 * from sdssad6.bestdr1.dbo.region
      *
      * Ex: select g.objid, g.u,g.g,g.r,g.i,g.z,g.err_u,g.err_g,
        g.err_r,g.err_i,g.err_z, g,extinction_u,g,extinction_g,
        g,extinction_r,g,extinction_i,g,extinction_z, pz.z as photoz,
        pz.zerr as photozerr,pz.t, pz.terr from galaxy g, photoz pz
        where pz.objid = g.objidandra>175.4 and ra < 176 and dec>-2
        and dec<-1.45
        (Operator chaining, syntax error in query analyzer:
        "... pz.objid = g.objidandra>175.4 ...")
      *
      * Ex: select top 10 * from fcoords from eq(256,15)
      *
      * Ex: select top 10 p.objid, p.ra, p.dec,

    - System tables
      * Ex: select ' select ed_string_rhl' as ' select ed_string_rhl',
        xtype from sysobjects where name like 'phototag'

    - Server stats
      * Ex: select host_name(), db_name(), @@servername, host_id()

    - Unrecognized table source (#x, #upload)
      * Ex: select u.*, p.objid,
        'http://skyserver.sdss.org/dr1/en/tools/explore/obj.asp?id=' +
        cast(p.objid as varchar(20)) as url, str(p.ra,9,5) as ra,
        str(p.dec,9,5) as dec, dbo.fphototypen(p.type) as type,
        p.status, str(p.u,6,2) as u, str(p.g,6,2) as g,
        str(p.r,6,2) as r, str(p.i,6,2) as i, str(p.z,6,2) as z
        from #x x, #upload u, photoprimary p
        where u.up_id = x.up_id and x.objid=p.objid order by x.up_id

    - Column referenced (run) is not an attribute of the source table
      (region)
      * Ex: select area from region where (run=745)

------------------------------------------------------------------------
8. Contact Information
------------------------------------------------------------------------
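------------------------------------------------------------------------
Appendix: Reading the Bypass Cache Format
------------------------------------------------------------------------

  The bypass cache format described in section 4 is a flat,
  comma-separated record terminated by "$$", so it is easy to
  post-process outside of AutoParse. The sketch below (Python, not
  part of AutoParse; the helper name and the tuple layout are our own
  choices) shows one way to read such a record under the documented
  <yield>,<agg>,<attr 1>,...,<attr n>,$$ layout:

```python
def parse_bypass_line(line):
    """Split one bypass-cache record into (yield, is_aggregate, attrs).

    Each attr is a (source, column, kind) tuple: source is the table
    or user-defined function, column is None for a bare function
    entry, and kind is 'S', 'W', or 'F' when a suffix is present.
    """
    fields = line.strip().split(',')
    if fields[-1] != '$$':
        raise ValueError('record missing "$$" terminator')
    row_yield = int(fields[0])           # <yield>: rows returned
    is_aggregate = fields[1] == '1'      # <agg>: 1 if aggregate query
    attrs = []
    for item in fields[2:-1]:            # <attr 1> ... <attr n>
        kind = None
        if item.endswith(')') and '(' in item:
            # Strip a trailing "(S)"/"(W)"/"(F)" suffix, if any.
            item, _, kind = item[:-1].rpartition('(')
        source, _, column = item.partition(':')
        attrs.append((source, column or None, kind))
    return row_yield, is_aggregate, attrs


# Example record from the -bypass option in section 3.
print(parse_bypass_line('1,0,PHOTOOBJ:RA(S),$$'))
```

  The helper intentionally tolerates both table-prefixed and
  function-prefixed attributes, since both appear in the attribute
  forms listed in section 4.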