Final Examination: Saturday, May 10, 9 AM-12 PM, Shaffer 100

The exam is closed book/closed note. However, you may bring a single,
double-sided sheet of 8.5x11 inch paper with anything you wish written on it.
This may include notes and formulas of any kind, and the act of preparing
the sheet is helpful in the review process.

Topics for Exam:
---------------

PAT trees, suffix arrays
Inverted files (creation and use), indexing
Signature files
Goals and methods of document representation
Compression in IR
Boolean IR models (including extensions to basic Boolean models)
Vector-based IR models in detail, including term weighting,
   similarity measures, ...
Bayesian IR models (Inquery system, Naive Bayes, hierarchical Bayes)
Evaluation metrics
   precision, recall, F-measure, normalized recall, accuracy,
   methods for computation, P_25, P_50, interpolation issues,
   understanding of issues and challenges in IR evaluation
Query expansion vs. term clustering
Clustering algorithms, including Salton's greedy method,
   hierarchical agglomerative clustering (including algorithm details
   such as minimal/maximal/average linkage variants, dendrograms, etc.)
SVD (singular value decomposition)/LSI (latent semantic indexing)
Relevance feedback (and sources of obtaining it)
Rocchio algorithm and its variants
User and group modelling, other features for relevance classification
   besides term overlap
Document routing/filtering/topic-classification
Information extraction - named entity recognition/tagging,
   person/place classification, sense tagging - including algorithm
   comparison and understanding of relation to IR algorithms
Expectation Maximization (EM) algorithm (e.g.
   for person/place classification)
Information visualization - Dotplot (and uses for text segmentation,
   detection of version differences and repetition)
HTTP protocols in *detail* (including HTTP/1.0 and HTTP/1.1)
SOIF headers, their motivation and potential uses
Web robot libraries and techniques, robot exclusion protocols,
   queuing strategies (know HW4 in detail)
Harvest architecture in detail (including Gatherer, broker system,
   caching and replication subsystems)
Hierarchy of web agents (from blind web crawlers through
   intelligent shopping bots)
Collection fusion, search-engine merger (e.g. Metacrawler),
   including detailed analysis of the issues, scale normalization
Collaborative filtering
PageRank algorithm and link analysis approaches
Future directions and visions