Final Examination: Thursday, May 11th, 2023, 6-9 PM EDT - Hackerman B-17 (NOT Krieger, which was last year's room)
(the official exam day/time for the course's timeslot)

The exam will be given in person, and students are allowed 1 double-sided 8.5x11 sheet of paper with notes on both sides. Otherwise, the exam is closed book and closed notes.

Students are explicitly forbidden to access or in any way review or use any prior course examinations from either this course or other courses. The 2023 exam will be a mixture of entirely new questions and questions used at some point in past years. Student performance on each type of question will be aggregated, and a student performing significantly better on previously given questions than on entirely new questions, relative to the rest of the class, will be considered evidence of a violation of this policy.

Questions will be a mixture of multiple choice, short answer, problem solving, compare-contrast, pros-cons comparison, design, and advanced analysis.

Material on the exam will be based on content and topics covered in any of the slide decks for the course (including the 2 textbook authors' slide decks from Stanford and Munich on the textbook website linked to from the course homepage, as well as supplemental slides on the course homepage). Topics covered solely in the textbook and not on any slide deck or in class will generally not be on the exam, although the textbook should be consulted for additional detail and explanations of topics that are on the course slide decks.
Finally, additional details, information, and points made orally during any class by the instructor are fair game for inclusion on the exam. Students are encouraged to review all lecture recordings, both for this additional oral information and for explanations of the content on the slides. Examples include the lectures on DotPlot information visualization (irwa-2021-03-25.mp4, 2nd half, and irwa-2021-04-01.mp4, first half) and on EM (irwa-2021-03-16.mp4), where the slides themselves don't provide all explanatory detail, and presentations such as irwa-2021-04-15.mp4 and later classes covering additional observations; these are just an example subset.

Topics for Exam:
---------------
Boolean IR models (including extensions to basic Boolean models)
Goals and methods of document representation
compression in IR
PAT trees, suffix arrays
Inverted files (creation and use), indexing
Signature files
Vector-based IR models in detail including term weighting, similarity measures, ...
Bayesian IR models
Evaluation metrics: precision, recall, F-measure, normalized recall, accuracy; methods for computation, P_25, P_50, interpolation issues; understanding of issues and challenges in IR evaluation
Query expansion vs. Term clustering
Clustering algorithms, including Salton's greedy method, k-means clustering, hierarchical agglomerative clustering (including algorithm details such as minimal/maximal/average linkage variants, dendrograms, etc.)
SVD (singular value decomposition)/LSI (Latent semantic indexing)
Relevance Feedback (and sources of obtaining it)
Rocchio algorithm and its variants, KNN
User and group modelling, other features for relevance classification besides term overlap
Document routing/filtering/topic-classification
Information Extraction - named entity recognition/tagging, person/place classification, sense tagging - including algorithm comparison and understanding of relation to IR algorithms
Expectation Maximization (EM) algorithm (e.g. for person/place classification)
Information visualization - Dotplot and Hearst's TileBars (and uses for text segmentation, detection of version differences and repetition)
HTTP protocols in *detail* (including HTTP/1.0 and HTTP/1.1)
SOIF headers, their motivation and potential uses
Web robot libraries and techniques, robot exclusion protocols, queuing strategies (know HW4 in detail)
Harvest architecture in detail (including Gatherer, broker system, caching and replication subsystems)
Hierarchy of web agents (from blind web crawlers through intelligent shopping bots)
collection fusion, search-engine merger (e.g. Metacrawler) including detailed analysis of the issues, scale normalization
collaborative filtering and recommender systems
PageRank algorithm and link analysis approaches
Hubs & Authorities model, HITS
Large Language Models in IR
future directions and visions