Multi-document Statistical Fact Extraction and Fusion

Gideon Mann

Each year more than 240 terabytes of information in printed material is produced (Lyman and Varian, 2003), far more than humans have the capacity to absorb. While ad-hoc document retrieval (e.g. Web search engines) has sped access to these text collections, users often need information at a finer granularity than an entire document. Information extraction has been proposed as a solution for returning textual information at the granularity of a single fact or relationship, but application of these methods has been limited by the need for extensive manual annotation of training data. In addition, research on information extraction has focused on extraction from single documents in isolation, without regard to the context of the entire corpus.

This talk proposes the use of minimally supervised fact extraction from multiple documents as an enabling component for high-precision information retrieval. Fact extractors (Phrase-Conditional Likelihood, Naive Bayes, and Conditional Random Fields) are trained from a small set of example facts and found text on the web. The trained systems are then used to extract facts from documents retrieved from the web, and fusion methods (Viterbi Frequency, Label Marginal Fusion) pick correct facts from the set of proposed candidates. The performance and utility of these fact extraction and fusion methods are empirically evaluated on four tasks: biographic fact extraction, management succession timeline construction, cross-document coreference resolution, and ontology acquisition for question answering.
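
To make the fusion step concrete, the sketch below shows two simple ways to pick one answer from candidates extracted across many documents. It is only an illustration under stated assumptions, not the talk's definition of Viterbi Frequency or Label Marginal Fusion: the function names `frequency_fusion` and `marginal_fusion`, the per-extraction confidence scores, and the sample birth-year data are all hypothetical.

```python
from collections import Counter, defaultdict


def frequency_fusion(candidates):
    """Return the value proposed most often across documents.

    Assumption: each document contributes its single best extraction,
    so fusion reduces to majority voting over those values.
    """
    counts = Counter(candidates)
    value, _ = counts.most_common(1)[0]
    return value


def marginal_fusion(scored_candidates):
    """Return the value with the highest summed confidence.

    Assumption: each extraction comes with a score in [0, 1], such as
    a label probability from the extractor; scores for the same value
    are added across documents.
    """
    totals = defaultdict(float)
    for value, score in scored_candidates:
        totals[value] += score
    return max(totals, key=totals.get)


# Hypothetical candidate birth years for one person, one per document.
votes = ["1756", "1756", "1791"]
scored = [("1756", 0.2), ("1756", 0.2), ("1791", 0.9)]

by_frequency = frequency_fusion(votes)
by_marginal = marginal_fusion(scored)
```

On this toy input the two strategies disagree: voting favors the value seen in more documents, while summing scores favors the single high-confidence extraction. This is the kind of trade-off a fusion method has to resolve.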