Optimizing Information Extraction over Evolving Text

Fei Chen, University of Wisconsin-Madison

Information extraction (IE) programs automatically extract structured data embedded in text corpora. The extracted data can then be exploited by a wide range of applications, such as structured queries and data mining. As a result, IE is becoming increasingly important in managing text corpora. Most current IE solutions consider only static text corpora, over which IE programs are typically applied just once. However, many real-world text corpora, such as Wikipedia, are evolving: documents can be added, deleted and modified. To keep the extracted information up to date, we must therefore often apply IE programs repeatedly, once to every corpus snapshot. How can we execute such repeated IE efficiently?

In this talk, I will present efficient solutions for IE over evolving text. The underlying idea of these solutions is to recycle previous IE results, given that consecutive corpus snapshots often contain much overlapping text. I will first discuss Cyclex, a system that recycles results for a single IE program. Cyclex models a small set of important properties shared by many practical IE programs to guarantee the correctness of recycling. Furthermore, it recycles efficiently over large-scale text corpora by exploiting several database techniques, such as cost-based optimization and a join-like recycling algorithm. I will then talk about Delex, which builds on Cyclex and recycles results for complex IE workflows that consist of multiple IE programs. Finally, I will conclude with future research directions in deploying and managing IE systems over large-scale text corpora and in developing user-friendly, robust and scalable analysis tools for scientific researchers.
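
As a rough illustration of the recycling idea only (this is not the Cyclex or Delex algorithm), the sketch below reuses extraction results for paragraphs that are unchanged between two snapshots and re-runs the extractor only on new or modified paragraphs. It assumes, as a simplification, that the extractor's output for a paragraph depends on that paragraph alone. The names `extract_years` and `run_with_recycling`, and the paragraph-level granularity, are hypothetical and chosen for the example.

```python
import hashlib
import re
from typing import Callable, Dict, List, Tuple

Extractor = Callable[[str], List[str]]
Cache = Dict[str, List[str]]


def extract_years(paragraph: str) -> List[str]:
    """Toy IE program (hypothetical): extract four-digit years from a paragraph."""
    return re.findall(r"\b(?:19|20)\d{2}\b", paragraph)


def run_with_recycling(
    snapshot: List[str], extractor: Extractor, cache: Cache
) -> Tuple[List[List[str]], Cache, Dict[str, int]]:
    """Apply `extractor` to every paragraph of `snapshot`, reusing cached results.

    Assumption: the extractor's output depends only on the paragraph text,
    so results keyed by a hash of that text are safe to reuse.
    Returns the per-paragraph results, the cache for the next snapshot,
    and counters of how many paragraphs were reused versus re-extracted.
    """
    results: List[List[str]] = []
    new_cache: Cache = {}
    stats = {"reused": 0, "executed": 0}
    for paragraph in snapshot:
        key = hashlib.sha256(paragraph.encode("utf-8")).hexdigest()
        if key in cache:
            # Unchanged text from the previous snapshot: recycle the old result.
            mentions = cache[key]
            stats["reused"] += 1
        elif key in new_cache:
            # Duplicate paragraph within this snapshot: also reusable.
            mentions = new_cache[key]
            stats["reused"] += 1
        else:
            # New or modified text: the extractor must actually run.
            mentions = extractor(paragraph)
            stats["executed"] += 1
        new_cache[key] = mentions
        results.append(mentions)
    return results, new_cache, stats
```

Keying the cache on a hash of the text means deleted paragraphs simply drop out of the next cache, and a modified paragraph is treated as new. Reusing results at a finer granularity than whole paragraphs requires reasoning about which parts of the text an extractor's output depends on, which is where properties of the IE program, such as those Cyclex models, come in.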

Speaker Biography

Fei Chen is a doctoral candidate in the Database group at the University of Wisconsin-Madison. Her dissertation develops solutions for efficient information extraction over large-scale evolving text. Her research interests include managing large-scale text corpora, distributed computing and mining biological sequences.