Radu Florian's Talk

Title: Transformation Based Learning and Corpus Based Lexical Disambiguation: Syntactic and Semantic Ambiguity Resolution

Resolving the ubiquitous ambiguity found in human language is a central task in natural language processing. Whether the ambiguity is syntactic (e.g., a word that can function as different parts of speech) or semantic (e.g., a word with several senses), disambiguation is a key first step in most language processing tasks, such as machine translation, information retrieval, and question answering.

This talk presents a broad survey of statistical machine learning techniques for multilingual lexical syntactic and semantic disambiguation. It focuses on several original and empirically successful data-driven machine learning algorithms in the transformation-based learning framework, and presents effective models for classifier combination and minimally supervised learning.

The tasks investigated in the talk include multilingual named entity recognition, part-of-speech tagging, text chunking, word segmentation, and word-sense disambiguation, over a diverse set of languages: Basque, Chinese, Czech, Dutch, English, Estonian, Italian, Spanish, and Swedish. The resulting systems performed highly competitively in international system bake-offs: first place in 6 of the 7 languages tested (among 36 participating systems) in the SENSEVAL2 word-sense disambiguation evaluation, and second place (among 12 systems) in the CoNLL'02 shared task on multilingual named entity recognition.

Last modified: Sep 16 2002