Modeling, Summarizing and Translating Speech

The maturing of speech recognition and corpus-based natural language processing has led to many practical applications in human-machine and human-human interaction that use both technologies. Speech processing is ultimately about detecting, finding and translating pertinent information in the spoken input rather than producing a word-by-word transcription.

In this talk, I will give an overview of our research over the last seven years at HKUST, which combines automatic speech recognition and natural language processing for spontaneous speech modeling, speech topic detection and summarization, and speech translation. The main challenge of these tasks lies in discovering critical information in large amounts of unstructured, spontaneous, often accented, and multilingual speech. To this end, we propose that:

" A common acoustic model for speech recognition of multiple languages can be achieved by bootstrapping from a single language. " Spontaneous and accented speech recognition can be best achieved by differentiating between phonetic and acoustic changes. " Spontaneous and colloquial speech recognition can be made efficient by statistical learning of a spontaneous speech grammar. " The best context information for translation disambiguation in a mixed language query is the most salient trigger word. " Topic detection and summarization of multilingual, multimodal and multiple documents can be efficiently achieved by a unified segmental HMM framework. " Fixed-point front end processing, discrete HMMs, and unambiguous inversion transduction grammars provide the optimal performance and speed tradeoff for speech translation on portable devices.

I will also discuss our contributions in mining and collecting large amounts of speech and text data for the above research.

Speaker Biography

Pascale Fung received her PhD from Columbia University in 1997. She is one of the founding faculty members of the Human Language Technology Center (HLTC) at HKUST. She is the co-editor of the Special Issue on Learning in Speech and Language Technologies of the Machine Learning journal. She has been on the organizing committee of the Association for Computational Linguistics (ACL)’s SIGDAT, and served as area chair for ACL and chair for the Conference on Empirical Methods in Natural Language Processing (EMNLP), as well as co-chair of SemaNet 2002 at Coling. Pascale was the team leader for Pronunciation Modeling of Mandarin Casual Speech at the 2000 Johns Hopkins Summer Workshop. She has served as a program committee member for numerous international conferences and technical publications. She is a Senior Member of the Institute of Electrical and Electronics Engineers (IEEE) and a Member of the Association for Computational Linguistics (ACL).

During her professional leave from 2000 to 2002, Pascale Fung co-founded and became the CTO and CEO of a Silicon Valley-based multinational company specializing in developing and marketing speech and natural language solutions for internet and corporate customers. She was a Member of Technical Staff and later a Consultant at AT&T Bell Labs from 1993 to 1997. During 1991-1992, she was an Associate Scientist at BBN Systems & Technologies (Cambridge, MA), participating in the design and implementation of the BYBLOS speech recognition system. She was a visiting researcher at LIMSI, Centre National de la Recherche Scientifique (France) in 1991, working on speaker adaptation and French speech recognition. From 1989 to 1991, she was a research student in the Department of Information Science, Kyoto University (Japan), working on Japanese phoneme recognition and speaker adaptation. Prior to that, she was an exchange student at Ecole Centrale Paris (France), working on speech recognition. A fluent speaker of six European and Asian languages, she has been particularly interested in multilingual speech and natural language issues.