Statistical Machine Translation

A course at ESSLLI 2010 taught by Adam Lopez.

Machine translation is the automatic translation of human languages, such as English and Chinese. Statistical machine translation refers to a collection of techniques in which machine translation systems automatically learn how to translate by examining a large corpus of human translations. Statistical learning methods make it possible to build a translation system for a new language pair very quickly, even without deep knowledge of both languages. These techniques are popular: most research papers on machine translation presented at major natural language processing conferences focus on statistical methods. They are also effective: Google Translate is able to offer translation for thousands of language pairs due to its use of statistical methods.
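To make the idea of learning to translate concrete, here is a minimal sketch of the noisy-channel view that underlies many statistical systems: choose the output sentence e maximizing P(e) × P(f | e), where a language model scores fluency and a translation model scores adequacy. The candidate translations and probabilities below are invented for illustration; real systems learn these distributions from large corpora.

```python
# Noisy-channel decoding in miniature: argmax over candidate translations.
def best_translation(f, candidates, lm_prob, tm_prob):
    """Return the e maximizing P(e) * P(f | e) over a candidate list."""
    return max(candidates, key=lambda e: lm_prob(e) * tm_prob(f, e))

# Hypothetical toy distributions (made up for this example).
lm = {"the house": 0.2, "house the": 0.01}   # language model P(e)
tm = {("das haus", "the house"): 0.9,        # translation model P(f | e)
      ("das haus", "house the"): 0.9}

best = best_translation(
    "das haus",
    ["the house", "house the"],
    lambda e: lm.get(e, 0.0),
    lambda f, e: tm.get((f, e), 0.0),
)
print(best)  # "the house": both candidates translate well, so the LM breaks the tie
```

Note how the two models divide the labor: the translation model scores both word orders equally here, and the language model's preference for fluent English decides.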

This course will provide a thorough introduction to statistical machine translation. We will describe all aspects of building a statistical machine translation system, from both formal and practical perspectives. Topics include translation modeling, rule induction and parameter learning, search algorithms, engineering techniques, and evaluation of systems. Within each of these areas we will cover a variety of alternatives, from the mainstream to the novel, explaining the current state of the art and identifying the open questions that are the topic of current research.

Lectures

My philosophy is that slides are visual aids for lectures, so the following slides are best understood along with the accompanying talk. While I don't know of any videos of the ESSLLI lectures, there is a video of a shorter, one-day tutorial that I've given on machine translation. Feel free to use any of this material as you like. I'd appreciate an acknowledgement if you do!
  • Day One: an introduction to statistical machine translation: how can machines learn to translate? The associated exercise was developed by Kevin Knight in a really great introductory paper on statistical machine translation.
  • Day Two: probabilistic modeling, language models, and finite-state translation models based on words.
  • Day Three: finite-state translation models based on phrases, and decoding of finite-state models.
  • Day Four: context-free models and their decoding algorithms, and unsupervised learning of translation models.
  • Day Five: evaluation and supervised learning of translation models.
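As a taste of the Day Two material, here is a toy bigram language model with maximum-likelihood estimates. The two-sentence corpus is made up for illustration; real language models are trained on far larger corpora and use smoothing (e.g. Kneser-Ney) to handle unseen bigrams.

```python
# Toy bigram language model: estimate P(w2 | w1) by counting.
from collections import Counter

sentences = [
    "<s> the house is small </s>",
    "<s> the house is big </s>",
]

bigrams = Counter()
unigrams = Counter()
for s in sentences:
    words = s.split()
    unigrams.update(words[:-1])               # histories (</s> is never a history)
    bigrams.update(zip(words, words[1:]))     # adjacent word pairs

def p(w2, w1):
    """Maximum-likelihood estimate of P(w2 | w1)."""
    return bigrams[(w1, w2)] / unigrams[w1]

print(p("house", "the"))  # 1.0: "the" is always followed by "house" here
print(p("small", "is"))   # 0.5: "is" is followed by "small" in one of two sentences
```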

Where to go for more information

Introductory material on machine translation:
  • Statistical Machine Translation by Adam Lopez. This is a forty-page self-contained survey of the field of statistical machine translation that was published in ACM Computing Surveys. Much of the course material is based on the survey, and you can find more formal explanations of many of the course concepts there.
  • For even more detail, Statistical Machine Translation is an introductory textbook written by Philipp Koehn. Anyone with a serious interest in machine translation should have a copy of it.
  • A Statistical MT Tutorial Workbook by Kevin Knight is a great, informal introduction for beginners. You might consider working through this exercise before moving on to the above texts.
  • Philipp Koehn and Chris Callison-Burch previously taught machine translation at ESSLLI in 2005: Day 1, 2, 3, 4, 5.
  • statmt.org is a website with pointers to various statistical machine translation resources.
  • On the non-technical side, John Hutchins has written several interesting papers about the history of machine translation and its uses.
There are many relevant tutorials on the fundamental techniques at the heart of most statistical machine translation systems. Here are a few that I've found useful.

Build your own translation system

Each of these packages is actively maintained and has a significant userbase. For datasets, see statmt.org.
  • cdec: an elegant, very fast decoder for a wide variety of structured learning problems, including phrase-based and SCFG-based translation.
  • Moses: a widely-used, complete toolkit containing implementations of phrase-based and SCFG-based translation models.
  • Joshua: an open-source Java implementation of SCFG-based translation.
  • SRILM: a widely-used language modeling toolkit with many features.
  • GIZA++: an implementation of the expectation-maximization algorithm for the IBM and HMM models, used for word alignment.
  • Berkeley Aligner: a robust Java implementation of several innovative alignment algorithms.
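To illustrate what the alignment tools above compute, here is a minimal sketch of EM training for IBM Model 1, the simplest of the models that GIZA++ implements (real systems add null alignments, higher-order models, and much larger corpora; the three-sentence corpus and iteration count here are made up).

```python
# IBM Model 1 EM training in miniature: learn word translation
# probabilities t(f | e) from unaligned sentence pairs.
from collections import defaultdict

corpus = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

# Initialize t(f | e) uniformly over the foreign vocabulary.
f_vocab = {f for fs, _ in corpus for f in fs}
t = defaultdict(lambda: 1.0 / len(f_vocab))

for _ in range(20):                          # EM iterations
    count = defaultdict(float)               # expected counts c(f, e)
    total = defaultdict(float)               # expected counts c(e)
    for fs, es in corpus:
        for f in fs:
            norm = sum(t[(f, e)] for e in es)
            for e in es:
                delta = t[(f, e)] / norm     # E-step: posterior alignment probability
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():          # M-step: renormalize
        t[(f, e)] = c / total[e]

# t("haus" | "house") approaches 1: "haus" only ever co-occurs with "house",
# while "das" is pulled toward "the", which it co-occurs with twice.
```

The key intuition is that EM bootstraps: ambiguous co-occurrences in one sentence pair are resolved by unambiguous evidence from others, and each iteration sharpens the distributions.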