Statistical Machine Translation
A course at NASSLLI 2012 taught by Adam Lopez.
Google Translate can instantly translate between any pair of over fifty human languages (for instance, from French to English). How does it do that? Why does it make the errors that it does? And how can you build something better? Modern translation systems learn how to translate by reading millions of words of already translated text, and this course will show you how they work. Despite demonstrable success over the last decade, much work remains to be done, so we will also identify open questions at the heart of current research, as well as computational and linguistic insights that may help solve them. The course covers a diverse set of fundamental building blocks from linguistics, machine learning, algorithms, data structures, and formal language theory, along with their application to a real and difficult problem in artificial intelligence.
Lectures
- Day One: Statistical Machine Translation
- Day Two: Learning Probabilistic Translation Models
- Day Three: Learning (continued), Prediction, and Phrase Modeling
- Day Four: Context-Free Translation Modeling
- Day Five: Evaluation and Discriminative Learning
Exercises
Where to go for more information
Introductory material on machine translation:
- Statistical Machine Translation by Adam Lopez. This is a forty-page self-contained survey of the field of statistical machine translation that was published in ACM Computing Surveys. Much of the course material is based on the survey, and you can find more formal explanations of many of the course concepts there.
- For even more detail, Statistical Machine Translation is an introductory textbook written by Philipp Koehn. Anyone with a serious interest in machine translation should have a copy of it.
- The Machine Translation class at Johns Hopkins University contains a large number of links and resources.
- A Statistical MT Tutorial Workbook by Kevin Knight is a great, informal introduction for beginners. You might consider working through this exercise before moving on to the above texts.
- Philipp Koehn and Chris Callison-Burch previously taught machine translation at ESSLLI in 2005: Day 1, 2, 3, 4, 5.
- statmt.org is a website with pointers to various statistical machine translation resources.
- The MT Archive holds many, if not most, modern research papers on machine translation.
- On the non-technical side, John Hutchins has written several interesting papers about the history of machine translation and its uses.
- Mehryar Mohri's tutorial on weighted finite-state automata, which underlie many translation models.
- David Chiang's gentle tutorial on synchronous grammars, which underlie many other models (slides). I used a few of his examples in my slides.
- Liang Huang's tutorial on dynamic programming.
- Jeff Bilmes wrote a nice tutorial on the Expectation Maximization algorithm.
Software
State-of-the-art translation algorithms are implemented in a number of open-source projects. The most popular of these are listed below. They are all actively maintained and have significant userbases.
- Joshua: a translation toolkit for syntax-based translation, developed at Johns Hopkins (Java).
- Moses: a widely-used toolkit implementing most major translation algorithms (C++).
- cdec: a fast decoder for a variety of translation models (C++).
- KenLM: a fast language-modeling toolkit that can be used with the above systems (C++).
- SRI-LM: a widely-used language modeling toolkit with many features, used with the above systems (C++).
- Giza++: a widely-used word alignment toolkit, originally developed at a Johns Hopkins summer workshop (C++).
- Berkeley Aligner: a robust implementation of several innovative alignment algorithms (Java).
Data
Modern machine translation systems work by learning from large amounts of data. Many datasets are freely available. By feeding one of these datasets into one of the software toolkits above, you can build your own machine translation system in as little as a few hours or days.
- Machine Translation workshop 2011 shared task data, used in research evaluations (French-English, Spanish-English, Czech-English, Haitian Creole-English).
- JRC-Acquis, legislative text of the European Union (22 European languages).
- Europarl, proceedings of the European Parliament (22 European languages).
- Canadian Hansards, proceedings of the Canadian Parliament (French and English).
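To give a feel for what "learning from data" means here, the following is a minimal sketch of one classic technique, IBM Model 1 trained with Expectation Maximization, which estimates word translation probabilities from sentence-aligned text. It is an illustration under simplifying assumptions, not the method of any particular toolkit listed above: the function name `train_ibm_model1`, the toy three-sentence corpus, and the omission of the NULL word are all choices made for this example.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate t[(f, e)] = P(source word f | target word e) with EM.

    bitext is a list of (source_tokens, target_tokens) sentence pairs.
    Simplification for this sketch: no NULL target word is used.
    """
    # Start from a uniform distribution over the source vocabulary.
    src_vocab = {f for src, _ in bitext for f in src}
    uniform = 1.0 / len(src_vocab)
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of e being linked
        # E-step: distribute each source word over the target words
        # of its sentence, in proportion to the current probabilities.
        for src, tgt in bitext:
            for f in src:
                z = sum(t[(f, e)] for e in tgt)
                for e in tgt:
                    c = t[(f, e)] / z
                    count[(f, e)] += c
                    total[e] += c
        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Hypothetical toy corpus (German source, English target).
toy_bitext = [
    ("das haus".split(), "the house".split()),
    ("das buch".split(), "the book".split()),
    ("ein buch".split(), "a book".split()),
]

if __name__ == "__main__":
    model = train_ibm_model1(toy_bitext)
    for (f, e), p in sorted(model.items(), key=lambda kv: -kv[1]):
        print(f"P({f} | {e}) = {p:.3f}")
```

Even on three sentence pairs, the probabilities drift toward the intuitive word pairings because "das" co-occurs with "the" more consistently than with "house" or "book". Real systems apply the same idea, with richer models, to millions of sentence pairs from corpora like those listed above.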
