Statistical Machine Translation

A course at NASSLLI 2012 taught by Adam Lopez.

Google Translate can instantly translate between any pair of over fifty human languages (for instance, from French to English). How does it do that? Why does it make the errors that it does? And how can you build something better? Modern translation systems learn how to translate by reading millions of words of already translated text, and this course will show you how they work. Despite demonstrable success over the last decade, much work remains to be done, so we will also identify open questions at the heart of current research, as well as computational and linguistic insights that may help solve them. The course covers a diverse set of fundamental building blocks from linguistics, machine learning, algorithms, data structures, and formal language theory, along with their application to a real and difficult problem in artificial intelligence.

Lectures

Exercises

Where to go for more information

There are many relevant tutorials on the fundamental techniques at the heart of most statistical machine translation systems. Here are a few that I've found useful.

Software

State-of-the-art translation algorithms are implemented in a number of open-source projects. The most popular of these are listed below. They are all actively maintained and have significant user bases.
  • Joshua: a translation toolkit for syntax-based translation, developed at Johns Hopkins (Java).
  • Moses: a widely-used toolkit implementing most major translation algorithms (C++).
  • cdec: a fast decoder for a variety of translation models (C++).
  • KenLM: a fast language-modeling toolkit that can be used with the above systems (C++).
  • SRI-LM: a widely-used language modeling toolkit with many features, used with the above systems (C++).
  • Giza++: a widely-used word alignment toolkit, originally developed at a Johns Hopkins summer workshop (C++).
  • Berkeley Aligner: a robust implementation of several innovative alignment algorithms (Java).

Data

Modern machine translation systems work by learning from large amounts of data. Many datasets are freely available. By feeding one of these datasets into one of the software toolkits above, you can build your own machine translation system in as little as a few hours or days.
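
To give a feel for what "learning from data" can mean here, the following is a minimal sketch of one classic technique, IBM Model 1 word-translation estimation with expectation maximization. It is not the code of any toolkit listed above. The function name `train_ibm_model1`, the three-sentence toy corpus, and the omission of the usual NULL word are all assumptions made for illustration.

```python
from collections import defaultdict


def train_ibm_model1(bitext, iterations=10):
    """Estimate t[(f, e)] ~ P(f | e) from sentence pairs with EM.

    bitext: list of (source_tokens, target_tokens) pairs.
    Simplification (assumption): no NULL word is added to the target side.
    """
    source_vocab = {f for fs, _ in bitext for f in fs}
    # Start from a uniform distribution over source words.
    t = defaultdict(lambda: 1.0 / len(source_vocab))

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of (f, e) links
        total = defaultdict(float)  # expected count of links from e

        # E-step: split each source word's unit of mass among the
        # target words in the same sentence pair.
        for fs, es in bitext:
            for f in fs:
                norm = sum(t[(f, e)] for e in es)
                for e in es:
                    frac = t[(f, e)] / norm
                    count[(f, e)] += frac
                    total[e] += frac

        # M-step: renormalize the expected counts into probabilities.
        t = defaultdict(
            float, {(f, e): c / total[e] for (f, e), c in count.items()}
        )
    return t


# Hypothetical toy corpus: German source, English target.
bitext = [
    ("das Haus".split(), "the house".split()),
    ("das Buch".split(), "the book".split()),
    ("ein Buch".split(), "a book".split()),
]

model = train_ibm_model1(bitext)
for f in ["das", "Haus", "Buch"]:
    print(f, "| the:", round(model[(f, "the")], 3))
```

Even on three sentence pairs, the estimates shift toward the word that co-occurs most consistently with "the". Real systems apply the same idea, along with richer models, to millions of words of text.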