Word Alignment Annotation

Goals: Machine translation systems are built by automatically extracting patterns from a database of sentence-translation pairs. Sometimes this is hard to do without the help of a bilingual dictionary. For example, suppose we have these pairs of Chinese sentences and their English translations:

1. wo xihuan pingguo - I like apples
2. wo xihuan caomei he xiangjiao - I like strawberries and bananas
3. ta bu xihuan juzi - He doesn't like oranges

From Sentences 1 and 2 we can infer that "wo" in Chinese means "I" in English, because these two words always occur in the same sentence-translation pairs. Similarly, since "xihuan" and "like" always appear in the same pairs, we can infer that they are translations of each other. Then, by a process of elimination, we can assume that "pingguo" translates to "apples" in Sentence 1. Once the machine has learned these translation equivalents, it can translate new sentences. This is the main way in which machine translation systems are developed these days.
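
The inference above can be sketched in a few lines of Python. This is only an illustration of the co-occurrence idea, not the method used by any particular machine translation system: it assumes whitespace tokenization and pairs a Chinese word with an English word only when the two occur in exactly the same sentence pairs. The function names are made up for this example.

```python
from collections import defaultdict

# The three example pairs from the text, tokenized on whitespace.
pairs = [
    ("wo xihuan pingguo", "I like apples"),
    ("wo xihuan caomei he xiangjiao", "I like strawberries and bananas"),
    ("ta bu xihuan juzi", "He doesn't like oranges"),
]

def occurrence_sets(sentences):
    """Map each word to the set of sentence indices in which it appears."""
    occ = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for word in sentence.split():
            occ[word].add(i)
    return occ

def candidate_translations(pairs):
    """For each source word, list the target words that occur in exactly the same pairs."""
    src_occ = occurrence_sets(src for src, _ in pairs)
    tgt_occ = occurrence_sets(tgt for _, tgt in pairs)
    return {
        s: sorted(t for t, t_idx in tgt_occ.items() if t_idx == s_idx)
        for s, s_idx in src_occ.items()
    }

candidates = candidate_translations(pairs)
print(candidates["wo"])       # ['I']
print(candidates["xihuan"])   # ['like']
print(candidates["pingguo"])  # ['apples']
```

With only three sentence pairs, exact matching of occurrence sets is enough to recover "wo", "xihuan", and "pingguo". Real systems work with far larger databases and use statistical models rather than exact matches.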

However, there are cases like "caomei" and "xiangjiao" in Sentence 2 where it is not clear which one refers to "strawberries" and which one refers to "bananas". The database doesn't have enough examples to help us find the pattern. In such cases, it would be helpful to have a bilingual dictionary. The goal of this annotation project is to produce word alignments on some standard databases of sentence-translation pairs. These vetted word alignments will then help us construct a trustworthy bilingual dictionary, which we can then use to support our machine translation research.
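
To make the role of the dictionary concrete, here is a small Python sketch of how a few dictionary entries could resolve the ambiguity in Sentence 2. Everything in it is an assumption made for illustration: the candidate sets are written out by hand, the two-entry seed dictionary is hypothetical, and the function name is invented. It does not describe the format of the dictionary this project will produce.

```python
# After "wo" and "xihuan" are matched, co-occurrence alone cannot separate
# the remaining words of Sentence 2: each could map to any leftover word.
candidates = {
    "caomei": {"strawberries", "and", "bananas"},
    "he": {"strawberries", "and", "bananas"},
    "xiangjiao": {"strawberries", "and", "bananas"},
}

def resolve(candidates, dictionary):
    """Apply dictionary entries first, then repeat a process of elimination."""
    resolved = {}
    for src, tgt in dictionary.items():
        if src in candidates and tgt in candidates[src]:
            resolved[src] = tgt

    changed = True
    while changed:
        changed = False
        used = set(resolved.values())
        for src, options in candidates.items():
            if src in resolved:
                continue
            remaining = options - used
            if len(remaining) == 1:
                # Only one unclaimed target word is left for this source word.
                choice = remaining.pop()
                resolved[src] = choice
                used.add(choice)
                changed = True
    return resolved

# Without a dictionary, nothing can be resolved.
print(resolve(candidates, {}))  # {}

# Hypothetical seed dictionary with two entries; the third follows by elimination.
seed = {"caomei": "strawberries", "he": "and"}
print(resolve(candidates, seed)["xiangjiao"])  # bananas
```

The same idea applies at scale: each vetted alignment removes some uncertainty, which in turn makes the remaining words easier to align.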

To get started:

Annotation instructions: here
Google Drive for wa files: here
If you have questions/suggestions, please email Kevin Duh (kevinduh at cs.jhu.edu) and Rebecca Knowles (rknowle2 at jhu.edu) anytime.