Translation Discovery Using Diverse Similarity Measures

Charles Schafer

Automatic translation between natural languages using empirical learning methods is an active research area. Much of the work in this area is concentrated on acquiring models of translation from expensive resources, including large databases containing hundreds of thousands of translated sentences, which are available for only a small number of the language pairs between which one might wish to translate. This talk describes the automatic acquisition of translations into English for unknown foreign language words given the lack of any high-quality translation learning resources, which is the reality for nearly all of the world’s languages. The methods presented in this talk require minimal training supervision; the necessary inputs consist of only a few hundred thousand words of raw news text in a language of interest, such as Uzbek, and a translation dictionary into English for some related language, such as Turkish. We show how diverse measures of cross-language word similarity can be combined to learn translations for a considerable subset of the vocabulary of a language for which no translations are initially known. These similarities include statistical models of related-language cognate words, various cross-language comparisons of how words are used in monolingual text, and the distribution of terms in dated news articles. The ultimate result is a functional translation capability for new language pairs using minimal training resources and almost no human supervision.
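
As a rough illustration of how several cross-language similarity measures might be combined, the sketch below scores English translation candidates for an Uzbek word by bridging through a Turkish-English dictionary. It is a minimal sketch under stated assumptions, not the method presented in the talk: the string-matching ratio stands in for a statistical cognate model, the weighted sum stands in for whatever combination scheme is actually used, and the function names, weights, dictionary entries, and date counts are all hypothetical.

```python
from difflib import SequenceMatcher
from math import sqrt

def cognate_similarity(a, b):
    # Stand-in for a learned cognate model: plain string similarity in [0, 1].
    return SequenceMatcher(None, a, b).ratio()

def cosine(u, v):
    # Cosine similarity between two equal-length count vectors.
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_candidates(word, word_dates, bridge_dict, english_dates,
                    w_cognate=0.5, w_date=0.5):
    # bridge_dict: related-language word -> English translation.
    # english_dates: English word -> occurrence counts per time period.
    # The weights are assumed values, not tuned or taken from the talk.
    scored = []
    for bridge_word, english in bridge_dict.items():
        cog = cognate_similarity(word, bridge_word)
        date = cosine(word_dates, english_dates[english])
        scored.append((english, w_cognate * cog + w_date * date))
    # Highest combined score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy data, invented for illustration only.
turkish_to_english = {"kitap": "book", "kapı": "door", "kalem": "pen"}
english_date_counts = {
    "book": [2, 0, 1, 5],
    "door": [0, 3, 2, 0],
    "pen": [1, 1, 1, 1],
}
uzbek_word = "kitob"
uzbek_date_counts = [3, 0, 1, 4]

ranking = rank_candidates(uzbek_word, uzbek_date_counts,
                          turkish_to_english, english_date_counts)
for english, score in ranking:
    print(f"{english}\t{score:.3f}")
```

With this toy data, "book" ranks first because it scores well on both signals at once. The point of combining measures is that either signal alone can be noisy, while agreement between them is stronger evidence for a translation.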