Novel Inference, Training, and Decoding Methods over Translation Forests

Zhifei Li, Johns Hopkins University

A hypergraph, or “packed forest,” is a compact data structure that can represent the exponentially many trees generated (for a given input sentence) by a tree-based machine translation system. This talk presents novel inference, training, and decoding methods that work with hypergraphs.
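To make the data structure concrete, here is a minimal sketch in Python (the class and function names are illustrative, not any particular system's API): each node packs alternative incoming hyperedges, so a small graph encodes a combinatorially large set of trees.

```python
from dataclasses import dataclass, field

@dataclass
class Hyperedge:
    head: "Node"          # the node this edge derives
    tails: list           # ordered child nodes
    weight: float         # e.g., a negative log-probability

@dataclass
class Node:
    label: str                                    # e.g., nonterminal + source span
    incoming: list = field(default_factory=list)  # alternative derivations of this node

def count_trees(node, memo=None):
    """Number of distinct trees packed under `node`."""
    memo = {} if memo is None else memo
    if id(node) in memo:
        return memo[id(node)]
    if not node.incoming:                 # a leaf: exactly one (trivial) tree
        memo[id(node)] = 1
        return 1
    total = 0
    for edge in node.incoming:            # sum over alternative edges...
        n = 1
        for tail in edge.tails:           # ...multiply over children
            n *= count_trees(tail, memo)
        total += n
    memo[id(node)] = total
    return total
```

Because the count multiplies across tails and sums across alternative edges, the number of packed trees can grow exponentially with the depth of the graph while the structure itself stays polynomial in size.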

The hypergraph is weighted, with the weights (often negative log-probabilities) learned by a discriminative training method. One common drawback of such methods is that they rely on high-quality supervised data (i.e., bilingual text), which can be expensive to obtain. We present two unsupervised discriminative training methods, minimum imputed-risk training and contrastive language model estimation, both of which exploit monolingual English data to perform discriminative training.
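As a rough sketch of the imputed-risk idea (the function names below are hypothetical placeholders, not the actual training code): a reverse, English-to-foreign model imputes plausible source sentences for each monolingual English sentence, and the forward model is then trained to minimize the expected loss of translating those imputed sources back into English.

```python
def imputed_risk(english_corpus, reverse_kbest, forward_posteriors, loss):
    """Expected loss under imputed inputs.

    reverse_kbest(y):      imputed source sentences x with probabilities q(x | y)
    forward_posteriors(x): candidate translations y' with probabilities p(y' | x)
    loss(y', y):           a task loss, e.g., 1 - sentence-level BLEU
    """
    risk = 0.0
    for y_star in english_corpus:
        for x, q_x in reverse_kbest(y_star):       # impute a source sentence
            for y, p_y in forward_posteriors(x):   # translate it back
                risk += q_x * p_y * loss(y, y_star)
    return risk
```

The sketch enumerates k-best lists for clarity; in the actual setting such sums run over hypergraphs, where the expectation semirings discussed below make objectives like this, and their gradients, efficient to compute.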

Decoding in this context refers to finding the translation with the maximum posterior probability (i.e., MAP decoding) in the hypergraph. However, this is intractable due to spurious ambiguity, a situation in which the probability of a translation string is split among many distinct derivations (e.g., trees or segmentations). We present a novel technique, variational decoding, which effectively approximates the intractable MAP decoding using variational inference.
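A simplified sketch of the variational idea (helper names are illustrative): fit a tractable n-gram model q to the distribution over derivations by accumulating posterior-weighted n-gram counts, then score candidate strings under q, which aggregates probability across the many derivations of the same string. A real system would compute the expected counts by dynamic programming over the hypergraph rather than by enumerating derivations as done here.

```python
import math
from collections import defaultdict

def fit_variational_ngram(derivations, n=2):
    """derivations: iterable of (translation_tokens, posterior_prob) pairs."""
    num, den = defaultdict(float), defaultdict(float)
    for tokens, p in derivations:
        padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
        for i in range(n - 1, len(padded)):
            ctx, w = tuple(padded[i - n + 1:i]), padded[i]
            num[(ctx, w)] += p          # expected count of the n-gram
            den[ctx] += p               # expected count of its context
    return {k: v / den[k[0]] for k, v in num.items()}

def score(q, tokens, n=2):
    """Log-probability of a candidate string under the variational model q."""
    padded = ["<s>"] * (n - 1) + tokens + ["</s>"]
    return sum(math.log(q.get((tuple(padded[i - n + 1:i]), padded[i]), 1e-12))
               for i in range(n - 1, len(padded)))
```

Decoding then returns the string with the highest score under q; because q factors over n-grams, this maximization is tractable even though the original MAP problem is not.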

Finally, many atomic inference operations on hypergraphs, such as finding the one-best or k-best trees or computing expectations, have wider applicability beyond machine translation. A general framework for achieving these operations is semiring-weighted logic programming. Within this framework, we first extend the expectation semiring, originally proposed for finite-state automata, to hypergraphs. We then propose a novel second-order expectation semiring. These semirings can be used to efficiently compute a variety of frequently used expectations (e.g., entropy and its gradient) over the exponentially many trees in a hypergraph.
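As a concrete illustration of the first-order case (a sketch, not the talk's implementation): each semiring value is a pair (p, r), where p accumulates probability and r accumulates a probability-weighted quantity, so a single inside pass over the hypergraph yields both the partition function Z and an unnormalized expectation.

```python
ZERO, ONE = (0.0, 0.0), (1.0, 0.0)    # additive and multiplicative identities

def splus(a, b):                      # semiring sum: componentwise addition
    return (a[0] + b[0], a[1] + b[1])

def stimes(a, b):                     # semiring product: (p1*p2, p1*r2 + p2*r1)
    return (a[0] * b[0], a[0] * b[1] + a[1] * b[0])

def inside(nodes, incoming):
    """nodes: node ids in topological order (leaves first, root last).
    incoming[v]: list of (tails, (p, r)) hyperedges deriving node v.
    Returns (Z, rbar); the desired expectation is rbar / Z."""
    value = {}
    for v in nodes:
        edges = incoming.get(v, [])
        if not edges:                 # leaf node
            value[v] = ONE
            continue
        total = ZERO
        for tails, k in edges:        # sum over alternative edges...
            prod = k
            for u in tails:           # ...times the inside values of the tails
                prod = stimes(prod, value[u])
            total = splus(total, prod)
        value[v] = total
    return value[nodes[-1]]
```

For example, attaching the value (p_e, p_e * c_e) to each hyperedge e, where c_e counts some event on the edge, makes rbar / Z the posterior expected count of that event; using (p_e, p_e * log p_e) yields the quantity needed for entropy. The second-order semiring extends the pair to a 4-tuple, which is what enables quantities such as the gradient of entropy.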

The modeling, decoding and algorithmic advances are presented in the context of a machine translation task, but we believe that they will also find applications in other areas of natural language processing (e.g., parsing and speech recognition) and more generally in pattern recognition.