Picture of Matt Post
Matt Post
post @ cs . jhu . edu

I am a research scientist at the Johns Hopkins Human Language Technology Center of Excellence.

My main research interests are machine translation, syntax, parsing, and language modeling.

I maintain the Joshua decoder.

See my publications, code and datasets, and miscellaneous personal stuff.

Last spring, I taught a class on machine translation with Chris Callison-Burch and Adam Lopez.


We released a set of parallel corpora between English and six languages from the Indian subcontinent, which you can download here.
Map of India with three languages highlighted Map of India with three more languages highlighted

I wrote a jQuery stack decoder to help visualize word-based MT for the MT class. You can play with the live online demo or get the code on GitHub.

You can find data (including the grammar) and code for extracting TSG feature sets on GitHub. This includes a version of Mark Johnson's exhaustive CKY parser, modified to parse with grammars whose rules intermingle terminals and nonterminals and extended with a number of other convenient command-line options.

The code for the experiments in our 2009 paper on inferring tree substitution grammars is available on GitHub. It is small, modular, and well-documented, and despite being written in Perl, I have been told that it is easy to understand. It includes a patch to Mark Johnson's CKY parser that allows it to be used with TSGs.
Picture of a parse tree with TSG annotations

Charniak and Johnson's reranking code (from their 2005 ACL paper) extracts a large set of syntactic features from parse trees. An impediment to using these features is that the extraction code is integrated into their reranking framework, which requires fairly specialized file formats. I modified their extract-spfeatures program so that it can extract the feature set from a single parse tree in standard bracketed format, e.g.,
$ echo "(S (NP (DT The) (NN child)) (VP (VBD demurred)))" | extract-spfeatures
        
It is available on GitHub.
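
For anyone unfamiliar with the bracketed format used in the example above, here is a minimal Python sketch that reads one such tree into nested tuples and recovers its words. It is only an illustration of the input format, not part of extract-spfeatures, and the names parse_bracketed and leaves are hypothetical helpers made up for this sketch.

```python
import re


def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples."""
    # Split into "(", ")", and runs of non-space, non-paren characters.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1                          # consume "("
        label = tokens[pos]               # nonterminal or part-of-speech tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse_node())
            else:
                children.append(tokens[pos])  # a terminal (word)
                pos += 1
        pos += 1                          # consume ")"
        return (label, children)

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def leaves(tree):
    """Return the words of the tree in left-to-right order."""
    _, children = tree
    words = []
    for child in children:
        if isinstance(child, str):
            words.append(child)
        else:
            words.extend(leaves(child))
    return words


tree = parse_bracketed("(S (NP (DT The) (NN child)) (VP (VBD demurred)))")
print(tree[0])        # S
print(leaves(tree))   # ['The', 'child', 'demurred']
```

The sketch assumes a single, well-formed tree per string; it does not handle the empty outer bracket that some treebank files wrap around each tree.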