Here is some code and data developed by my colleagues and me. Feel free to use it for research purposes!
- A collection of training scripts and recipes for the AWS Sockeye Neural Machine Translation toolkit [github]
Pareto Frontier Computation:
- Given a set of K-dimensional datapoints, finds the Pareto Frontier. [C++ & Python code on Github]
- This can be used for multi-objective optimization as in [ACL2012,NAACL2013], among others.
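As a rough illustration of what finding a Pareto frontier involves, here is a minimal Python sketch. It is not the code from the GitHub repository linked above: the function names `dominates` and `pareto_frontier` are hypothetical, and it assumes that larger values are better in every dimension. It uses the straightforward pairwise comparison, which costs O(N^2 K) for N points of dimension K.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if a is at least as good as b in every dimension
    and strictly better in at least one (assumption: larger is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_frontier(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points that no other point dominates, in input order."""
    frontier = []
    for i, p in enumerate(points):
        # p is on the frontier if no other point dominates it.
        if not any(dominates(q, p) for j, q in enumerate(points) if j != i):
            frontier.append(p)
    return frontier


# Example with K = 2, e.g. two evaluation metrics per system.
points = [(1, 5), (2, 4), (3, 3), (2, 2), (1, 1), (3, 1)]
frontier = pareto_frontier(points)
print(frontier)
```

In this example, (2, 2), (1, 1), and (3, 1) are all dominated by (3, 3), so only the other three points remain. If smaller is better for some objective, negating that dimension before calling the function is one simple workaround.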
Neural Language Models infused with WordNet:
- Train word embeddings with both distributional and relational semantics [ICLR2015].
- Currently supports Collobert/Weston NLM + WordNet training; other extensions are under way. [Python-Theano code on Github]
Travatar Tree-to-String Machine Translation System:
- A C++ implementation of a tree-to-string (T2S) decoder, with good results in, e.g., Japanese-English translation [ACL2014].
- Mainly developed by Graham Neubig. Download here: [Project page]
RIBES Machine Translation Evaluation Metric:
- A rank-based metric suitable for evaluating translation of distant language pairs [EMNLP2010] (similar to LRScore)
- Python code: [Download page]
Multitarget TED Talks Task:
- A collection of multitarget bitexts based on TED Talks. [Download page]
- These 20-way parallel dev/test sets can support research in multitarget translation, multisource translation, pivot translation, and analysis of many typologically different languages.
NAIST-NTT TED Talk Treebank:
- A manually annotated treebank of TED talks: [Download page]
- Version 1 contains trees for 1,217 sentences in Penn Treebank format. It consists of 10 talks in English, with time alignments to speech and multilingual alignments to subtitles in 26+ languages.
Crosslingual Annotation of Chinese and English Discourse Connectives:
- An annotation of discourse connectives on 325 articles of the Chinese Treebank and their English translations.
- [Download page]
Fortune Cookie Corpus:
- A yummy corpus full of fortune cookie messages, collected when I was younger. [Download here]
- Description Paper: General Tsao, et al., Automatic Generation of Fortune Cookie Messages, in The First and Last Workshop on Un-natural Language Processing, April 1, 2005.