Reproducing the dataset from "HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation"

Getting the HABLex Dictionaries

To reproduce the exact datasets used in the HABLex paper, first obtain the Coppa V2 dataset from the World Intellectual Property Organization (WIPO). As of November 2019, WIPO provides the Coppa V2 dataset free of charge for research purposes; request it by filling out this form.

After you have the WIPO data, download the HABLex annotations and scripts here.

Next, run ./extract_data.py <COPPA_V2_DIR> to populate the data directory, passing in the directory containing the Coppa V2 dataset as the first argument.

The resulting output data files should contain the following numbers of lines:

Language    Dev     Test
Russian     9040    8001
Korean      5593    5595
Chinese     1773    2289
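A quick way to confirm the counts listed above is to count lines programmatically. The following is a minimal sketch, not part of the HABLex scripts: the file-name pattern `{lng}_en.{split}.tsv` and the helper names are assumptions for illustration, so adjust them to whatever extract_data.py actually writes into the data directory.

```python
from pathlib import Path

# Expected line counts, copied from the numbers listed above.
EXPECTED_LINES = {
    ("ru", "dev"): 9040, ("ru", "test"): 8001,
    ("ko", "dev"): 5593, ("ko", "test"): 5595,
    ("zh", "dev"): 1773, ("zh", "test"): 2289,
}


def count_lines(path):
    """Count the lines in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def check_counts(data_dir, name_template="{lng}_en.{split}.tsv"):
    """Return {(lng, split): (expected, actual)} for every mismatch.

    name_template is a hypothetical naming scheme, not taken from the
    HABLex scripts. A missing file is reported with actual=None.
    """
    mismatches = {}
    for (lng, split), expected in EXPECTED_LINES.items():
        path = Path(data_dir) / name_template.format(lng=lng, split=split)
        actual = count_lines(path) if path.exists() else None
        if actual != expected:
            mismatches[(lng, split)] = (expected, actual)
    return mismatches
```

Calling `check_counts("data")` returns an empty dictionary when every file matches, and otherwise lists each language and split whose count differs.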

By default, extract_data.py reproduces the exact data used in the paper. If you prefer to remove all discontiguous annotations, add the --remove_discont flag to the command.

Additional Notes

The following sections describe preprocessing and the output file format and annotation.

We use a subset of the Coppa V2 data, extracted from the moses/$LNG_en/ directory (for $LNG in ko, ru, zh). The line_ids directory contains one file per language and split (train, dev, and test), each listing one zero-indexed line ID per line: line_ids/$LNG_en.{train,dev,test}.txt
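To illustrate how zero-indexed line IDs map onto a corpus file, here is a minimal sketch. The function names are hypothetical, and the name of the corpus file inside moses/$LNG_en/ is not specified here, so the caller must supply it.

```python
def read_line_ids(path):
    """Read a line ID file: one zero-indexed integer per line."""
    with open(path, encoding="utf-8") as handle:
        return [int(line) for line in handle if line.strip()]


def select_lines(corpus_path, line_ids):
    """Return {line_id: text} for the requested zero-indexed lines.

    The first line of the corpus file has ID 0, matching the
    zero-indexed convention of the line ID files.
    """
    wanted = set(line_ids)
    selected = {}
    with open(corpus_path, encoding="utf-8") as handle:
        for index, line in enumerate(handle):
            if index in wanted:
                selected[index] = line.rstrip("\n")
    return selected
```

Streaming through the corpus once keeps memory use low, since only the selected lines are held in memory.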

We preprocessed the Coppa V2 data using langid, removing lines that had source-side language on the target side or vice versa. We then took the last 6000 lines and split those into test and dev. This resulted in the lines listed in the line ID files.
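The filtering step can be sketched as follows. This is an illustration under stated assumptions, not the code used to build the dataset: the language identifier is passed in as a function, `toy_identify` is a stand-in so the example runs without extra dependencies, and all function names are hypothetical. With the langid package installed, `lambda text: langid.classify(text)[0]` could be passed instead. How the held-out lines were divided between test and dev is not described above, so the sketch stops at taking the tail.

```python
def keep_pair(src, tgt, src_lang, tgt_lang, identify):
    """Keep a pair unless one side is identified as the other side's language."""
    return identify(tgt) != src_lang and identify(src) != tgt_lang


def filter_pairs(pairs, src_lang, tgt_lang, identify):
    """Drop (source, target) pairs that fail the language check."""
    return [
        (src, tgt)
        for src, tgt in pairs
        if keep_pair(src, tgt, src_lang, tgt_lang, identify)
    ]


def last_n(items, n):
    """Return the final n items (for example, n=6000 held-out lines)."""
    return items[-n:] if n > 0 else []


def toy_identify(text):
    """Toy stand-in for a real language identifier (assumption for the demo):
    any Cyrillic character means 'ru', anything else means 'en'."""
    return "ru" if any("\u0400" <= ch <= "\u04ff" for ch in text) else "en"


pairs = [
    ("фрагментов бамбука", "bamboo fragments"),  # kept
    ("bamboo", "bamboo"),                        # English on the source side
    ("бамбука", "бамбука"),                      # Russian on the target side
]
kept = filter_pairs(pairs, "ru", "en", toy_identify)
held_out = last_n(kept, 6000)
```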

Annotation Indices

The annotations/ directory contains HABLex annotations as indices into the Coppa V2 data. Each pickled file contains a dictionary whose keys are sentence IDs and whose values hold the annotation data for that sentence.
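A pickled annotation file can be inspected with the standard pickle module. The sketch below assumes only what is stated above (a dictionary keyed by sentence ID); the function names are hypothetical, and the structure of each entry is deliberately not assumed, so the summary reports only each entry's type. As with any pickle, load only files from a source you trust.

```python
import pickle


def load_annotations(path):
    """Load one pickled annotation dictionary from disk."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def summarize(annotations, limit=3):
    """Return up to `limit` (sentence_id, entry_type_name) pairs.

    Only the type of each entry is reported, because the layout of
    the entries is not specified here.
    """
    sentence_ids = list(annotations)[:limit]
    return [(sid, type(annotations[sid]).__name__) for sid in sentence_ids]
```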

Data Format

The output of extract_data.py is a TSV (tab-separated values) file with the following columns:

[SENT_ID]       [$LNG PHRASE]   [ENG PHRASE]    [RARENESS NOTES]        [$LNG INDICES]  [ENG INDICES]   [$LNG DISCONTIGUOUS?]   [ENG DISCONTIGUOUS?]
31343   фрагментов бамбука      bamboo fragments        NA(фрагментов):rare_wrt_I(бамбука)      [186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203]      [226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241]        False   False

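In this example, sentence 31343 pairs the Russian phrase фрагментов бамбука with the English phrase bamboo fragments. Each index list contains one entry per character of its phrase (18 for the Russian phrase and 16 for the English one, spaces included), which suggests that the indices are character offsets into the corresponding sentence. Both phrases are contiguous, so the last two columns are False.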
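The eight columns can be parsed with the standard library, as in the minimal sketch below. The class, field, and function names are hypothetical. The sketch also assumes that a header line, if present, starts with "[" as in the column listing above, and it skips such lines. `contiguous_only` filters on the two discontiguity columns after the fact; it is not necessarily how --remove_discont is implemented.

```python
import ast
import csv
import io
from typing import NamedTuple


class Annotation(NamedTuple):
    sent_id: str
    src_phrase: str
    eng_phrase: str
    rareness_notes: str
    src_indices: list
    eng_indices: list
    src_discontiguous: bool
    eng_discontiguous: bool


def parse_row(fields):
    """Convert the eight string fields of one row into an Annotation."""
    sent_id, src, eng, notes, src_idx, eng_idx, src_disc, eng_disc = fields
    return Annotation(
        sent_id, src, eng, notes,
        ast.literal_eval(src_idx),  # "[186, 187, ...]" -> list of ints
        ast.literal_eval(eng_idx),
        src_disc == "True",
        eng_disc == "True",
    )


def read_annotations(handle):
    """Parse tab-separated rows, skipping blank lines and a "[...]" header."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        parse_row(row)
        for row in reader
        if len(row) == 8 and not row[0].startswith("[")
    ]


def contiguous_only(rows):
    """Keep rows in which neither phrase is marked discontiguous."""
    return [
        row for row in rows
        if not (row.src_discontiguous or row.eng_discontiguous)
    ]


# The example row shown above, rebuilt with explicit tab separators.
SAMPLE = "\t".join([
    "31343", "фрагментов бамбука", "bamboo fragments",
    "NA(фрагментов):rare_wrt_I(бамбука)",
    str(list(range(186, 204))), str(list(range(226, 242))),
    "False", "False",
])
rows = read_annotations(io.StringIO(SAMPLE + "\n"))
```

The sentence ID is kept as a string so that it is passed through unchanged; convert it with `int()` if you need to look it up against the line ID files.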

Citation

If you use this resource, please cite:
HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation. Brian Thompson,* Rebecca Knowles,* Xuan Zhang,* Huda Khayrallah, Kevin Duh and Philipp Koehn. EMNLP 2019.