To reproduce the exact datasets used in the HABLex paper, you must first obtain the Coppa V2 dataset from the World Intellectual Property Organization (WIPO). As of November 2019, the Coppa V2 dataset can be obtained free of charge for research purposes from the WIPO by filling out this form.
After you have the WIPO data, download the HABLex annotations and scripts here.
Next, run ./extract_data.py <COPPA_V2_DIR> to populate the data directory, passing in the directory containing the Coppa V2 dataset as the first argument.
The resulting output data files should contain the following numbers of lines:
| | Dev | Test |
|---|---|---|
| Russian | 9040 | 8001 |
| Korean | 5593 | 5595 |
| Chinese | 1773 | 2289 |
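As a quick sanity check, the counts above can be compared against the extracted files. The sketch below is not part of the HABLex scripts: the helper names are hypothetical, and because this README does not state the output file names, the caller supplies the paths. It assumes the language codes ko, ru, and zh stand for Korean, Russian, and Chinese, and that the files are UTF-8.

```python
# Expected line counts from the table above, keyed by (language code, split).
EXPECTED = {
    ("ru", "dev"): 9040, ("ru", "test"): 8001,
    ("ko", "dev"): 5593, ("ko", "test"): 5595,
    ("zh", "dev"): 1773, ("zh", "test"): 2289,
}


def count_lines(path):
    """Count the lines in a text file (UTF-8 is an assumption)."""
    with open(path, encoding="utf-8") as f:
        return sum(1 for _ in f)


def check_counts(paths):
    """Compare files against EXPECTED.

    `paths` maps (language code, split) to a file path chosen by the caller.
    Returns a list of (key, expected, found) tuples for every mismatch.
    """
    mismatches = []
    for key, path in paths.items():
        found = count_lines(path)
        if found != EXPECTED[key]:
            mismatches.append((key, EXPECTED[key], found))
    return mismatches
```

An empty list from `check_counts` means every supplied file matches the table. Files extracted with discontiguous annotations removed would be expected to differ.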
By default, extract_data.py reproduces the exact data used in the paper. If you would rather exclude all discontiguous annotations, add the "--remove_discont" flag to the command.
The following sections describe preprocessing and the output file format and annotation.
We use a subset of the Coppa V2 data, extracted from the moses/$LNG_en/ directory (for $LNG in ko, ru, zh). The line_ids directory contains one file each for train, dev, and test, with one zero-indexed line ID per line: line_ids/$LNG_en.{train,dev,test}.txt
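To illustrate how zero-indexed line IDs map onto a corpus file, here is a minimal sketch. The function names are hypothetical, the corpus path is left to the caller because the file names inside moses/$LNG_en/ are not listed here, and UTF-8 encoding is an assumption. extract_data.py may do this differently.

```python
def read_line_ids(path):
    """Read a line ID file: one zero-indexed integer per line."""
    with open(path, encoding="utf-8") as f:
        return [int(line) for line in f if line.strip()]


def select_lines(corpus_path, line_ids):
    """Return {line_id: text} for the requested zero-indexed lines."""
    wanted = set(line_ids)
    selected = {}
    with open(corpus_path, encoding="utf-8") as f:
        # enumerate() starts at 0, matching the zero-indexed IDs.
        for idx, line in enumerate(f):
            if idx in wanted:
                selected[idx] = line.rstrip("\n")
    return selected
```

Applying `select_lines` with the same IDs to the source-side and English-side files would yield aligned sentence pairs, since both sides share line numbers.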
We preprocessed the Coppa V2 data using langid, removing lines in which the source-side language appeared on the target side or vice versa. We took the last 6000 lines and split them into test and dev. The lines that resulted are the ones listed in the line ID files.
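The language-ID filter described above might look roughly like the following. This is one reading of the description, not the authors' actual preprocessing code: `filter_pairs` is a hypothetical helper, and the classifier is passed in as a function so the sketch does not depend on any particular library.

```python
def filter_pairs(pairs, src_lang, tgt_lang, classify):
    """Drop sentence pairs whose sides appear to be in the wrong language.

    pairs:    iterable of (source_text, target_text) tuples
    classify: function mapping text to a language code such as "ru" or "en"

    A pair is removed if the source side is identified as the target
    language, or the target side is identified as the source language.
    """
    kept = []
    for src, tgt in pairs:
        if classify(src) == tgt_lang or classify(tgt) == src_lang:
            continue
        kept.append((src, tgt))
    return kept
```

With the langid package installed, a suitable classifier would be `lambda text: langid.classify(text)[0]`, since `langid.classify` returns a (language, score) pair. How the last 6000 lines were divided between test and dev is not specified here, so that step is not sketched.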
The annotations/ directory contains the HABLex annotations as indices into the Coppa V2 data. Each pickled file contains a dictionary whose keys are sentence IDs and whose values hold the annotation data for that sentence.
The output of extract_data.py is a TSV (tab-separated values) file with the following columns:
[SENT_ID] [$LNG PHRASE] [ENG PHRASE] [RARENESS NOTES] [$LNG INDICES] [ENG INDICES] [$LNG DISCONTIGUOUS?] [ENG DISCONTIGUOUS?]

31343 фрагментов бамбука bamboo fragments NA(фрагментов):rare_wrt_I(бамбука) [186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203] [226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241] False False
In this example:
31343 is the line ID of the sentence in the Russian (and corresponding English) Coppa V2 files.
фрагментов бамбука is the Russian phrase
bamboo fragments is the English translation
NA(фрагментов):rare_wrt_I(бамбука) indicates that the first word in the Russian phrase was not rare (NA), while the second was rare within the WIPO data (rare with respect to I=in-domain, vs. G=general domain, vs. GI=both); words in the phrase are separated by ":" (note that we calculated rareness on tokenized data, so these splits reflect our tokenization rather than the raw data) -- the format is [rareness1](token1):[rareness2](token2):...:[rareness-N](token-N)
[186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203] are the character indices of the source phrase
[226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241] are the character indices for the target phrase
False indicates that the source (Russian) phrase is not discontiguous
False indicates that the target (English) phrase is not discontiguous
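The column layout described above can be read with a few lines of Python. The sketch below is illustrative only: the field names and both functions are hypothetical, and it assumes that rows are tab-separated with exactly eight fields, that the file has no header row, that the index lists are written as Python-style list literals, and that tokens contain no ":" or parentheses.

```python
import ast
import re

# Hypothetical field names for the eight columns, in order.
COLUMNS = ["sent_id", "src_phrase", "eng_phrase", "rareness",
           "src_indices", "eng_indices", "src_discont", "eng_discont"]


def parse_row(line):
    """Parse one tab-separated output line into a dict with typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["sent_id"] = int(row["sent_id"])
    # literal_eval safely turns "[186, 187]" into a list of ints.
    row["src_indices"] = ast.literal_eval(row["src_indices"])
    row["eng_indices"] = ast.literal_eval(row["eng_indices"])
    row["src_discont"] = row["src_discont"] == "True"
    row["eng_discont"] = row["eng_discont"] == "True"
    return row


def parse_rareness(notes):
    """Split 'NA(tok1):rare_wrt_I(tok2)' into [(label, token), ...]."""
    return re.findall(r"([^:()]+)\(([^()]*)\)", notes)
```

In the example row, both phrases have as many characters as their index lists have entries (18 for the Russian phrase and 16 for the English one, counting the space), which is consistent with the indices being per-character offsets into the sentence.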
If you use this resource, please cite:
HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation.
Brian Thompson,* Rebecca Knowles,* Xuan Zhang,* Huda Khayrallah, Kevin Duh and Philipp Koehn. EMNLP 2019.