# Reproducing the dataset from "HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine Translation"

### Getting the HABLex Dictionaries

To reproduce the exact datasets used in the HABLex paper, you must first
obtain the Coppa V2 dataset from the World Intellectual Property Organization
(WIPO) (https://www.wipo.int/portal/en/index.html). As of November 2019, the
Coppa V2 dataset can be obtained free of charge for research purposes from the
WIPO by filling out this form:
https://www.wipo.int/patentscope/en/data/forms/products.jsp

Next, populate the data directory by running extract_data.py, passing the
directory containing the Coppa V2 dataset as the first argument:

    ./extract_data.py <COPPA_V2_DIR>

The resulting output data files should contain the following numbers of lines:

| Language |  Dev   |  Test  |
|----------|--------|--------|
| Russian  |  9040  |  8001  |
| Korean   |  5593  |  5595  |
| Chinese  |  1773  |  2289  |
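
To check the counts above, a small script along these lines may help. It is only a sketch: the README does not name the output files, so the caller supplies the paths, and `count_lines` and `check_counts` are hypothetical helpers that are not part of this repository.

```python
# Expected line counts from the table above, keyed by (language, split).
EXPECTED_LINES = {
    ("Russian", "dev"): 9040, ("Russian", "test"): 8001,
    ("Korean", "dev"): 5593, ("Korean", "test"): 5595,
    ("Chinese", "dev"): 1773, ("Chinese", "test"): 2289,
}


def count_lines(path):
    """Count lines the way `wc -l` roughly does, splitting on "\n" only."""
    with open(path, "rb") as f:
        return sum(1 for _ in f)


def check_counts(paths):
    """Compare actual line counts against the table.

    `paths` maps (language, split) to an output file path. The paths are an
    assumption supplied by the caller; this README does not specify them.
    Returns a list of (key, expected, actual) tuples for every mismatch.
    """
    mismatches = []
    for key, path in paths.items():
        actual = count_lines(path)
        if actual != EXPECTED_LINES[key]:
            mismatches.append((key, EXPECTED_LINES[key], actual))
    return mismatches
```

An empty list from `check_counts` means every supplied file matches the table.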
  
extract_data.py reproduces the exact data used in the paper. If you prefer to
remove all discontiguous annotations, add the "--remove_discont" flag to the
command.

### Additional Notes

The following sections describe the preprocessing, the annotation indices, and
the output file format.

We use a subset of CoppaV2 data, extracted from the moses/$LNG_en/ directory
(for $LNG in ko, ru, zh). The line_ids directory contains files with one line
ID (zero-indexed) per line for train, dev, and test:
line_ids/$LNG_en.{train,dev,test}.txt
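
As an illustration of how the line ID files relate to the corpus, the sketch below selects the listed lines from a text file. It assumes the IDs are zero-indexed integers, one per line, as described above. The corpus file name inside moses/$LNG_en/ is not given here, so both paths are passed in by the caller, and `read_line_ids` and `select_lines` are hypothetical helpers.

```python
def read_line_ids(ids_path):
    """Read one zero-indexed integer line ID per line, skipping blank lines."""
    with open(ids_path, encoding="utf-8") as f:
        return [int(line) for line in f if line.strip()]


def select_lines(corpus_path, line_ids):
    """Return the corpus lines with the given zero-indexed IDs, in ID order."""
    wanted = set(line_ids)
    found = {}
    # newline="\n" keeps line numbering tied to "\n" only, so stray "\r"
    # characters inside a sentence do not shift the IDs.
    with open(corpus_path, encoding="utf-8", newline="\n") as f:
        for i, line in enumerate(f):
            if i in wanted:
                found[i] = line.rstrip("\n")
    # Raises KeyError if an ID lies beyond the end of the file.
    return [found[i] for i in line_ids]
```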

We preprocessed the CoppaV2 data using langid
(https://github.com/saffsd/langid.py). We removed lines in which the
source-side language appeared on the target side or vice versa. We then took
the last 6000 lines and split them into test and dev. This resulted in the
lines listed in the line ID files.
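
The language filter could look roughly like the sketch below. This is not the script we ran: the exact rule (drop a pair when the classifier labels the source text as the target language, or the target text as the source language) is an assumption based on the description above, and `is_swapped` and `filter_pairs` are hypothetical names. The only langid call used is `langid.classify`, which returns a (language, score) tuple.

```python
def langid_classify(text):
    """Return the language code langid assigns to `text`."""
    import langid  # third-party package; imported lazily so stubs can be used
    return langid.classify(text)[0]


def is_swapped(src_text, tgt_text, src_lang, tgt_lang, classify=langid_classify):
    """True if either side is identified as the other side's language."""
    return classify(src_text) == tgt_lang or classify(tgt_text) == src_lang


def filter_pairs(pairs, src_lang, tgt_lang, classify=langid_classify):
    """Yield (index, src, tgt) for the sentence pairs that pass the filter."""
    for i, (src, tgt) in enumerate(pairs):
        if not is_swapped(src, tgt, src_lang, tgt_lang, classify):
            yield i, src, tgt
```

Passing the classifier as an argument keeps the sketch testable without langid installed, for example by substituting a stub function.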

### Annotation Indices

The annotations/ directory contains HABLex annotations as indices into the
CoppaV2 data. Each pickled file contains a dictionary whose keys are
sentence IDs and whose values hold the annotations for that sentence.
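
A minimal sketch for inspecting one of these files follows. The file names inside annotations/ and the structure of each value are not documented here, so the path is supplied by the caller and the values are treated as opaque. `load_annotations` and `sorted_sentence_ids` are hypothetical helpers. As with any pickle, only load files you trust.

```python
import pickle


def load_annotations(path):
    """Unpickle one annotation file and check that it is a dictionary."""
    with open(path, "rb") as f:
        annotations = pickle.load(f)
    if not isinstance(annotations, dict):
        raise TypeError(f"expected a dict, got {type(annotations).__name__}")
    return annotations


def sorted_sentence_ids(annotations):
    """Return the sentence IDs (the dictionary keys) in sorted order."""
    return sorted(annotations)
```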

### Data Format

The output of extract_data.py is a TSV (tab-separated values) file, with the
following columns:

        
    [SENT_ID]       [$LNG PHRASE]   [ENG PHRASE]    [RARENESS NOTES]        [$LNG INDICES]  [ENG INDICES]   [$LNG DISCONTIGUOUS?]   [ENG DISCONTIGUOUS?]
    31343   фрагментов бамбука      bamboo fragments        NA(фрагментов):rare_wrt_I(бамбука)      [186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203]      [226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241]        False   False

In this example:

  * 31343 is the line in the Russian (and corresponding English) CoppaV2 files. 

  * фрагментов бамбука is the Russian phrase 

  * bamboo fragments is the English translation 

  * NA(фрагментов):rare_wrt_I(бамбука) indicates that the first word in the Russian phrase was not rare (NA), while the second was rare within the WIPO data (rare with respect to I=in-domain, vs. G=general domain, vs. GI=both). Words in the phrase are separated by ":", and the format is [rareness1](token1):[rareness2](token2):...:[rareness-N](token-N). Note that we calculated rareness on tokenized data, so these splits reflect our tokenization rather than the raw data.

  * [186, 187, 188, 189, 190, 191, 192, 193, 194, 195, 196, 197, 198, 199, 200, 201, 202, 203] are the character indices of the source phrase 

  * [226, 227, 228, 229, 230, 231, 232, 233, 234, 235, 236, 237, 238, 239, 240, 241] are the character indices for the target phrase 

  * The first False indicates that the source (Russian) phrase is not discontiguous

  * The second False indicates that the target (English) phrase is not discontiguous
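
The column layout above can be read with a few lines of Python. The sketch below is an illustration under stated assumptions, not part of extract_data.py: it assumes every row has exactly eight tab-separated fields, that the index columns are Python-style integer lists, and that each rareness item has the form label(token). `parse_row`, `parse_rareness`, and `is_contiguous` are hypothetical names.

```python
import ast
import re

# One rareness item, e.g. "rare_wrt_I(бамбука)" -> label "rare_wrt_I", token "бамбука".
_RARENESS_ITEM = re.compile(r"^(?P<label>[^()]+)\((?P<token>.*)\)$")


def parse_rareness(notes):
    """Split rareness notes into (label, token) pairs.

    Splitting on ":" is an assumption; a token that itself contains ":"
    would need more careful handling.
    """
    pairs = []
    for item in notes.split(":"):
        match = _RARENESS_ITEM.match(item)
        if match is None:
            raise ValueError(f"unexpected rareness item: {item!r}")
        pairs.append((match.group("label"), match.group("token")))
    return pairs


def parse_row(line):
    """Parse one output line into a dictionary with typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    sent_id, src, eng, rareness, src_idx, eng_idx, src_disc, eng_disc = fields
    return {
        "sent_id": int(sent_id),
        "src_phrase": src,
        "eng_phrase": eng,
        "rareness": rareness,  # kept raw; see parse_rareness
        "src_indices": ast.literal_eval(src_idx),
        "eng_indices": ast.literal_eval(eng_idx),
        "src_discont": src_disc == "True",
        "eng_discont": eng_disc == "True",
    }


def is_contiguous(indices):
    """True if the character indices form one unbroken ascending run."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))
```

In the example row, both phrases are contiguous, so each index list has exactly as many entries as its phrase has characters (18 for the Russian phrase, 16 for the English one).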

### Citation

If you use this resource, please cite:  
HABLex: Human Annotated Bilingual Lexicons for Experiments in Machine
Translation. Brian Thompson,* Rebecca Knowles,* Xuan Zhang,* Huda Khayrallah,
Kevin Duh and Philipp Koehn. EMNLP 2019.
