######################################################
# EXAMPLE MODELS for MACHINE TRANSLATION
######################################################

This is a machine translation model trained with AWS Sockeye.

To run it, follow the instructions for "sockeye-recipes," a wrapper toolkit that enables easy sharing and re-implementation of Sockeye models: https://github.com/kevinduh/sockeye-recipes/

Assuming you have sockeye-recipes installed, these models contain all the parameters needed to decode some test data. You only need to modify one file, "hyperparams.txt", as follows:

* Set modeldir to the directory of the unpacked, downloaded model (this directory).
* Set rootdir to your copy of sockeye-recipes.
* Finally, you should be able to translate by running, e.g.:
  qsub -S /bin/bash -V -cwd -q gpu.q -l gpu=1,h_rt=00:30:00 -j y path-to-sockeye-recipes/scripts/translate.sh -p path-to-this-dir/hyperparams.txt -i input.txt -o output.txt -e sockeye_gpu

input.txt is the file to be translated, and output.txt will be the resulting translation. 
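For reference, the two lines to edit in hyperparams.txt might look like the following. This is a minimal sketch: the variable names modeldir and rootdir are the ones mentioned above, and the paths are placeholders to be replaced with your own.

```shell
# hyperparams.txt (excerpt) -- replace the placeholder paths with your own
rootdir=/path/to/sockeye-recipes
modeldir=/path/to/this-unpacked-model-dir
```

sockeye-recipes reads hyperparams.txt as shell-style variable assignments, so there should be no spaces around the `=`.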

====
Note that input.txt is assumed to contain tokenized data. Refer to the files example_{raw,tokenized}_input.txt for examples. The tokenization instructions are:

General tokenizers:
1) Install Joshua: https://cwiki.apache.org/confluence/display/JOSHUA/ and set $JOSHUA to its base directory
2) Install Moses: https://github.com/moses-smt/mosesdecoder/ and set $MOSES to its base directory

Tokenize Arabic:
1) Install PyArabic: https://pypi.org/project/PyArabic/
2) Run $JOSHUA/scripts/preparation/normalize.pl < $rawtext > $tmpfile
3) Run PyArabic3 on $tmpfile, followed by $MOSES/scripts/tokenizer/normalize-punctuation.perl -l ar | $JOSHUA/scripts/preparation/tokenize.pl

Tokenize Korean:
1) Install Mecab-Ko: https://bitbucket.org/eunjeon/mecab-ko/
2) Run $JOSHUA/scripts/preparation/normalize.pl < $rawtext, then pass the result through Mecab-Ko

Tokenize Russian:
1) cat $rawtext | $JOSHUA/scripts/preparation/normalize.pl ru | $JOSHUA/scripts/preparation/tokenize.pl -l ru | $JOSHUA/scripts/preparation/lowercase.pl 

Tokenize Chinese:
1) Install Jieba: https://github.com/fxsjy/jieba
2) Run $JOSHUA/scripts/preparation/normalize.pl zh < $rawtext | $MOSES/scripts/tokenizer/remove-non-printing-char.perl > $tmpfile
3) Then run Jieba on the resulting $tmpfile with mode cut_all=False
