HOW-TO GUIDE: Installing and running the Joshua Decoder

by Chris Callison-Burch (Released: June 12, 2009)

This document gives instructions on how to install and use the Joshua decoder. Joshua is an open-source decoder for parsing-based machine translation. Joshua uses the synchronous context-free grammar (SCFG) formalism in its approach to statistical machine translation, and the software implements the algorithms that underlie the approach. These instructions will tell you how to:
If you use Joshua in your work, please cite this paper:

Zhifei Li, Chris Callison-Burch, Chris Dyer, Juri Ganitkevitch, Sanjeev Khudanpur, Lane Schwartz, Wren Thornton, Jonathan Weese and Omar Zaidan, 2009. Joshua: An Open Source Toolkit for Parsing-based Machine Translation. In Proceedings of the Workshop on Statistical Machine Translation (WMT09). [pdf] [bib]

These instructions apply to Release 1.3 of Joshua, which is described in our WMT09 paper. You can also get the latest version of the Joshua software from the repository with the command:

svn checkout https://joshua.svn.sf.net/svnroot/joshua/trunk joshua

Step 1: Install the software

Prerequisites

The Joshua decoder is written in Java. You'll need to install a few software development tools before you install it:
Before installing these, you can check whether they're already on your system by typing their version commands (see the sketch below). In addition to these software development tools, you will also need to download:
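For example, assuming the prerequisites include a Java development kit, Apache Ant, and Subversion (all of which are used elsewhere in this guide), you could run:

java -version     # Java JDK
ant -version      # Apache Ant, used to compile Joshua
svn --version     # Subversion, used to check out the Joshua source

If any of these commands is not found, install the corresponding tool before continuing.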
After you have downloaded the srilm tar file, type the following commands to install it:

mkdir srilm
mv srilm.tgz srilm/
cd srilm/
tar xfz srilm.tgz
make

If the build fails, please follow the instructions in SRILM's INSTALL file. For instance, if SRILM's Makefile does not detect that you're running 64-bit Linux, you might have to run "make MACHINE_TYPE=i686-m64 World". After you successfully compile SRILM, Joshua will need to know what directory it is in. You can type

export SRILM="/path/to/srilm"

where "/path/to/srilm" is replaced with your path. You'll also need to set a JAVA_HOME environment variable pointing to your Java installation, for example:

export JAVA_HOME="/Library/Java/Home"

These variables will need to be set every time you use Joshua, so it's useful to add them to your .bashrc, .bash_profile or .profile file.

Download and Install Joshua

First, download the Joshua release 1.3 tar file. Next, type the following commands to untar the file and compile the Java classes:

tar xfz joshua.tar.gz
cd joshua
ant

Running

For the examples in this document, you will need to set a JOSHUA environment variable pointing to your Joshua installation:

export JOSHUA="/path/to/joshua/trunk"

Run the example model
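For convenience, you can collect all three environment variables in your shell startup file. A minimal sketch (the paths are placeholders that you must adjust for your own system):

# add to ~/.bashrc, ~/.bash_profile, or ~/.profile
export SRILM="/path/to/srilm"
export JAVA_HOME="/Library/Java/Home"     # adjust to your JDK location
export JOSHUA="/path/to/joshua/trunk"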
To make sure that the decoder is installed properly, we'll translate 5 sentences using a small translation model that loads quickly. The sentences that we will translate are contained in example/example.test.in:
科学家 为 攸关 初期 失智症 的 染色体 完成 定序
( 法新社 巴黎 二日 电 ) 国际 间 的 一 群 科学家 表示 , 他们 已 为 人类 第十四 对 染色体 完成 定序 , 这 对 染色体 与 许多 疾病 有关 , 包括 三十几 岁 者 可能 罹患 的 初期 阿耳滋海默氏症 。
这 是 到 目前 为止 完成 定序 的 第四 对 染色体 , 它 由 八千七百多万 对 去氧 核糖核酸 ( dna ) 组成 。
英国 自然 科学 周刊 发表 的 这 项 研究 显示 , 第十四 对 染色体 排序 由 一千零五十 个 基因 和 基因 片段 构成 。
基因 科学家 的 目标 是 , 提供 诊断 工具 以 发现 致病 的 缺陷 基因 , 终而 提供 可 阻止 这些 基因 产生 障碍 的 疗法 。
The small translation grammar contains 15,939 rules; you can count them with a command like the one sketched below.
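A minimal sketch, where $GRAMMAR stands for the grammar file used by the example model (the path given by the tm_file entry in example/example.config.srilm); if the file is gzipped, pipe it through gunzip -c first:

wc -l "$GRAMMAR"

A sample of the rules in the grammar looks like this: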
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists to [X,2] ||| 2.17609119 0.333095818 1.53173875
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of the [X,1] scientists ||| 2.47712135 0.333095818 2.17681264
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of [X,1] scientists ||| 2.47712135 0.333095818 1.13837981
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] [X,1] scientists ||| 2.47712135 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists [X,2] ||| 1.01472330 0.333095818 0.218843221
[X] ||| [X,1] 科学家 [X,2] ||| [X,2] of scientists of [X,1] ||| 2.47712135 0.333095818 2.05791640
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,1] for [X,2] ||| 2.47712135 0.333095818 2.05956721
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientist [X,2] ||| 1.63202321 0.303409695 0.977472364
[X] ||| [X,1] 科学家 [X,2] ||| [X,1] scientists , [X,2] ||| 2.47712135 0.333095818 1.68990576
[X] ||| [X,1] 科学家 [X,2] ||| scientists [X,2] [X,1] ||| 2.47712135 0.333095818 0.218843221
The different parts of each rule are separated by the ||| delimiter: the left-hand side non-terminal, the source phrase, the target phrase, and the translation model feature scores.
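If you want to pull out an individual field, you can split on that delimiter. A small sketch, again using $GRAMMAR as a placeholder for the example grammar file:

awk 'BEGIN { FS = " [|][|][|] " } { print $3 }' "$GRAMMAR" | head -5

This prints the target-language side of the first five rules.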
You can use the grammar to translate the test set by running:

java -Xmx1g -cp $JOSHUA/bin \
  -Djava.library.path=$JOSHUA/lib \
  -Dfile.encoding=utf8 joshua.decoder.JoshuaDecoder \
  example/example.config.srilm \
  example/example.test.in \
  example/example.nbest.srilm.out

For those of you who aren't very familiar with Java: -Xmx1g sets the maximum heap size, -cp gives the classpath to the compiled Joshua classes, the -D flags set Java system properties (the path to the native SRILM wrapper library and the UTF-8 file encoding), and the three positional arguments are the decoder configuration file, the input file of test sentences, and the output file for the n-best translations.
You can inspect the output file, example/example.nbest.srilm.out. The first few lines look like this:
0 ||| scientists to vital early 失智症 the chromosome completed has ||| -127.759 -6.353 -11.577 -5.325 -3.909 ||| -135.267
0 ||| scientists for vital early 失智症 the chromosome completed has ||| -128.239 -6.419 -11.179 -5.390 -3.909 ||| -135.556
0 ||| scientists to related early 失智症 the chromosome completed has ||| -126.942 -6.450 -12.716 -5.764 -3.909 ||| -135.670
0 ||| scientists to vital early 失智症 the chromosomes completed has ||| -128.354 -6.353 -11.396 -5.305 -3.909 ||| -135.714
0 ||| scientists to death early 失智症 the chromosome completed has ||| -127.879 -6.575 -11.845 -5.287 -3.909 ||| -135.803
0 ||| scientists as vital early 失智症 the chromosome completed has ||| -128.537 -6.000 -11.384 -5.828 -3.909 ||| -135.820
0 ||| scientists for related early 失智症 the chromosome completed has ||| -127.422 -6.516 -12.319 -5.829 -3.909 ||| -135.959
0 ||| scientists for vital early 失智症 the chromosomes completed has ||| -128.834 -6.419 -10.998 -5.370 -3.909 ||| -136.003
0 ||| scientists to vital early 失智症 completed the chromosome has ||| -127.423 -7.364 -11.577 -5.325 -3.909 ||| -136.009
0 ||| scientists to vital early 失智症 of chromosomes completed has ||| -127.427 -7.136 -11.612 -5.816 -3.909 ||| -136.086
This file contains the n-best translations under the model. The 10 lines shown above are the 10 best translations of the first sentence. Each line contains 4 fields: the first field is the index of the sentence (index 0 for the first sentence), the second field is the translation, the third field contains each of the individual feature function scores for the translation (language model, rule translation probability, lexical translation probability, reverse lexical translation probability, and word penalty), and the final field is the overall score. To get the 1-best translations for each sentence in the test set without all of the extra information, you can run the following command:

java -Xmx1g -cp $JOSHUA/bin \
  -Dfile.encoding=utf8 joshua.util.ExtractTopCand \
  example/example.nbest.srilm.out \
  example/example.nbest.srilm.out.1best

You can then look at the 1-best output file by typing a command like the one sketched below.
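For example (the file name is simply the output argument passed to ExtractTopCand above):

cat example/example.nbest.srilm.out.1best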
scientists to vital early 失智症 the chromosome completed has
( , paris 2 ) international a group of scientists said that they completed to human to chromosome 14 has , the chromosome with many diseases , including more years , may with the early 阿耳滋海默氏症 .
this is to now completed has in the fourth chromosome , which 八千七百多万 to carry when ( dna ) .
the weekly british science the study showed that the chromosome 14 are by 一千零五十 genes and gene fragments .
the goal of gene scientists is to provide diagnostic tools to found of the flawed genes , are still provide a to stop these genes treatments .
Step 2: Prepare the training data

Download the WMT09 Spanish-English data (data.tar.gz), then untar it and tokenize both sides of the Europarl training corpus:

tar xfz data.tar.gz
cd data/
gunzip -c es-en/full-training/europarl-v4.es-en.es.gz \
  | perl scripts/tokenizer.perl -l es \
  > es-en/full-training/training.es.tok
gunzip -c es-en/full-training/europarl-v4.es-en.en.gz \
  | perl scripts/tokenizer.perl -l en \
  > es-en/full-training/training.en.tok
After tokenization, we recommend that you normalize your data by lowercasing it. The system treats words with variant capitalization as distinct, which can lead to worse probability estimates for their translation, since the counts are fragmented. For other languages you might want to normalize the text in other ways.
You can lowercase your tokenized data with the following script:
cat es-en/full-training/training.en.tok \
  | perl scripts/lowercase.perl \
  > es-en/full-training/training.en.tok.lc
cat es-en/full-training/training.es.tok \
  | perl scripts/lowercase.perl \
  > es-en/full-training/training.es.tok.lc
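As an optional sanity check, the Spanish and English files should contain the same number of lines, since the corpus is sentence-aligned:

wc -l es-en/full-training/training.es.tok.lc es-en/full-training/training.en.tok.lc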
The untokenized file looks like this (gunzip -c es-en/full-training/europarl-v4.es-en.en.gz | head -3):
Resumption of the session
I declare resumed the session of the European Parliament adjourned on Friday 17 December 1999, and I would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period.
Although, as you will have seen, the dreaded 'millennium bug' failed to materialise, still the people in a number of countries suffered a series of natural disasters that truly were dreadful.
After tokenization and lowercasing, the file looks like this (head -3 es-en/full-training/training.en.tok.lc):
resumption of the session
i declare resumed the session of the european parliament adjourned on friday 17 december 1999 , and i would like once again to wish you a happy new year in the hope that you enjoyed a pleasant festive period .
although , as you will have seen , the dreaded ' millennium bug ' failed to materialise , still the people in a number of countries suffered a series of natural disasters that truly were dreadful .
You must preprocess your dev and test sets in the same way you preprocess your training data. Run the following commands on the data that you downloaded:
cat es-en/dev/news-dev2009.es \
  | perl scripts/tokenizer.perl -l es \
  | perl scripts/lowercase.perl \
  > es-en/dev/news-dev2009.es.tok.lc
cat es-en/dev/news-dev2009.en \
  | perl scripts/tokenizer.perl -l en \
  | perl scripts/lowercase.perl \
  > es-en/dev/news-dev2009.en.tok.lc
cat es-en/test/newstest2009.es \
  | perl scripts/tokenizer.perl -l es \
  | perl scripts/lowercase.perl \
  > es-en/test/newstest2009.es.tok.lc
cat es-en/test/newstest2009.en \
  | perl scripts/tokenizer.perl -l en \
  | perl scripts/lowercase.perl \
  > es-en/test/newstest2009.en.tok.lc
Sometimes the amount of training data is so large that it makes creating word alignments extremely time-consuming and memory-intensive. We therefore provide a facility for subsampling the training corpus to select sentences that are relevant for a test set.
mkdir es-en/full-training/subsampled
echo "training" > es-en/full-training/subsampled/manifest
cat es-en/dev/news-dev2009.es.tok.lc es-en/test/newstest2009.es.tok.lc \
  > es-en/full-training/subsampled/test-data
java -Xmx1000m -Dfile.encoding=utf8 -cp "$JOSHUA/bin:$JOSHUA/lib/commons-cli-2.0-SNAPSHOT.jar" \
  joshua.subsample.Subsampler \
  -e en.tok.lc \
  -f es.tok.lc \
  -epath es-en/full-training/ \
  -fpath es-en/full-training/ \
  -output es-en/full-training/subsampled/subsample \
  -ratio 1.04 \
  -test es-en/full-training/subsampled/test-data \
  -training es-en/full-training/subsampled/manifest
You can see how much the subsampling step reduces the training data by typing wc -lw es-en/full-training/training.??.tok.lc es-en/full-training/subsampled/subsample.??.tok.lc:
1411589 39411018 training/training.en.tok.lc
1411589 41042110 training/training.es.tok.lc
 671429 16721564 training/subsampled/subsample.en.tok.lc
 671429 17670846 training/subsampled/subsample.es.tok.lc
Step 3: Create word alignments

Before extracting a translation grammar, we first need to create word alignments for our parallel corpus. In this example, we show you how to use the Berkeley aligner. You may also use Giza++ to create the alignments, although that program is a little unwieldy to install.
To run the Berkeley aligner you first need to set up a configuration file, which defines the models that are used to align the data, how the program runs, and which files are to be aligned. Here is an example configuration file (you should create your own version of this file and save it as training/word-align.conf):
## word-align.conf
## ----------------------
## This is an example training script for the Berkeley
## word aligner. In this configuration it uses two HMM
## alignment models trained jointly and then decoded
## using the competitive thresholding heuristic.

##########################################
# Training: Defines the training regimen
##########################################
forwardModels	MODEL1 HMM
reverseModels	MODEL1 HMM
mode	JOINT JOINT
iters	5 5

###############################################
# Execution: Controls output and program flow
###############################################
execDir	alignments
create
saveParams	true
numThreads	1
msPerLine	10000
alignTraining

#################
# Language/Data
#################
foreignSuffix	es.tok.lc
englishSuffix	en.tok.lc

# Choose the training sources, which can either be directories or files that list files/directories
trainSources	subsampled/
sentences	MAX

#################
# 1-best output
#################
competitiveThresholding
To run the Berkeley aligner, first set an environment variable saying where the aligner's jar file is located (this environment variable is just used for convenience in this document, and is not necessary for running the aligner in general):

export BERKELEYALIGNER="/path/to/berkeleyaligner/dir"
You'll need to create an empty directory called example/test:

cd es-en/full-training/
mkdir -p example/test
After you've created the directory and the word-align.conf file, run the aligner with the following command:

nohup java -d64 -Xmx10g -jar $BERKELEYALIGNER/berkeleyaligner.jar ++word-align.conf &
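While the aligner runs, you can watch its progress; because of the nohup invocation above, its log output goes to nohup.out in the current directory:

tail -f nohup.out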
If the program finishes right away, then it probably terminated with an error. You can read the nohup.out file to see what went wrong. When you are aligning tens of millions of words worth of data, the word alignment process will take several hours to complete. While it is running, you can skip ahead and complete step 4, but not step 5. After you get comfortable using the aligner and after you've run through the whole Joshua training sequence, you can try experimenting with the amount of training data, the number of training iterations, and different alignment models (the Berkeley aligner supports Model 1, a Hidden Markov Model, and a syntactic HMM).

Step 4: Train a language model

Most translation models also make use of an n-gram language model as a way of assigning higher probability to hypothesis translations that look like fluent examples of the target language. Joshua provides support for n-gram language models, either through a built-in data structure, or through external calls to the SRI language modeling toolkit (srilm). To use large language models, we recommend srilm. If you successfully installed srilm in Step 1, then you should be able to train a language model with the following command:

mkdir -p model/lm
$SRILM/bin/macosx64/ngram-count \
  -order 3 \
  -unk \
  -kndiscount1 -kndiscount2 -kndiscount3 \
  -text training/training.en.tok.lc \
  -lm model/lm/europarl.en.trigram.lm

(Note: the above assumes that you are on a 64-bit machine running Mac OS X. If that's not the case, your path to ngram-count will be slightly different.)

This will train a trigram language model on the English side of the parallel corpus. We use the -unk option so that the model maps out-of-vocabulary words to a special unknown-word token, and the -kndiscount options so that each n-gram order is smoothed with modified Kneser-Ney discounting.
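As an optional sanity check, you can score some held-out text with the new language model. The sketch below assumes the same macosx64 binary directory as above and assumes the tokenized, lowercased dev set sits at dev/news-dev2009.en.tok.lc (the path used in the MERT configuration later); adjust the paths to match your layout:

$SRILM/bin/macosx64/ngram \
  -order 3 \
  -unk \
  -lm model/lm/europarl.en.trigram.lm \
  -ppl dev/news-dev2009.en.tok.lc

ngram -ppl reports the log-probability and perplexity of the text; an implausibly large perplexity usually means the LM path or the preprocessing is wrong.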
Given that the English side of the parallel corpus is a relatively small amount of data in terms of language modeling, it only takes a few minutes to output the LM. The uncompressed LM is 144 megabytes.

Step 5: Extract a translation grammar

We'll use the word alignments to create a translation grammar similar to the Chinese one shown in Step 1. The translation grammar is created by looking for the places where the foreign-language phrases from the test set occur in the training set, and then using the word alignments to figure out which English phrases they are aligned to.

Create a suffix array index

To find the foreign phrases of the test set in the training data, we first create an easily searchable index, called a suffix array, for the training data.

java -Xmx500m -cp $JOSHUA/bin/ \
  joshua.corpus.suffix_array.Compile \
  training/subsampled/subsample.es.tok.lc \
  training/subsampled/subsample.en.tok.lc \
  training/subsampled/training.en.tok.lc-es.tok.lc.align \
  model

This compiles the index that Joshua will use for its rule extraction, and puts it into a directory named model (the last argument of the command).

Extract grammar rules for the dev set

The following command will extract a translation grammar from the suffix array index of your word-aligned parallel corpus, where the grammar rules apply to the foreign phrases in the dev set:
mkdir mert
java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
joshua.prefix_tree.ExtractRules \
./model \
mert/news-dev2009.es.tok.lc.grammar.raw \
dev/news-dev2009.es.tok.lc &
Next, sort the grammar rules and remove the redundancies with the following Unix command:

sort -u mert/news-dev2009.es.tok.lc.grammar.raw \
  -o mert/news-dev2009.es.tok.lc.grammar

You will also need to create a small "glue grammar", in a file called model/hiero.glue (the path that the decoder configuration below points to), containing the following two rules:
[S] ||| [X,1] ||| [X,1] ||| 0 0 0
[S] ||| [S,1] [X,2] ||| [S,1] [X,2] ||| 0.434294482 0 0
Step 6: Run minimum error rate training

After we've extracted the grammar for the dev set, we can run minimum error rate training (MERT). MERT is a method for setting the weights of the different feature functions in the translation model so as to maximize translation quality on the dev set. Translation quality is calculated according to an automatic metric, such as BLEU. Our implementation of MERT allows you to easily implement some other metric and optimize your parameters to that. There's even a YouTube tutorial to show you how. To run MERT, you will first need to create a few files:
Create a MERT configuration file. In this example we name the file mert/mert.config and give it the following contents:
### MERT parameters
# target sentences file name (in this case, file name prefix)
-r      dev/news-dev2009.en.tok.lc
-rps    1                               # references per sentence
-p      mert/params.txt                 # parameter file
-m      BLEU 4 closest                  # evaluation metric and its options
-maxIt  10                              # maximum MERT iterations
-ipi    20                              # number of intermediate initial points per iteration
-cmd    mert/decoder_command            # file containing commands to run decoder
-decOut mert/news-dev2009.output.nbest  # file produced by decoder
-dcfg   mert/joshua.config              # decoder config file
-N      300                             # size of N-best list
-v      1                               # verbosity level (0-2; higher value => more verbose)
-seed   12341234                        # random number generator seed
You can see a list of the other parameters available in our MERT implementation by running this command:

java -cp $JOSHUA/bin joshua.zmert.ZMERT -h

Next, create a file called mert/params.txt (the parameter file referenced by -p above) with the following contents:
lm ||| 1.000000 Opt 0.1 +Inf +0.5 +1.5
phrasemodel pt 0 ||| 1.066893 Opt -Inf +Inf -1 +1
phrasemodel pt 1 ||| 0.752247 Opt -Inf +Inf -1 +1
phrasemodel pt 2 ||| 0.589793 Opt -Inf +Inf -1 +1
wordpenalty ||| -2.844814 Opt -Inf +Inf -5 0
normalization = absval 1 lm
Next, create a file called mert/decoder_command (the file referenced by -cmd above), containing the command used to run the decoder:
java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
  joshua.decoder.JoshuaDecoder \
  mert/joshua.config \
  dev/news-dev2009.es.tok.lc \
  mert/news-dev2009.output.nbest
Next, create a configuration file for Joshua at mert/joshua.config (the file referenced by -dcfg above):
lm_file=model/lm/europarl.en.trigram.lm

tm_file=mert/news-dev2009.es.tok.lc.grammar
tm_format=hiero

glue_file=model/hiero.glue
glue_format=hiero

#lm config
use_srilm=true
lm_ceiling_cost=100
use_left_equivalent_state=false
use_right_equivalent_state=false
order=3

#tm config
span_limit=10
phrase_owner=pt
mono_owner=mono
begin_mono_owner=begin_mono
default_non_terminal=X
goalSymbol=S

#pruning config
fuzz1=0.1
fuzz2=0.1
max_n_items=30
relative_threshold=10.0
max_n_rules=50
rule_relative_threshold=10.0

#nbest config
use_unique_nbest=true
use_tree_nbest=false
add_combined_cost=true
top_n=300

#remote lm server config, we should first prepare remote_symbol_tbl before starting any jobs
use_remote_lm_server=false
remote_symbol_tbl=./voc.remote.sym
num_remote_lm_servers=4
f_remote_server_list=./remote.lm.server.list
remote_lm_server_port=9000

#parallel decoder: it cannot be used together with remote lm
num_parallel_decoders=1
parallel_files_prefix=/tmp/

###### model weights
#lm order weight
lm 1.0
#phrasemodel owner column(0-indexed) weight
phrasemodel pt 0 1.4037585111897322
phrasemodel pt 1 0.38379188013385945
phrasemodel pt 2 0.47752204361625605
#arityphrasepenalty owner start_arity end_arity weight
#arityphrasepenalty pt 0 0 1.0
#arityphrasepenalty pt 1 2 -1.0
#phrasemodel mono 0 0.5
#wordpenalty weight
wordpenalty -2.721711092619053
Finally, run the command to start MERT:

nohup java -cp $JOSHUA/bin \
  joshua.zmert.ZMERT \
  -maxMem 1500 mert/mert.config &

While MERT is running, you can skip ahead to the first part of the next step and extract the grammar for the test set.

Step 7: Decode a test set

When MERT finishes, it will output a file called mert/joshua.config.ZMERT.final, which contains the optimized feature weights.

Extract grammar rules for the test set

Before decoding the test set, you'll need to extract a translation grammar for the foreign phrases in the test set:
java -Dfile.encoding=UTF8 -Xmx1g -cp $JOSHUA/bin \
joshua.prefix_tree.ExtractRules \
./model \
test/newstest2009.es.tok.lc.grammar.raw \
test/newstest2009.es.tok.lc &
Next, sort the grammar rules and remove the redundancies with the following Unix command:

sort -u test/newstest2009.es.tok.lc.grammar.raw \
  -o test/newstest2009.es.tok.lc.grammar

Once the grammar extraction has completed, copy the configuration file that MERT produced:

cp mert/joshua.config.ZMERT.final test/joshua.config

You'll need to edit the config file to replace the dev-set grammar in the tm_file entry with the test-set grammar, test/newstest2009.es.tok.lc.grammar. Then run the decoder on the test set:

java -Xmx1g -cp $JOSHUA/bin/ -Djava.library.path=$JOSHUA/lib -Dfile.encoding=utf8 \
  joshua.decoder.JoshuaDecoder \
  test/joshua.config \
  test/newstest2009.es.tok.lc \
  test/newstest2009.output.nbest

After the decoder has finished, you can extract the 1-best translations from the n-best list using the following command:

java -cp $JOSHUA/bin -Dfile.encoding=utf8 \
  joshua.util.ExtractTopCand \
  test/newstest2009.output.nbest \
  test/newstest2009.output.1best

Step 8: Recase and detokenize

You'll notice that your output is all lowercased and has the punctuation split off. To make the output more readable to human beings (remember us?), it'd be good to fix these problems and restore proper capitalization, punctuation, and spacing. These steps are called recasing and detokenization, respectively. We can do recasing using SRILM, and detokenization with a Perl script.

To build a recasing model, first train a 5-gram language model on the true-cased (tokenized but not lowercased) English text:

$SRILM/bin/macosx64/ngram-count \
  -unk \
  -order 5 \
  -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 \
  -text training/training.en.tok \
  -lm model/lm/training.TrueCase.5gram.lm

Next, you'll need to create a list of all of the alternative ways that each word can be capitalized. This will be stored in a map file that lists a lowercased word as the key and associates it with all of the variant capitalizations of that word. Here's an example Perl script to create the map:
#!/usr/bin/perl
#
# truecase-map.perl
# -----------------
# This script outputs alternate capitalizations
%map = ();
while($line = <>) {
@words = split(/\s+/, $line);
foreach $word (@words) {
$key = lc($word);
$map{$key}{$word} = 1;
}
}
foreach $key (sort keys %map) {
@words = keys %{$map{$key}};
if(scalar(@words) > 1 || !($words[0] eq $key)) {
print $key;
foreach $word (sort @words) {
print " $word";
}
print "\n";
}
}
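Each line of the map file that this script produces starts with the lowercased key, followed by the capitalization variants observed in the corpus. A couple of purely illustrative (hypothetical) entries:

brussels Brussels
new NEW New new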
Run the script over the true-cased training text to build the map:

cat training/training.en.tok | perl truecase-map.perl > model/lm/true-case.map

Finally, recase the lowercased 1-best translations by running the SRILM disambig tool:

$SRILM/bin/macosx64/disambig \
  -lm model/lm/training.TrueCase.5gram.lm \
  -keep-unk \
  -order 5 \
  -map model/lm/true-case.map \
  -text test/newstest2009.output.1best \
  | perl strip-sent-tags.perl > test/newstest2009.output.1best.recased

Here, strip-sent-tags.perl is a short script that removes the <s> and </s> sentence-boundary markers that disambig adds to its output:
#!/usr/bin/perl
#
# strip-sent-tags.perl
# --------------------
# Removes the <s> and </s> markers that disambig adds around each line.
while($line = <>) {
  $line =~ s/^\s*<s>\s*//g;
  $line =~ s/\s*<\/s>\s*$//g;
  print $line . "\n";
}
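Detokenization can then be applied to the recased output. A minimal sketch, assuming the detokenizer.perl companion of the tokenizer.perl script used earlier is available in the same scripts/ directory (if it is not, the standard Moses detokenizer.perl works the same way):

cat test/newstest2009.output.1best.recased \
  | perl scripts/detokenizer.perl -l en \
  > test/newstest2009.output.1best.recased.detok

This rejoins punctuation to the preceding words and produces the final, human-readable translations.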