HOW-TO GUIDE for Extracting Syntactically Constrained Paraphrases
This document gives instructions on how to use the software and data from my EMNLP 2008 paper, entitled Syntactic Constraints on Paraphrases Extracted from Parallel Corpora. There are two intended audiences for this step-by-step guide: individuals who want to generate paraphrases to use in their own applications, and researchers who want to recreate my results and extend the method that I proposed. Steps 1-4 are for people in the first category, and the remaining steps are for people in the second category.
The materials that I provide include the following:
- The source code for my paraphrase extraction methods (both the baseline and the syntactically constrained versions).
- The complete set of training data that I used in the paper. This includes 10 bilingual parallel corpora that have been automatically word-aligned, suffix array indexes for them, parses for the English side of the parallel corpora, and a trigram language model.
- The test sentences and the complete set of paraphrases generated for all phrases that occur in the test sets which are up to five words long.
- The judgments that were collected during the manual evaluation process, and the perl scripts that I used to calculate the results and the inter-annotator agreement.
In order to run the software on the full data sets you'll need a 64-bit machine with a large amount of memory (I use 10 GB). The data sets are quite large, so you'll also need 6 GB of hard drive space.
Step 1: Download the data
You can download the data here:
After untarring the files by typing tar xf emnlp08.tar on the command line, you should have the following files:
Each of the subdirectories in emnlp08/training/ contains files similar to these:
The parses file contains parses of the English side of the parallel corpus, produced by the Bikel parser trained on the Penn Treebank. The vocab files are suffix-array indexes of each side of the parallel corpus. The alignments file contains word alignments produced by Giza++ and the Moses toolkit. The europarl files contain the plain text of the parallel corpus; they are sentence-aligned and contain one sentence per line.
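The sentence-aligned format is easy to sanity-check: both sides of a corpus must have exactly the same number of lines. A minimal sketch of that check, using toy stand-in files (the real europarl file names in your download may differ):

```shell
#!/bin/sh
# Create two toy "parallel corpus" files, one sentence per line.
# These stand in for the real europarl files described above.
printf 'resumption of the session\ni declare resumed the session\n' > toy.en
printf 'genoptagelse af sessionen\njeg erklaerer sessionen genoptaget\n' > toy.da

# A sentence-aligned corpus must have the same line count on both sides.
en_lines=$(wc -l < toy.en)
da_lines=$(wc -l < toy.da)
if [ "$en_lines" -eq "$da_lines" ]; then
    echo "aligned: $en_lines sentence pairs"
else
    echo "MISALIGNED: $en_lines vs $da_lines lines" >&2
    exit 1
fi
```

Running the same line-count comparison on each corpus pair before step 2 can save you from confusing failures later on.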
Step 2: Edit the configuration file
Before running the software you'll need to edit the paraphrase.properties.en configuration file located in the emnlp08/software/ directory. You'll need to change the lines in bold to be the absolute paths to your copy of the training data.
# The source language is the language to paraphrase
# The parallel corpora are listed with three values:
# They should be in the appropriate format and use
# the naming conventions for a SuffixArrayParallelCorpus
# An arbitrary number of parallel corpora can be specified
# The working directory is where the program writes incremental
# results which are deleted before the program exits
# The sample size specifies how many occurrences of
# a phrase and its translations should be examined
# when determining translation model probabilities.
# This value specifies the maximum number of paraphrases
# to output for each phrase
# This value is the minimum paraphrase probability
# that will be printed.
You will also need to create a temporary directory to store the intermediate paraphrases that are extracted from each of the parallel corpora: mkdir tmp. Make sure that the paraphrases.working_directory field in the config file points to this directory.
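If you'd rather script this part of the setup, here is a sketch. The file name paraphrase.properties.example is my placeholder so the sketch doesn't clobber anything — in practice you edit the real paraphrase.properties.en in place, and paraphrases.working_directory is the only key named in this guide (consult the comments in the real file for the rest):

```shell
#!/bin/sh
# Create the temporary working directory for intermediate paraphrases.
mkdir -p tmp

# Write a config fragment pointing the working directory at it.
# paraphrase.properties.example is a placeholder name; in practice,
# set this key inside the real paraphrase.properties.en.
cat > paraphrase.properties.example <<EOF
paraphrases.working_directory=$(pwd)/tmp
EOF

cat paraphrase.properties.example
```

Using an absolute path ($(pwd)/tmp) matters because the program may be launched from a different directory than the one you configured it in.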
Step 3: Assemble a list of phrases
Before you run the software you'll need to create a file containing a list of the phrases that you want to paraphrase, with one phrase per line. Your tokenization scheme should match the one used for the English side of the parallel corpus, and the text should be lowercased. You can run the following commands to tokenize and lowercase a file of sentences and enumerate all of its phrases:
cat sentence-file | perl software/tokenizer.perl -l en | perl software/lowercase.perl > sentence-file.tokenized
java -cp software/linearb.jar phd.util.EnumPhrases sentence-file.tokenized 4 | sort | uniq > phrases
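If you're curious what the EnumPhrases step produces, the enumeration itself is simple: emit every n-gram up to the length limit (4 in the command above), then deduplicate. Here is a rough shell approximation — my assumption of what phd.util.EnumPhrases does, not its actual source:

```shell
#!/bin/sh
# Toy one-sentence input file, already tokenized and lowercased.
printf 'the commission accepted many\n' > toy.tokenized

# Enumerate all n-grams up to length 4, one phrase per line,
# then sort and deduplicate -- approximating the EnumPhrases step.
awk '{
    for (i = 1; i <= NF; i++) {                  # start position
        phrase = ""
        for (n = 0; n < 4 && i + n <= NF; n++) { # lengths 1..4
            phrase = (n == 0 ? $(i+n) : phrase " " $(i+n))
            print phrase
        }
    }
}' toy.tokenized | sort | uniq > toy.phrases

wc -l < toy.phrases
```

For this four-token sentence the sketch yields 10 distinct phrases (4 unigrams, 3 bigrams, 2 trigrams, 1 four-gram), which matches the kind of list shown below.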
When you test the paraphrase model for the first time, you might want to create a short list of phrases (100 phrases or so), so that you can quickly figure out whether you're having any problems. After that you can try paraphrasing hundreds of thousands of phrases, although running on a long list of phrases will take some time.
For this example, I'm choosing 100 phrases from the middle of the corpus by running:
zcat paraphrases/phrases-to-paraphrase.gz | head -750000 | tail -100 > phrases. The first few phrases are:
the commission about article
the commission about article 4
the commission about the
the commission about the need
the commission accepted
the commission accepted many
the commission accepted many of
the commission accepts
the commission accepts amendment
the commission accepts amendment no
Step 4: Run the paraphrase extraction code
After you assemble the list of phrases that you want to paraphrase, you can extract syntactically constrained paraphrases by running the following command to start the software:
nohup java -d64 -Xmx10g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinderWithSyntacticConstraints software/paraphrase.properties.en phrases phrases.paraphrased &
For those of you who aren't very familiar with Java, the arguments are the following:
-d64 -- this tells Java to operate in 64-bit mode, which allows the program to use more than 2GB of memory.
-Xmx10g -- this tells Java that it may use up to 10GB of memory. If you don't have a computer with 10GB you could try using less memory, but I'm afraid that since the paraphrasing is so data-intensive it requires a lot of memory.
-Dfile.encoding=utf8 -- this tells Java that the files are all encoded in UTF-8.
-cp software/linearb.jar -- this tells Java to use the linearb.jar jar file. If you want to look at the source code you can extract it by running the command
jar xf software/linearb.jar
edu.jhu.paraphrasing.ParaphraseFinderWithSyntacticConstraints -- This is the class which is run. If you extract the source code from the jar file you can find the source for this class in
software/paraphrase.properties.en -- This is the paraphrase configuration file that you edited in step 2.
phrases -- This is the input file containing the list of phrases that you created in step 3.
phrases.paraphrased -- This is the output file that the paraphrases will be written to.
As the code is running it will write intermediate files into the tmp/ directory that you specified in the paraphrase.properties.en file. It will create files containing paraphrases extracted from each of the individual parallel corpora before aggregating these in a final step. You can monitor these as the program is running. You can also follow the progress by looking at the nohup.out file, which will contain infrequent messages like:
Loading the europarl en-da parallel corpus ...
Loading parse trees from /Volumes/Models/emnlp/emnlp08/training/da-en/en_europarl_parses.txt
Looking up paraphrases in the europarl en-da corpus
When the program finishes running the tab-delimited output file should contain paraphrases that look like this (
head -25 phrases.paraphrased):
S the commission accepted the commission accepted 0.72946429
S the commission accepted the commission has accepted 0.11517857
S the commission accepted following on from the commission ' s comments 0.0625
S the commission accepted the committee could have delivered 0.04464286
S the commission accepted were accepted by the commission 0.03125
S the commission accepted the committee has taken on board 0.0125
S/(VP/NP NP) the commission accepted the commission accepted 0.875
S/(VP/NP NP) the commission accepted the commission adopted 0.125
S/(VP/NP PP) the commission accepted the commission accepted 0.875
S/(VP/NP PP) the commission accepted the commission approves 0.125
S/(VP/NP) the commission accepted the commission accepted 0.86237599
S/(VP/NP) the commission accepted the commission had 0.05555556
S/(VP/NP) the commission accepted the commission adopted 0.03687169
S/(VP/NP) the commission accepted the commission accept 0.01481481
S/(VP/NP) the commission accepted the commission approved 0.01388889
S/(VP/NP) . the commission accepted the commission accepted 0.92708333
S/(VP/NP) . the commission accepted the commission took up 0.04166667
S/(VP/NP) . the commission accepted the committee also accepted 0.03125
S/(VP/PP) the commission accepted the commission accepted 0.9
S/(VP/PP) the commission accepted the commission agreed 0.1
S/(VP/SBAR) . the commission accepted the commission accepted 0.55357143
S/(VP/SBAR) . the commission accepted the commission said 0.13214286
S/(VP/SBAR) . the commission accepted the commission agreed 0.1
S/(VP/SBAR) . the commission accepted the commission stated 0.07857143
S/(VP/SBAR) . the commission accepted the commission considered 0.05714286
The first column contains the syntactic label assigned to the phrase and the paraphrase in one or more sentences in the parallel corpora. Note that this can be a simple label like S or a more complex CCG-style label like S/(VP/NP NP), which indicates a sentence that is missing a verb phrase to its right, where that VP is in turn missing two noun phrases to its right. See the paper for more details about these complex labels.
The second column contains the original phrases. Note that the first four phrases in the example
phrases file that we constructed in Step 3 do not have paraphrases. The first phrase for which a paraphrase is found is
the commission accepted. 48 of the 100 original phrases have paraphrases. You can figure this out by typing
cut -f2 phrases.paraphrased | uniq | wc -l at the command line.
The third column contains the paraphrases. There are a total of 1321 paraphrases generated for the 48 phrases (
wc -l phrases.paraphrased). We output up to 10 paraphrases per syntactic label, and many phrases have more than one label. You'll notice that many of the paraphrases are identical across the different syntactic labels. If you ignore the labels and just count the unique paraphrases then there are 436 unique paraphrases (
cut -f2,3 phrases.paraphrased | sort | uniq | wc -l).
The fourth column contains the probabilities that are assigned to each of the paraphrases. This is calculated as described in Equations 6 and 7 in the paper.
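Because the probabilities are conditioned on the phrase and its syntactic label, the values within each (label, phrase) group should sum to at most 1. You can check this with a little awk over the tab-delimited output; the sketch below runs it over the S-labelled group from the example output above:

```shell
#!/bin/sh
# Rows are tab-delimited: label, phrase, paraphrase, probability.
# The data is the S-labelled group from the example output above.
printf '%s\t%s\t%s\t%s\n' \
    S 'the commission accepted' 'the commission accepted' 0.72946429 \
    S 'the commission accepted' 'the commission has accepted' 0.11517857 \
    S 'the commission accepted' "following on from the commission ' s comments" 0.0625 \
    S 'the commission accepted' 'the committee could have delivered' 0.04464286 \
    S 'the commission accepted' 'were accepted by the commission' 0.03125 \
    S 'the commission accepted' 'the committee has taken on board' 0.0125 \
    > toy.paraphrased

# Group by label+phrase and sum column 4. Each group's total should be
# <= 1; it can fall short of 1 because low-probability paraphrases are
# pruned by the minimum-probability threshold in the config file.
awk -F'\t' '{ sum[$1 "\t" $2] += $4 }
    END { for (k in sum) printf "%s\t%.4f\n", k, sum[k] }' toy.paraphrased
```

For this group the probabilities sum to about 0.9955, i.e. just under 1, consistent with the pruning described above.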
In most cases the above instructions are all you'll need. If you'd like additional information about how to extract baseline paraphrases, and generate sets for manual evaluation in order to replicate the results that I reported in my paper, the following steps are for you!
Step 5 (optional): Extracting baseline paraphrases
If you would like to create paraphrases with the baseline model you can do so by running this command:
nohup java -d64 -Xmx4g -Dfile.encoding=utf8 -cp software/linearb.jar edu.jhu.paraphrasing.ParaphraseFinder software/paraphrase.properties.en phrases phrases.paraphrased-baseline &
Once that is done running you can take a look at its output (
head -25 phrases.paraphrased-baseline):
the commission about the the commission 0.17446251
the commission about the commission 0.11266017
the commission about the the commission about the 0.07509643
the commission about the the commission on the 0.07437752
the commission about the the commission about 0.05844822
the commission about the the 0.02667624
the commission about the the commission on 0.02535433
the commission about the on the 0.02387995
the commission about the on 0.01289723
the commission about the commission about 0.01034483
the commission accepted the commission accepted 0.29974536
the commission accepted commission 0.05675223
the commission accepted the commission 0.0558402
the commission accepted the commission has accepted 0.05148735
the commission accepted the commission adopted 0.0478455
the commission accepted the commission agreed 0.04574657
the commission accepted the commission approved 0.02165048
the commission accepted commission has 0.02058147
the commission accepted the commission has 0.01778682
the commission accepted commission accepted 0.01420254
the commission accepted many the commission accepted many 0.6
the commission accepted many commission accepted many 0.21875
the commission accepted many warm 0.125
the commission accepted many many 0.01875
the commission accepted many commission 0.0125
You'll notice that the file now only has three columns since the baseline model doesn't care about syntactic labels. You'll also notice that more of the original phrases have paraphrases. The baseline model generates 557 unique paraphrases for 65 of the 100 phrases (
wc -l phrases.paraphrased-baseline ; cut -f1 phrases.paraphrased-baseline | uniq | wc -l). Many of them are pretty bad. I think that you'll agree that we've improved things with the syntactic constraints.
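To compare the two models beyond eyeballing, you can check which phrases each one managed to paraphrase. A sketch using comm on the sorted phrase columns — column 2 in the syntactic output, column 1 in the baseline output — with toy one-line stand-ins for the two real output files:

```shell
#!/bin/sh
# Toy stand-ins for the two output files described above.
# Syntactic output: label, phrase, paraphrase, prob (phrase = column 2).
printf 'S\tthe commission accepted\tthe commission adopted\t0.125\n' > toy.syn
# Baseline output: phrase, paraphrase, prob (phrase = column 1).
printf 'the commission accepted\tthe commission adopted\t0.0478\n' >  toy.base
printf 'the commission about the\tthe commission\t0.1745\n'        >> toy.base

# comm requires sorted input, so sort the deduplicated phrase lists.
cut -f2 toy.syn  | sort -u > syn.phrases
cut -f1 toy.base | sort -u > base.phrases

# Phrases paraphrased only by the baseline model:
comm -13 syn.phrases base.phrases
```

On the real files, comm -13 lists the phrases that only the baseline covered, comm -23 those only the syntactic model covered, and comm -12 the shared ones, which is a quick way to see where the extra baseline coverage comes from.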
The next page gives instructions on how to more rigorously evaluate whether the syntactically constrained paraphrase model is better than the baseline model.