Training an MT system requires large amounts of high-quality parallel corpora. Scraping the Internet can provide additional parallel corpora, but web-scraped data may actually reduce the quality of an MT system if the scraped text was itself generated by Machine Translation. Therefore, to train an MT system on web-scraped data, one must first sift through the data and remove parallel sentences generated by Machine Translation. This paper discusses how to build a classifier that detects whether an aligned English-Hebrew sentence pair in a parallel corpus was generated by Machine Translation or by a human translator. It is our intuition that removing individual aligned sentences from web-scraped parallel corpora, as opposed to removing entire aligned documents, will improve the quality of an MT system trained on that data.
Like most Machine Learning tasks, building a high-quality Machine Translation system requires a large amount of data. In Machine Translation, the training data is typically represented as a parallel corpus, where the text in the second language is generated by translating the original document. Since Machine Translation tries to replicate human translation, a parallel corpus generated by human translators is viewed as the “gold standard”. Therefore, the parallel corpora that an MT system learns from should be translated by hand, or at least be deemed an acceptable translation by a human fluent in both the original and translated languages. Otherwise, if an MT system learns from a parallel corpus generated by another MT system, for example Google Translate, it will not learn how to translate correctly, but will instead learn to mimic Google Translate.
Before it was even technically possible, many people hoped that the Internet would become a valuable source of parallel corpora. In 2003, Resnik and Smith lamented that parallel corpora “are not readily available in the necessary quantities”1.
Since then, recent work has taken advantage of the proliferation of data on the Internet. Because many websites offer the same content in more than one language, crawling the Internet to massively increase the amount of data that MT systems can learn from has become common. While such approaches would generally be applauded by data scientists and the broader Machine Learning community for increasing the amount of training data, in Machine Translation scraping the web introduces a major source of sub-optimal data.
Parallel corpora scraped from the web are not guaranteed to be high-quality data, as it has become increasingly easy and inexpensive to translate websites into many languages. For example, Google offers a free tool called “Website Translator”2 that offers the ability to “add the power of Google Translate’s automatic translations to [a] website” for free, supporting over 90 languages. Therefore, an MT system should be trained only on web-crawled parallel corpora that have been identified as generated by human translators.
The need to clean web-scraped parallel corpora has been apparent for over half a decade. (Tsvetkov and Wintner 2010) used a dictionary-based algorithm to extract parallel document pairs containing manually translated texts with a remarkable precision of 100%. Their approach is language agnostic, but they applied it to Hebrew-English document pairs. More recently, (Rarrick et al. 2011) showed that cleaning web-scraped parallel corpora with a classifier can improve MT systems, drastically so for lower-density languages. Their approach was also language agnostic; the lower-density language pairs they were concerned with were English-Latvian and English-Romanian.
The majority of these and other works classify parallel corpora as the product of Human Translation or Machine Translation at the document level. Since many documents are translated by humans with the aid of machine translation, there can still be useful, high-quality translations in a parallel document pair where the majority of sentences have been translated by a machine3. The previous works would throw the baby out with the bath water4, as they would remove the entire document pair from the training data instead of extracting “high-quality” parallel sentences from it. Therefore, this paper classifies each parallel sentence pair as Human or Machine Translation. Due to the limited scope of this project, we will not test the intuition that retaining “high-quality” parallel sentence pairs from “low-quality” parallel document pairs will improve MT systems5.
Therefore, we have created a classifier to detect whether aligned sentences in parallel corpora have been generated by MT or by human translators. For the scope of this project, we deal only with English-Hebrew and English-Spanish parallel corpora and build one classifier for each pair. This way, we ignore issues of language detection and can build a classifier for a specific language pair. Our baseline for the English-Hebrew classifier is 50 percent accuracy. (Tsvetkov and Wintner 2010) achieved a precision of 100%, but they scraped the web for 3 months and used specific thresholds to optimize for precision over recall or accuracy.
Our classifier uses supervised learning to determine whether to categorize a parallel sentence pair as Machine Translation or Human Translation. Therefore, before building the classifier, we must first have a set of parallel corpora tagged as “Machine Translation” and a set tagged as “Human Translation”.
Finding a set of parallel corpora generated by human translation is relatively straightforward thanks to OPUS. For both the English-Hebrew and English-Spanish classifiers, we use translated Wikipedia6 7 text provided by OPUS as “Human Translation”.
Finding a set of parallel corpora generated by Machine Translation is not as simple, since there is no open and free dataset of parallel corpora specifically generated by Machine Translation. Therefore, as suggested by Professor Koehn, we tried running one side of the parallel corpora through different MT systems to generate a set of parallel corpora tagged as “Machine Translation.” Since the English dataset of Wikipedia that we are using contains roughly 30 million characters, we would need to run 60 million characters through an MT system: 30 million to translate from English to Hebrew and 30 million to translate from English to Spanish.
Unfortunately, running one side of the parallel corpora through different MT systems turned out to be nontrivial. The Google Translate API costs $20 USD per million characters translated. After trying a few hacks to work around the Google Translate API’s restrictions, it was clear that using it would be too costly8.
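As a back-of-the-envelope check on that cost, using the $20 per million characters figure and the corpus size quoted above:

```python
# Estimated cost of translating the English side of the corpus with the
# Google Translate API at $20 USD per million characters.
PRICE_PER_MILLION = 20           # USD, from the API pricing above
chars_per_direction = 30_000_000  # English side of the Wikipedia corpus
directions = 2                    # English->Hebrew and English->Spanish

total_chars = chars_per_direction * directions
cost = total_chars / 1_000_000 * PRICE_PER_MILLION
print(f"Estimated cost: ${cost:,.0f} USD")  # prints "Estimated cost: $1,200 USD"
```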
The next approach we tried was using Moses to generate a set of parallel corpora. However, Moses requires large language models and a great deal of pre-processing. Even though Moses is installed on the CLSP cluster and language models are already trained for some languages, we were unable to generate a set of parallel corpora with Moses.
The third and most successful tool we tried was the Microsoft Bing Translate API. Like the Google Translate API, the Microsoft Bing Translate API is expensive. Table 1 shows its pricing.
| # of characters (in Millions) | Price per Month (in USD) |
|---|---|
However, since each Microsoft account is allowed to translate 2 million characters a month for free, we enlisted the help of friends to translate 2 million characters per friend for free. Currently, we have a list of roughly 20 ClientIDs and ClientSecrets, enabling us to translate roughly 40 million characters. This was sufficient to translate 139,853 English sentences into Hebrew9.
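The bookkeeping of spreading sentences across free-tier accounts can be sketched as follows. This is our own illustration of the quota logic only; the actual Bing Translate API calls (and the ClientID/ClientSecret handling) are omitted:

```python
# Greedily assign sentences to accounts, each limited to 2 million
# characters per month, without exceeding any account's quota.
FREE_TIER_CHARS = 2_000_000

def assign_batches(sentences, n_accounts, limit=FREE_TIER_CHARS):
    """Return one list of sentences per account; each account's total
    character count stays within `limit`. Raises if quota runs out."""
    batches = [[] for _ in range(n_accounts)]
    used = [0] * n_accounts
    account = 0
    for sentence in sentences:
        # Move to the next account once the current one is full.
        while account < n_accounts and used[account] + len(sentence) > limit:
            account += 1
        if account == n_accounts:
            raise RuntimeError("not enough free-tier quota for all sentences")
        batches[account].append(sentence)
        used[account] += len(sentence)
    return batches
```

Each batch would then be sent through one account's credentials; the greedy sequential assignment keeps the logic simple at the cost of leaving a little headroom unused in each account.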
After translating 139,853 English sentences to generate a parallel corpus tagged as “Machine Translation”, we have a total of 279,706 English-Hebrew sentence pairs. We set aside 79,853 “Human Translation” and 79,853 “Machine Translation” pairs for training, 30,000 of each for development, and 30,000 of each for testing. Thus our ratio of dev:train:test data is roughly 21:57:21.
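A minimal sketch of this split, assuming each corpus is held in memory as a list of sentence pairs (the shuffle is our own assumption; the text does not say whether pairs were shuffled before splitting):

```python
import random

# Split a corpus of sentence pairs into train/dev/test partitions with
# the sizes described above (79,853 / 30,000 / 30,000 per label).
def split_corpus(pairs, n_train=79_853, n_dev=30_000, n_test=30_000, seed=0):
    assert len(pairs) == n_train + n_dev + n_test
    pairs = pairs[:]                       # avoid mutating the caller's list
    random.Random(seed).shuffle(pairs)     # assumed; not stated in the text
    train = pairs[:n_train]
    dev = pairs[n_train:n_train + n_dev]
    test = pairs[n_train + n_dev:]
    return train, dev, test
```

The split would be applied separately to the “Human Translation” and “Machine Translation” pairs so each partition stays balanced.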
Following Rarrick et al., the features used to train the classifier are language agnostic10. However, since our classifier is not concerned with classifying entire documents as Human or Machine Translated, we ignore all document-level and HTML data. In total, our classifier uses 9 features that can be divided into the following 3 categories, roughly following Rarrick et al.:
Length ratios:

- The ratio of the number of characters in the source sentence to the number of characters in the target sentence.
- The ratio of the number of tokens in the source sentence to the number of tokens in the target sentence.
- The ratio of the mean token length in the source sentence to the mean token length in the target sentence.

Word copying:

- The number of words in the target sentence that appear in the source sentence. Typically, if an MT system has never seen a word or phrase in the source sentence, it will simply copy that word or phrase instead of translating it poorly11.
- The ratio of words in the target sentence that appear in the source sentence.
- A binary indicator of whether none or all12 of the words in the target sentence appear in the source sentence.

Bi-gram language models:

- The number of tokens in the target sentence that align more accurately with the human-translated bi-gram language model.
- The number of tokens in the target sentence that align more accurately with the machine-translated bi-gram language model.
- The percentage of tokens in the target sentence that align more accurately with the machine-translated bi-gram language model.
In Rarrick et al., these 3 features were integrated as one binary indicator feature, which used unigram language models as opposed to bi-gram language models.
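The nine features above can be sketched as follows. This is our own illustration, not the paper's code: it assumes whitespace tokenization (in line with footnote 10) and stands in for the real bi-gram language models with raw bigram counts, treating a bigram as "aligning more accurately" with whichever model saw it more often:

```python
from collections import Counter

def bigram_counts(sentences):
    """Toy stand-in for a bi-gram language model: raw bigram counts."""
    counts = Counter()
    for s in sentences:
        toks = ["<s>"] + s.split() + ["</s>"]
        counts.update(zip(toks, toks[1:]))
    return counts

def extract_features(source, target, ht_lm, mt_lm):
    """The 9 features, assuming whitespace tokenization. `ht_lm` and
    `mt_lm` are bigram counts from human- and machine-translated text."""
    src_toks, tgt_toks = source.split(), target.split()
    src_set = set(src_toks)
    copied = sum(1 for t in tgt_toks if t in src_set)

    tgt_padded = ["<s>"] + tgt_toks + ["</s>"]
    bigrams = list(zip(tgt_padded, tgt_padded[1:]))
    ht_better = sum(1 for b in bigrams if ht_lm[b] > mt_lm[b])
    mt_better = sum(1 for b in bigrams if mt_lm[b] > ht_lm[b])

    return [
        # Length-ratio features
        len(source) / max(len(target), 1),
        len(src_toks) / max(len(tgt_toks), 1),
        (sum(map(len, src_toks)) / max(len(src_toks), 1))
            / max(sum(map(len, tgt_toks)) / max(len(tgt_toks), 1), 1e-9),
        # Word-copying features
        copied,
        copied / max(len(tgt_toks), 1),
        1.0 if copied in (0, len(tgt_toks)) else 0.0,
        # Bi-gram language-model features
        ht_better,
        mt_better,
        mt_better / max(len(bigrams), 1),
    ]
```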
We used different types of out-of-the-box classifiers from a Python machine-learning module. The tables below contain the results of evaluating the different classifiers on the test data.
| Classifier | HT Precision | HT Recall | MT Precision | MT Recall | Accuracy |
|---|---|---|---|---|---|
| Bernoulli Naive Bayes | 98.18 | 41.56 | 62.93 | 99.23 | 70.36 |
| Gaussian Naive Bayes | 94.75 | 64.94 | 73.33 | 96.40 | 80.67 |
| Stochastic Gradient Descent13 | 87.94 | 79.34 | 81.18 | 89.12 | 84.23 |
As seen above, the Stochastic Gradient Descent classifier and an SVM classifier are the most accurate on the test data with the implemented features. Although the Bernoulli Naive Bayes classifier had the lowest accuracy, its precision for predicting an aligned sentence pair as Human Translated is impressive at 98.18%, very close to Tsvetkov and Wintner’s precision of 100%.
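The text does not name the Python module, but the classifier names match scikit-learn's BernoulliNB, GaussianNB, and SGDClassifier; assuming that module, the comparison might be sketched as:

```python
# Sketch of the classifier comparison, assuming the unnamed Python module
# was scikit-learn. X_* are feature matrices (one row of the 9 features
# per sentence pair); y_* are labels (1 = Human, 0 = Machine Translation).
from sklearn.naive_bayes import BernoulliNB, GaussianNB
from sklearn.linear_model import SGDClassifier

def compare_classifiers(X_train, y_train, X_test, y_test):
    classifiers = {
        "Bernoulli Naive Bayes": BernoulliNB(),
        "Gaussian Naive Bayes": GaussianNB(),
        # hinge loss + L2 penalty = a soft-margin linear SVM (cf. footnote 13)
        "Stochastic Gradient Descent": SGDClassifier(loss="hinge",
                                                     penalty="l2",
                                                     shuffle=True),
    }
    return {name: clf.fit(X_train, y_train).score(X_test, y_test)
            for name, clf in classifiers.items()}
```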
Surprisingly, our classifier’s accuracy was generally higher on the test data than on the development data. This suggests that our test data was more similar to our training data than our development data was.
The results of the classifier are encouraging. With just these 9 features, our classifier greatly surpassed our baseline14. However, to see whether this classifier helps with the greater issue of how to use web-scraped parallel corpora to train an MT system, the next step would be to use it to remove “low-quality” parallel sentences when training an MT system on web-scraped data. Since the features in the classifier are language agnostic, the classifier can easily be integrated into any pipeline that uses web-scraped parallel corpora to train an MT system.
Since websites are also translated with MT systems other than Microsoft Bing, future work should include using other MT systems to generate “Machine Translated” parallel text.
Resnik and Smith. 2003. The Web as a Parallel Corpus. Computational Linguistics, 29:349–380.

Rarrick, Quirk, and Lewis. 2011. MT Detection in Web-Scraped Parallel Corpora. Proc. MT Summit XIII.

Tsvetkov and Wintner. 2010. Automatic Acquisition of Parallel Corpora from Websites with Dynamic Content. Proc. LREC 2010.

Antonova and Misyurev. 2011. Building a Web-Based Parallel Corpus and Filtering Out Machine-Translated Text. Proc. 4th Workshop on Building and Using Comparable Corpora, ACL 2011.
3. A good classifier would classify such a pair as Machine Translation output.

4. This is where the paper derives its title from.

5. Future work would be to determine whether including “high-quality” sentence pairs, extracted from multilingual documents classified as MT output, in training data can improve an MT system.

8. However, one approach not tried is to create a single-page website containing a large amount of text and translating the page using Google’s free Website Translator tool.

9. The Joshua decoder is currently translating more sentences, but unfortunately we were unable to include it in our experiment by the deadline.

10. This assumes that both the source and target language use spaces to separate words in a sentence.

11. For example, Google Translate translates the English proper noun “Trello” as “Trello” in Hebrew instead of transliterating the word into Hebrew.

12. Such sentences usually are editor’s notes or references; “New York, W.W. Norton & Co., 1947.” is an example. There are 6,928 such sentences in the training data and 932 in the development data.

13. With the training data shuffled, the loss of a (soft-margin) linear Support Vector Machine, and an L2 norm penalty.

14. Although the baseline was naively very low.