This work was done as part of DARPA's GALE project for improving Arabic-to-English machine translation. I am part of JHU's System Combination team, where we focus on ways to combine the output of multiple MT systems. There are several approaches to machine translation, which can be classified as phrase-based, hierarchical, and syntax-based; these achieve comparable translation quality even though the underlying frameworks are completely different. The motivation behind system combination arises from this diversity among state-of-the-art MT systems: systems built on different paradigms make different errors, so a stronger system can be built by combining their strengths. At JHU we use a confusion-network-based pipeline for system combination.
Here is a brief report on the pipeline and the areas I have been working on.
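To give a flavor of how a confusion network combines hypotheses: after the systems' outputs are word-aligned to a common skeleton (the hard part, which real pipelines do with alignment tools and which is omitted here), each word position becomes a slot, and the combined output can be read off by voting per slot. The sketch below assumes the alignment has already been done and uses simple majority voting; it is an illustration, not our actual pipeline.

```python
from collections import Counter

def combine(aligned_hyps):
    """Pick the majority word in each slot of a confusion network.

    aligned_hyps: list of hypotheses, each a list of tokens already
    aligned to a common skeleton; "" stands for an epsilon (deletion) arc.
    """
    combined = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word:  # an epsilon majority means the slot emits nothing
            combined.append(word)
    return combined

hyps = [
    ["the", "cat", "sat",  ""],
    ["the", "cat", "sits", "down"],
    ["a",   "cat", "sat",  "down"],
]
print(" ".join(combine(hyps)))  # -> "the cat sat down"
```

In a real system each arc would carry system weights and language-model scores rather than raw counts, and the best path would be found by decoding over the network.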
Here is a poster I presented for one of my Machine Learning classes.
We recently participated in the WMT'10 System Combination task.
This work was done at JHU CLSP Summer Workshop with Satoshi Sekine.
Information retrieval applications increasingly benefit from lists of named entities. There have been various shared tasks, such as CoNLL 2002 and CoNLL 2003, aimed at improving named-entity taggers. CRFs and other supervised methods that treat NER as a sequence-labeling task have been shown to achieve accuracy very close to human performance. However, the usefulness of supervised methods is restricted by the availability of tagged (training) data. Training data is abundant for a few languages such as English, Spanish, and Dutch, but generating training data for new languages and domains is a costly and time-consuming affair. This report describes a bootstrapping-based approach to named-entity extraction, which works with a handful of seed examples for a category and a large amount of untagged text to retrieve a bigger set of entities. Since seed examples and large untagged corpora are easily available, this method can be adapted to any domain and language. We worked on NE categories; a hierarchy of these categories can be found here. We built a "Self-Adjustable Bootstrapping Model" for NE extraction; this model adjusts its parameters across different categories and domains.
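The core bootstrapping loop is simple to state: learn contextual patterns from the seed entities, harvest new candidates that occur in those patterns, add the best candidates to the entity set, and repeat. The toy sketch below illustrates the idea with single-token entities and (left word, right word) context pairs; our actual model uses richer patterns and the self-adjusting scoring described in the report.

```python
from collections import Counter

def bootstrap(corpus, seeds, rounds=2, top_k=2):
    """Toy bootstrapping loop: learn (left, right) context pairs from
    seed entities, then harvest new words seen in those contexts."""
    entities = set(seeds)
    for _ in range(rounds):
        # 1. Collect contexts around the currently known entities.
        contexts = set()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            for i in range(1, len(toks) - 1):
                if toks[i] in entities:
                    contexts.add((toks[i - 1], toks[i + 1]))
        # 2. Harvest new candidates that appear in a learned context.
        candidates = Counter()
        for sent in corpus:
            toks = ["<s>"] + sent.split() + ["</s>"]
            for i in range(1, len(toks) - 1):
                if (toks[i - 1], toks[i + 1]) in contexts and toks[i] not in entities:
                    candidates[toks[i]] += 1
        # 3. Keep only the highest-scoring candidates each round.
        entities |= {w for w, _ in candidates.most_common(top_k)}
    return entities

corpus = [
    "he flew to Paris yesterday",
    "he flew to Tokyo yesterday",
    "she moved to Berlin yesterday",
]
print(bootstrap(corpus, {"Paris"}))  # Tokyo and Berlin share Paris's context
```

The reason the self-adjusting component matters is visible even here: how many candidates to accept per round (`top_k`) and how strictly to match contexts are parameters whose best values differ by category and domain.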
Here is a brief report and my presentation at the Summer workshop, describing the model.
The motivation behind our research is to improve performance on the named-entity recognition task for languages that have no training data, or for which gathering tagged data is expensive. We are experimenting with English, Spanish, Dutch, and German. The basic intuition is to take features from a language for which resources are easily available (e.g., English) and use domain adaptation to carry them over to a new language, improving the performance of the NE tagger there. We are currently using a CRF-based named-entity tagger developed on the mallet-0.4 package. Since we assume that our target languages do not have many resources available, we do not use parallel corpora or machine translation systems.
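Feature transfer across languages works because many NER features are largely language-independent: capitalization, digit patterns, and affixes behave similarly in English, Spanish, Dutch, and German. The sketch below shows the kind of per-token feature dictionary typically fed to a CRF tagger; it is an illustrative example, not the feature set of our mallet-based tagger.

```python
def token_features(tokens, i):
    """Mostly language-independent features for position i in a sentence,
    of the kind commonly used by CRF-based NE taggers."""
    tok = tokens[i]
    return {
        "word": tok.lower(),
        "is_capitalized": tok[:1].isupper(),
        "is_all_caps": tok.isupper(),
        "has_digit": any(c.isdigit() for c in tok),
        "prefix3": tok[:3].lower(),
        "suffix3": tok[-3:].lower(),
        "prev_word": tokens[i - 1].lower() if i > 0 else "<s>",
        "next_word": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>",
    }

print(token_features("Angela Merkel visited Madrid".split(), 3))
```

Lexical features like `word` are what fail to transfer across languages; the adaptation step is about reweighting toward the surface-shape features that do.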