Johns Hopkins University

Research


System Combination for Machine Translation Systems

This work is part of DARPA's GALE project for improving Arabic-to-English Machine Translation. I am part of JHU's System Combination team, where we focus on ways to combine the output of multiple MT systems. There are several approaches to machine translation, which can be classified as phrase-based, hierarchical, and syntax-based; these are comparable in translation quality even though the underlying frameworks are completely different. The motivation behind System Combination arises from this diversity among state-of-the-art MT systems: systems with different paradigms make different errors, so better translations can be obtained by combining the strengths of these systems. At JHU we use a confusion-network-based pipeline for System Combination.
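To give a flavor of the idea, here is a deliberately simplified sketch of confusion-network voting. It assumes the hypotheses are already word-aligned to the same length; a real pipeline (including ours) uses a proper alignment step such as TER-based alignment against a skeleton hypothesis, and weights systems rather than counting plain votes.

```python
from collections import Counter

def combine(hypotheses):
    """Toy confusion-network combination: one slot per aligned word
    position, majority vote within each slot. Assumes hypotheses are
    pre-aligned (equal length), which real systems achieve with
    TER-style alignment against a skeleton."""
    slots = [[] for _ in hypotheses[0]]
    for hyp in hypotheses:
        for i, word in enumerate(hyp):
            slots[i].append(word)
    # pick the most frequent word in each slot
    return [Counter(slot).most_common(1)[0][0] for slot in slots]

hyps = [
    ["the", "cat", "sat",  "on", "the", "mat"],
    ["the", "cat", "sits", "on", "the", "mat"],
    ["a",   "cat", "sat",  "on", "the", "mat"],
]
print(combine(hyps))  # ['the', 'cat', 'sat', 'on', 'the', 'mat']
```

Even this toy version shows why combination helps: no single input hypothesis has to be right everywhere, as long as the errors of the different systems fall in different slots.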

Here is a brief report on the pipeline and the areas I have been working on. Here is a poster I presented for one of my Machine Learning classes.

We recently participated in WMT'10 System Combination task.

Self Adjustable Bootstrapping for Named Entity Set Expansion [JHU Summer Workshop]

This work was done at JHU CLSP Summer Workshop with Satoshi Sekine.

Information retrieval applications increasingly benefit from lists of Named Entities. There have been various shared tasks, such as CoNLL 2002 and CoNLL 2003, aimed at improving Named-Entity taggers. CRFs and other supervised methods that treat NER as a sequence-labeling task have been shown to achieve accuracy very close to human performance. However, the usefulness of supervised methods is restricted by the availability of tagged (training) data. Training data is abundant for a few languages such as English, Spanish, and Dutch, but generating training data for new languages and domains is a costly and time-consuming affair. This report describes a bootstrapping-based approach to Named Entity extraction which starts from a handful of seed examples for a category and a large amount of untagged text, and retrieves a bigger set of entities. Since seed examples and large untagged corpora are easily available, this method can be adapted to any domain and language. We worked on a set of NE categories; a hierarchy of these categories can be found here. We built a "Self-Adjustable Bootstrapping Model" for NE extraction; this model adjusts its parameters across different categories and domains.
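The bootstrapping loop can be sketched as follows. This is a minimal illustration of the general seed-context-candidate cycle, not the workshop model itself: the scoring is naive context-overlap counting, and the self-adjusting thresholds that give the model its name are replaced by a fixed top-k cutoff.

```python
from collections import Counter

def bootstrap(seeds, corpus, iterations=2, top_k=2):
    """Toy bootstrapped NE set expansion.

    seeds:  initial entity strings for a category.
    corpus: untagged text as a list of token lists.
    Each iteration: (1) collect the (prev, next) contexts of known
    entities, (2) score unknown tokens by how often they occur in
    those contexts, (3) promote the top-scoring candidates.
    """
    entities = set(seeds)
    for _ in range(iterations):
        contexts = Counter()
        for sent in corpus:
            for i, tok in enumerate(sent):
                if tok in entities:
                    ctx = (sent[i - 1] if i > 0 else "<s>",
                           sent[i + 1] if i + 1 < len(sent) else "</s>")
                    contexts[ctx] += 1
        scores = Counter()
        for sent in corpus:
            for i, tok in enumerate(sent):
                if tok not in entities:
                    ctx = (sent[i - 1] if i > 0 else "<s>",
                           sent[i + 1] if i + 1 < len(sent) else "</s>")
                    if ctx in contexts:
                        scores[tok] += contexts[ctx]
        entities.update(tok for tok, _ in scores.most_common(top_k))
    return entities

corpus = [
    ["lives", "in", "paris", "now"],
    ["lives", "in", "london", "now"],
    ["lives", "in", "tokyo", "now"],
]
expanded = bootstrap(["paris"], corpus)  # picks up london and tokyo
```

The hard part in practice, and the focus of the self-adjustable model, is step (3): a fixed cutoff that works for one category or domain causes semantic drift in another, so the promotion threshold has to adapt.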

Here is a brief report and my presentation at the Summer workshop, describing the model.

Named Entity Recognition for Resource-poor Languages

The motivation behind our research is to improve performance on the Named Entity Recognition task for languages that have no training data, or for which gathering tagged data is expensive. We are experimenting with English, Spanish, Dutch, and German. The basic intuition is to take features from a language for which resources are easily available (e.g. English) and use domain adaptation to adapt them to a new language, improving the performance of the NE tagger. We are currently using a CRF-based Named Entity tagger built on the mallet-0.4 package. Since we assume that our target languages have few resources available, we do not use parallel corpora or Machine Translation systems.
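One way to see why features can transfer across languages is that many useful NER features are language-independent: word shape, capitalization, and affixes carry signal in English, Spanish, Dutch, and German alike. The sketch below shows such features; the names and the exact feature set are illustrative, not the mallet-0.4 configuration used in the project.

```python
def transferable_features(tokens, i):
    """Illustrative language-independent features for one token in a
    sentence, of the kind a CRF tagger could use. No gazetteers or
    language-specific resources are needed, so a model trained on one
    language has a chance of transferring to another."""
    w = tokens[i]
    # map each character to X/x/d to capture the word's "shape"
    shape = "".join(
        "X" if c.isupper() else "x" if c.islower()
        else "d" if c.isdigit() else c
        for c in w
    )
    return {
        "word.lower": w.lower(),
        "word.shape": shape,          # e.g. "Xxxxx" for "Obama"
        "word.isupper": w.isupper(),
        "word.istitle": w.istitle(),
        "prefix3": w[:3],
        "suffix3": w[-3:],
        "prev.istitle": tokens[i - 1].istitle() if i > 0 else False,
    }

feats = transferable_features(["President", "Obama", "visited", "Berlin"], 1)
```

Domain adaptation then comes in on top of such features, re-weighting them for the target language rather than learning the target language from scratch.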

Data Set: CoNLL 2002 and CoNLL 2003