XLEL-21: Cross-language entity linking in twenty-one languages
Version 1.1
8 November 2011


OVERVIEW
========

The XLEL-21 collection was developed to support the training and evaluation of
cross-language linking of named entities from twenty-one non-English languages
into an English knowledge base.


CREATORS
========

Created at the Johns Hopkins University Human Language Technology Center of
Excellence (HLTCOE) by:
	Dawn Lawrie		lawrie "at" cs.loyola "dot" edu
	James Mayfield		james.mayfield "at" jhuapl "dot" edu
	Paul McNamee		paul.mcnamee "at" jhuapl "dot" edu
	Douglas W. Oard		oard "at" umd "dot" edu
Please direct comments or questions to James Mayfield.


RESTRICTIONS
============

The portions of the collection developed by HLTCOE are in the public domain and
are freely distributable.  If you use the collection in your research, we ask
that you cite the following paper:

@inproceedings{clef:mayfield:2011,
	Author =    "James Mayfield and Dawn Lawrie and Paul McNamee and
	       	     Douglas W. Oard",
	Booktitle = "Multilingual and Multimodal Information Access Evaluation:
                     Second International Conference of the Cross-Language
                     Evaluation Forum",
	Title =     "Building a Cross-Language Entity Linking Collection in
	      	     Twenty-One Languages",
	Year =      "2011",
        editor =    "Pamela Forner and Julio Gonzalo and Jaana Kek{\"a}l{\"a}inen and
                     Mounia Lalmas and Maarten de Rijke",
        publisher = "Springer",
        series =    "Lecture Notes in Computer Science",
        volume =    "6941",
        isbn =      "978-3-642-23707-2",
	pages =     "3--13",
}

The document collections are distributed by others, and have the
following restrictions:

The SE Times collection gives the following information in
<http://www.setimes.com/cocoon/setimes/xhtml/en_GB/document/setimes/footer/disclaimer/disclaimer>:

    "Unless a copyright is indicated, information on the site is in the public
    domain and may be copied and distributed without permission. Citation of the
    original source of the information is appreciated. If a copyright is
    indicated on a photo, graphic or other material, permission to copy these
    materials must be obtained from the original source."

The Europarl collection gives the following information in
<http://www.statmt.org/europarl/>:

    "We are not aware of any copyright restrictions of the material."

The Project Syndicate collection gives no information about the status of the
materials; we make no representations as to the status of those documents.

The LDC collections are subject to LDC membership and distribution rules; see
<http://www.ldc.upenn.edu/> for more information.


DISCLAIMER
==========

The Creators of XLEL-21 and the HLTCOE make no warranty, expressed or implied,
including warranties of merchantability and fitness for a particular purpose
with respect to the documents, software, or data contained in this collection,
nor do they assume legal liability for the accuracy, completeness or usefulness
of any information, product or process disclosed herein, nor do they represent
that use of such information, product or process would not infringe on privately
owned rights.


FILES
=====

This directory is organized as follows:

	README.txt: this file
	Query files. Each leaf directory contains four files:
		1. PER-all-queries: all of the queries for the language/collection
		2. PER-train-queries: suggested query set for training
		3. PER-development-queries: suggested query set for devtest
		4. PER-eval-queries: suggested query set for evaluation testing
	The following query directories are available:
		monolingual-queries: queries by collection, in English
		machine-aligned-queries: queries by non-English language, obtained
		  by automatic projection of English queries onto target language
		  documents
		hand-curated-queries: queries by non-English language, manually
		  curated to ensure reasonable projections. These queries are to
		  be preferred, but are not yet available in all languages
	Scripts:
		BuildLDCChineseParallelCorpus.pl: Convert your copy of
		  LDC2005T10 to a parallel document collection. You
		  will need to download and install Encode-HanExtra from CPAN to
		  use this script; the URL is <http://search.cpan.org/dist/Encode-HanExtra/>.
		  The following errors are generated; we hope to
		  correct these in a future release:
		  	  big5plus "\xFC" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFC" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFC" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x85" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x85" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x85" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFB" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFD" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFB" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFD" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFB" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFB" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFA" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\xFA" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x8D" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x8A" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.
			  big5plus "\x8D" does not map to Unicode at BuildLDCChineseParallelCorpus.pl line 34.

		BuildProjSyndParallelCorpus.pl: Convert the Project
		  Syndicate data (from training-parallel.tar) to a parallel
		  document collection


QUERIES
=======

This collection contains (uncurated) training and test queries for cross-language entity
linking in twenty-one languages:

Language	Collection	Queries	Non-NIL
Arabic (ar)	Arabic		 2,829	   661
Chinese (zh)	Chinese		 1,958	   956
Danish (da)	Europarl	 2,105	 1,096
Dutch (nl)	Europarl	 2,131	 1,087
Finnish (fi)	Europarl	 2,038	 1,049
Italian (it)	Europarl	 2,135	 1,087
Portuguese (pt)	Europarl	 2,119	 1,096
Swedish (sv)	Europarl	 2,153	 1,107
Czech (cs)	ProjSynd	 1,044	   722
French (fr)	ProjSynd	   885	   657
German (de)	ProjSynd	 1,086	   769
Spanish (es)	ProjSynd	 1,028	   743
Albanian (sq)	SETimes		 4,190	 2,274
Bulgarian (bg)	SETimes		 3,737	 2,068
Croatian (hr)	SETimes		 4,139	 2,257
Greek (el)	SETimes		 3,890	 2,129
Macedonian (mk)	SETimes		 3,573	 1,956
Romanian (ro)	SETimes		 4,355	 2,368
Serbian (sr)	SETimes		 3,943	 2,156
Turkish (tr)	SETimes		 3,991	 2,169
Urdu (ur)	Urdu		 1,828	 1,093
Total				55,157	29,500

The collection also contains monolingual English queries by collection:

Collection	Queries	Unknown	Non-NIL
Arabic		 6,303	 3,274	  695		
Chinese		15,591	13,572	  982
Europarl	 3,414	 1,101	1,184
ProjSynd	 3,560	 2,307	  872
SETimes		14,148	 9,778	2,377
Urdu		14,160	12,197	1,168

Query files are UTF-8-encoded tab-separated values (tsv) files with no header
line. The five fields on each line are, in order:

     1. a query ID (a string uniquely identifying the query)
     2. a query string (a string referring to a named entity)
     3. a query document (the name of a document containing the query string)
     4. an entity type (PER for person, GPE for geo-political entity, ORG for
        organization, UKN for unknown; every query in this collection should
        be PER)
     5. a KBID (a string indicating which entity in the knowledge base the query
        string refers to, or the string NIL if the entity is not in the KB, or the
        string UNKNOWN if no ground truth is available for the query)
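
As an illustration of the format described above, the following Python sketch
reads a query file into records and tallies the queries by KBID status. The
names Query, parse_queries, read_queries and kbid_counts are hypothetical and
are not part of this distribution; the sketch assumes that every non-empty line
has exactly five tab-separated fields.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Query(NamedTuple):
    """One line of a query file; field names are illustrative only."""
    query_id: str
    query_string: str
    query_document: str
    entity_type: str
    kbid: str


def parse_queries(lines: Iterable[str]) -> List[Query]:
    """Parse tab-separated query lines (no header line) into Query records."""
    queries = []
    for line_number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:
            continue  # tolerate blank lines (an assumption, not a documented feature)
        fields = line.split("\t")
        if len(fields) != 5:
            raise ValueError(
                f"line {line_number}: expected 5 fields, got {len(fields)}"
            )
        queries.append(Query(*fields))
    return queries


def read_queries(path: str) -> List[Query]:
    """Read a UTF-8 query file such as PER-all-queries."""
    with open(path, encoding="utf-8", newline="") as handle:
        return parse_queries(handle)


def kbid_counts(queries: Iterable[Query]) -> Counter:
    """Count queries whose KBID is NIL, UNKNOWN, or a knowledge base entry."""
    counts: Counter = Counter()
    for query in queries:
        if query.kbid in ("NIL", "UNKNOWN"):
            counts[query.kbid] += 1
        else:
            counts["non-NIL"] += 1
    return counts
```

Calling kbid_counts(read_queries(path)) on a query file gives a quick check
against the Non-NIL and Unknown columns in the tables above.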


DOCUMENTS
=========

We are not currently distributing documents with the collection.  Two collections
are currently available externally at:
	  Europarl: <http://www.cs.umbc.edu/~mcnamee/europarl-docs.tar.bz2>
	  SETimes:  <http://www.cs.umbc.edu/~mcnamee/setimes-docs.tar.bz2>
The Europarl documents are a subset of Europarl v5. Once the COE Web Site is
better able to host large files, we will include the Europarl and SETimes
documents there.

The Project Syndicate data can be obtained from
<http://www.statmt.org/wmt10/translation-task.html>, as "Parallel corpus
training data" under the heading "Download." The script
BuildProjSyndParallelCorpus.pl will build the query documents once the
collection has been downloaded and uncompressed; run the script with no
arguments to see its correct usage.

The Arabic, Chinese and Urdu collections are produced by the Linguistic Data
Consortium (<http://www.ldc.upenn.edu/>), and should be obtained from them.  The
LDC catalog numbers for the three collections are:
	    Arabic:	LDC2004T18
	    Chinese:	LDC2005T10
	    Urdu:	LDC2006E110

Like the Project Syndicate data, the Urdu data need to be assembled into
articles before being used for entity linking.  The script
BuildUrduParallelCorpus.pl will do so; run it with no parameters for usage
information. THIS SCRIPT IS NOT PART OF THE CURRENT DISTRIBUTION; we
hope to bring our Urdu offerings up to date soon.


KNOWLEDGE BASE
==============

The TAC knowledge base is produced by the Linguistic Data Consortium as
catalogue number LDC2009E58; see:
   <http://projects.ldc.upenn.edu/kbp/data/>
It is generally available to participants in the TAC evaluation:
   <http://www.nist.gov/tac/>
Researchers who need but are not able to afford LDC data should contact LDC to
discuss alternative arrangements.


TRAINING/TEST SPLIT
===================

The queries in each collection were divided into training (60%), development
(20%), and test (20%) sets at the collection level, so that all languages that
come from a particular collection have the same division.


SUPPORTING FAIR MACHINE TRANSLATION
===================================

If you would like to train your own machine translation system from the parallel
documents provided here, we recommend placing all documents that are the source
of queries with known answers (i.e., that have something other than UNKNOWN in
the KBID field) in the evaluation partition. The remaining documents can be
used for training and development.
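
One way to apply this recommendation is sketched below in Python. It assumes
that the queries have already been reduced to (query document, KBID) pairs
taken from fields 3 and 5 of the query files, and that a list of all document
names is available. The function name partition_documents is hypothetical and
is not part of this distribution.

```python
from typing import Iterable, Set, Tuple


def partition_documents(
    doc_kbid_pairs: Iterable[Tuple[str, str]],
    all_documents: Iterable[str],
) -> Tuple[Set[str], Set[str]]:
    """Split documents into (evaluation, training_and_development).

    A document is placed in the evaluation partition if at least one query
    drawn from it has a known answer, i.e. a KBID other than UNKNOWN
    (either a knowledge base entry or NIL).
    """
    evaluation = {doc for doc, kbid in doc_kbid_pairs if kbid != "UNKNOWN"}
    # Everything else is free to use for MT training and development.
    remaining = set(all_documents) - evaluation
    return evaluation, remaining
```

A document that is the source of both answered and UNKNOWN queries ends up in
the evaluation partition, so no answered query can leak into the training data.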

KNOWN LIMITATIONS
=================

Queries in this version of the collection are automatically generated and are
known to contain some misalignments.  The degree of this problem is known to
vary by language.  Users are encouraged to check for the availability of future
versions of this test collection, in which we expect to make manually verified
queries available.

Document alignments were produced automatically, and it is likely that there are
errors in the segmentation and alignment processes. Caveat emptor.


ACKNOWLEDGMENTS
===============

We are grateful to the many Mechanical Turk annotators who provided us with
fast, accurate responses to our requests.

Curation assistance was freely provided by Tan Xu, Mohammad Raunak,
Mossaab Bagdouri, and Damianos Karakos, for which
we are grateful.

We are grateful to the creators of the Europarl, Project Syndicate and
South-East European Times collections, the paper citations for which are:

Europarl:
@InProceedings{MTS:koehn:2005,
  address =   "Phuket, Thailand",
  author =    "Philipp Koehn",
  booktitle = "Proceedings of the Tenth Machine Translation Summit",
  pages =     "79--86",
  title =     "Europarl: A Parallel Corpus for Statistical Machine Translation",
  year =      "2005",
}


Project Syndicate:

None provided.

SE Times:

@InProceedings{lrec:tyers:2010,
  title =        "South-{E}ast {E}uropean {T}imes: {A} parallel corpus of
  		  {B}alkan languages",
  author =       "Francis Tyers and Murat Serdar Alperen",
  pages =        "49--53",
  booktitle =    "Proceedings of the LREC Workshop on Exploitation of Multilingual
		  Resources and Tools for Central and (South-) Eastern European
		  Languages",
  year =         "2010",
  address =      "Valletta, Malta",
  editor =       "Stelios Piperidis and Milena Slavcheva and Cristina Vertan",
}


VERSION HISTORY
===============

	Version 1.1	2011.11.08	Subset of hand-curated queries
	Version 1.0	2011.09.19	Original distribution
