Large-Scale CLIR Dataset

The Large-Scale CLIR Dataset is a retrieval dataset built for Cross-Language Information Retrieval (CLIR).
The dataset is derived from Wikipedia and contains more than 2.8 million English single-sentence queries with relevant documents from 25 other selected languages.

Terms of Use

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

If you use the corpus in your work, please cite:
Shota Sasaki, Shuo Sun, Shigehiko Schamoni, Kevin Duh, and Kentaro Inui
Cross-lingual Learning-to-Rank with Shared Representations
Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT), New Orleans, LA, USA.
June 2018.

Data

All queries and documents in this dataset are extracted from the August 23, 2017 version of the Wikipedia dump.
For practical purposes, each document is limited to the first 200 words of the article.
Empty documents and category pages are filtered out.
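
The truncation and empty-document filtering described above could be sketched roughly as follows. This is only an illustration, not the authors' preprocessing code: the function names are hypothetical, and splitting on whitespace is an assumption that would not give meaningful "words" for languages such as Chinese or Japanese.

```python
def truncate_document(text, max_words=200):
    """Keep only the first `max_words` whitespace-separated tokens.

    Whitespace tokenization is an assumption made for this sketch.
    """
    return " ".join(text.split()[:max_words])


def is_empty(text):
    """Return True if the document has no content besides whitespace."""
    return not text.strip()


# Hypothetical articles, for illustration only.
articles = {
    "1": "word " * 500,   # longer than the limit, will be truncated
    "2": "   ",           # empty, will be dropped
    "3": "a short article",
}

documents = {
    page_id: truncate_document(text)
    for page_id, text in articles.items()
    if not is_empty(text)
}
```

Category pages would need an additional check (for example on the page namespace), which is left out here because the source does not say how they were identified.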

Relevance judgments are constructed from the inter-language links between English Wikipedia articles and foreign-language Wikipedia articles.
A relevance level of 2 is assigned to the cross-lingual mate of the English article, and a level of 1 to all other articles that both link to the mate and are linked from the mate.
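
The labeling rule can be illustrated with a small sketch. It assumes that the mate and the level-1 articles belong to the same foreign-language Wikipedia, and that the link graph is available as a dictionary from page id to the set of page ids it links to. The function name, the data layout, and the ids are hypothetical and are not taken from the released data.

```python
def build_judgments(mate, links):
    """Return {doc_id: relevance_level} for one English query.

    mate  -- id of the article that is the cross-lingual mate of the query's article
    links -- dict mapping each article id to the set of article ids it links to
    """
    judgments = {mate: 2}
    for doc_id in links.get(mate, set()):
        # Level 1 requires links in both directions between doc_id and the mate.
        if doc_id != mate and mate in links.get(doc_id, set()):
            judgments[doc_id] = 1
    return judgments


# Hypothetical link graph, for illustration only.
links = {
    "de-1": {"de-2", "de-3"},  # the mate links to de-2 and de-3
    "de-2": {"de-1"},          # links back to the mate: level 1
    "de-3": {"de-4"},          # does not link back: not judged
    "de-4": {"de-1"},          # links to the mate but is not linked from it: not judged
}

print(build_judgments("de-1", links))
```

Only "de-2" receives level 1 here, because it is the only article connected to the mate in both directions.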

For a more detailed description of the corpus construction process, see the above publication.

Language            #Doc  #Query    #SR
Arabic               535     324    194
Catalan              548     339    625
Chinese              951     463    462
Czech                386     233    720
Dutch               1908     687   1646
Finnish              418     273    665
French              1894    1089   4048
German              2091     938   4612
Italian             1347     808   2635
Japanese            1071     426   2912
Korean               394     224    343
Norwegian-Nynorsk    133      99    150
Norwegian-Bokmål     471     299    663
Polish              1234     693   1777
Portuguese           973     611   1130
Romanian             376     199    251
Russian             1413     664   1656
Simple English       127     114    135
Spanish             1302     781   2113
Swahili               37      22     35
Swedish             3785     639   1430
Tagalog               79      48     23
Turkish              295     185    195
Ukrainian            704     348    565
Vietnamese          1392     354    257
(All numbers are in units of one thousand)
Statistics of the CLIR datasets: for each foreign language, the number of documents (#Doc) and the number of English queries (#Query) are shown. The number of "most relevant" documents (level 2) is by definition equal to #Query. The number of "slightly relevant" documents (level 1) is shown in the column #SR.

Format

The English queries data (wiki_en.queries file) can be found in the "English" folder.

Each of the other folders contains two data files:
1) Foreign Language documents data (.docs file)
2) Relevance judgments (.qrels file)

The format of the English query file is:
EN-wiki-page-id [TAB] first sentence (with article title removed)

The format of a document file is:
[Foreign Language]-wiki-page-id [TAB] article
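
Since the query file and the document files share the same two-column layout (page id, TAB, text), one small reader can serve both. The sketch below is a minimal illustration: the function name and the sample lines are hypothetical, and it assumes UTF-8 files with one record per line.

```python
import io


def read_two_column(lines):
    """Yield (page_id, text) pairs from tab-separated lines.

    Only the first TAB is treated as a separator, so any TAB inside
    the text column is preserved.
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        page_id, _, text = line.partition("\t")
        yield page_id, text


# In-memory sample with made-up ids and text. With the real data one would
# pass an open file instead, e.g. open("English/wiki_en.queries", encoding="utf-8").
sample = io.StringIO("12\tis a city in the north of the country .\n34\tis a species of beetle .\n")

queries = dict(read_two_column(sample))
```

Reading lazily with a generator keeps memory use low for the larger document files. Building a dictionary, as done for the sample, is convenient only when the file fits in memory.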

The format of the relevance judgments file is:
[Foreign Language]-wiki-page-id [TAB] EN-wiki-page-id [TAB] relevance-level
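
A relevance judgments file in this layout could be loaded along the following lines. The column order follows the description above (document id, English query id, relevance level). The function names and sample ids are hypothetical, and skipping malformed lines is a choice made for this sketch.

```python
from collections import defaultdict


def read_qrels(lines):
    """Return {en_id: {doc_id: level}} from three-column tab-separated lines."""
    qrels = defaultdict(dict)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        doc_id, en_id, level = fields
        qrels[en_id][doc_id] = int(level)
    return dict(qrels)


def count_by_level(qrels):
    """Count judged documents per relevance level."""
    counts = defaultdict(int)
    for docs in qrels.values():
        for level in docs.values():
            counts[level] += 1
    return dict(counts)


# Made-up sample lines, for illustration only.
sample_lines = ["d1\tq1\t2\n", "d2\tq1\t1\n", "d3\tq2\t2\n"]

qrels = read_qrels(sample_lines)
counts = count_by_level(qrels)
```

On a real .qrels file, the count for level 2 should correspond to the #Query column of the statistics table and the count for level 1 to the #SR column (both reported there in thousands).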

Download

Full Raw Data in all languages (6.6GB)

Data split (de, ja, fr, sw, tl) used in Sasaki et al. 2018 (5.8GB)

Contact

Any questions about the dataset can be directed to Shuo Sun (ssun32@jhu.edu).

More information can be found in the paper cited under Terms of Use.