CLIRMatrix

A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Citation

@inproceedings{sun2020clirmatrix,
  title={CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval},
  author={Sun, Shuo and Duh, Kevin},
  booktitle={Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  pages={4160--4170},
  year={2020}
}

BI-139

A bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs.
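
The pair count above follows from treating each (query language, document language) combination as an ordered pair of two distinct languages. A minimal sketch of that arithmetic, assuming placeholder language codes (the names lang0, lang1, ... are hypothetical and not the dataset's real codes):

```python
from itertools import permutations

NUM_LANGUAGES = 139  # number of languages stated in the BI-139 description


def count_ordered_pairs(n: int) -> int:
    """Number of (query language, document language) pairs with distinct languages."""
    return n * (n - 1)


# Cross-check the closed form by enumerating ordered pairs of placeholder codes.
# These codes are illustrative only; they are not the dataset's language codes.
placeholder_langs = [f"lang{i}" for i in range(NUM_LANGUAGES)]
pair_count = sum(1 for _ in permutations(placeholder_langs, 2))

print(count_ordered_pairs(NUM_LANGUAGES), pair_count)
```

Both values come out to 19,182, matching the figure in the description.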

MULTI-8

A multilingual dataset of queries and documents jointly aligned in 8 different languages.

Documents

A collection of Wikipedia documents.
Format: document ID<TAB>text, with the two fields separated by a single tab character.
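
A record in this format could be read along the following lines. This is a sketch, not an official loader: it assumes one document per line and splits on the first tab only, so that any later tab stays in the text. The function name parse_documents and the sample IDs and texts are made up for illustration.

```python
from typing import Iterable, Iterator, Tuple


def parse_documents(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (document ID, text) pairs from tab-separated lines.

    Assumes one document per line (an assumption, not stated in the source).
    """
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        # Split on the first tab only; later tabs remain part of the text.
        doc_id, text = line.split("\t", 1)
        yield doc_id, text


# Made-up sample lines, not taken from the dataset.
sample = [
    "123\tFirst document text.\n",
    "456\tText with\ta tab inside.\n",
]
docs = list(parse_documents(sample))
print(docs)
```

For a real file, an open text-mode file handle can be passed in place of the sample list, since iterating over it yields one line at a time.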

This work is licensed under a Creative Commons Attribution 4.0 International License.