CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Shuo Sun; Kevin Duh

2020 EMNLP EMNLP 2020

CLIRMatrix: A massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval

Abstract

AbstractWe present CLIRMatrix, a massively large collection of bilingual and multilingual datasets for Cross-Lingual Information Retrieval extracted automatically from Wikipedia. CLIRMatrix comprises (1) BI-139, a bilingual dataset of queries in one language matched with relevant documents in another language for 139x138=19,182 language pairs, and (2) MULTI-8, a multilingual dataset of queries and documents jointly aligned in 8 different languages. In total, we mined 49 million unique queries and 34 billion (query, document, label) triplets, making it the largest and most comprehensive CLIR dataset to date. This collection is intended to support research in end-to-end neural information retrieval and is publicly available at [url]. We provide baseline neural model results on BI-139, and evaluate MULTI-8 in both single-language retrieval and mix-language retrieval settings.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

📈 Trend Setter — Transfer Learning

🧭 Keyword Pioneer — wikipedia mining

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Shuo Sun , Kevin Duh

Topics

Natural Language Processing > Applications > Information Retrieval Natural Language Processing > Resources & Methods > Multilingual NLP Machine Learning > Application Areas > Transfer Learning Deep Learning > Learning Types > Transfer Learning

Keywords

transfer learning multilingual nlp cross-lingual information retrieval query-document matching neural information retrieval multilingual dataset bilingual dataset wikipedia mining

Download PDF

Related papers

Fast semantic parsing with well-typedness guarantees 2020

Detecting Objectifying Language in Online Professor Reviews 2020

Analogous Process Structure Induction for Sub-event Sequence Prediction 2020

Aspect Sentiment Classification with Aspect-Specific Opinion Spans 2020

Robust and Interpretable Grounding of Spatial References with Relation Networks 2020