MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Francisco Valentini; Viviana Cotik; Damián Furman; Ivan Bercovich; Edgar Altszyler; Juan Manuel Pérez

2025 EMNLP EMNLP 2025

MessIRve: A Large-Scale Spanish Information Retrieval Dataset

Abstract

AbstractInformation retrieval (IR) is the task of finding relevant documents in response to a user query. Although Spanish is the second most spoken native language, there are few Spanish IR datasets, which limits the development of information access tools for Spanish speakers. We introduce MessIRve, a large-scale Spanish IR dataset with almost 700,000 queries from Google’s autocomplete API and relevant documents sourced from Wikipedia. MessIRve’s queries reflect diverse Spanish-speaking regions, unlike other datasets that are translated from English or do not consider dialectal variations. The large size of the dataset allows it to cover a wide variety of topics, unlike smaller datasets. We provide a comprehensive description of the dataset, comparisons with existing datasets, and baseline evaluations of prominent IR models. Our contributions aim to advance Spanish IR research and improve information access for Spanish speakers.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — query retrieval

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Francisco Valentini , Viviana Cotik , Damián Furman , Ivan Bercovich , Edgar Altszyler , Juan Manuel Pérez

Topics

Natural Language Processing > Applications > Information Retrieval Natural Language Processing > Resources & Methods > Multilingual NLP Machine Learning > Application Areas > Information Retrieval

Keywords

multilingual nlp information retrieval web search cross-lingual retrieval spanish language query retrieval

Download PDF

Related papers

Bit-Flip Error Resilience in LLMs: A Comprehensive Analysis and Defense Framework 2025

VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing 2025

Model-based Large Language Model Customization as Service 2025

ZoomEye: Enhancing Multimodal LLMs with Human-Like Zooming Capabilities through Tree-Based Image Exploration 2025

SlideCoder: Layout-aware RAG-enhanced Hierarchical Slide Generation from Design 2025