ACL 2025

Data Gatherer: LLM-Powered Dataset Reference Extraction from Scientific Literature

Abstract

Despite growing emphasis on data sharing and the proliferation of open datasets, researchers face significant challenges in discovering relevant datasets for reuse and systematically identifying dataset references within scientific literature. We present Data Gatherer, an automated system that leverages large language models to identify and extract dataset references from scientific publications. To evaluate our approach, we developed and curated two high-quality benchmark datasets specifically designed for dataset identification tasks. Our experimental evaluation demonstrates that Data Gatherer achieves high precision and recall in automated dataset reference extraction, reducing the time and effort required for dataset discovery while improving the systematic identification of data sources in scholarly literature.
