Croissant: A Metadata Format for ML-Ready Datasets

Mubashara Akhtar; Omar Benjelloun; Costanza Conforti; Luca Foschini; Pieter Gijsbers; Joan Giner-Miguelez; Sujata Goswami; Nitisha Jain; Michalis Karamousadakis; Satyapriya Krishna; Michael Kuchnik; Sylvain Lesage; Quentin Lhoest; Pierre Marcenac; Manil Maskey; Peter Mattson; Luis Oala; Hamidah Oderinwale; Pierre Ruyssen; Tim Santos; Rajat Shinde; Elena Simperl; Arjun Suresh; Goeffry Thomas; Slava Tykhonov; Joaquin Vanschoren; Susheel Varma; Jos van der Velde; Steffen Vogler; Carole-Jean Wu; Luyao Zhang

NeurIPS 2024

Abstract

Data is a critical resource for machine learning (ML), yet working with data remains a key friction point. This paper introduces Croissant, a metadata format for datasets that creates a shared representation across ML tools, frameworks, and platforms. Croissant makes datasets more discoverable, portable, and interoperable, thereby addressing significant challenges in ML data management. Croissant is already supported by several popular dataset repositories, spanning hundreds of thousands of datasets, and enables easy loading into the most commonly used ML frameworks, regardless of where the data is stored. Our initial evaluation by human raters shows that Croissant metadata is readable, understandable, and complete, yet concise.

Topics

Machine Learning > Application Areas > Data Augmentation; Machine Learning > Application Areas > Efficient Computing; Data Science & Analytics > Methods > Data Mining; Computer Science > Applications > Software Engineering; Computer Science > Applications > Databases

Keywords

machine learning; metadata format; dataset interoperability; data management; machine learning dataset; dataset management; data interoperability; dataset discovery
