"What is Different Between These Datasets?" A Framework for Explaining Data Distribution Shifts

Varun Babbar*; Zhicheng Guo*; Cynthia Rudin

JMLR 2025

Abstract

The performance of machine learning models relies heavily on the quality of input data, yet real-world applications often face significant data-related challenges. A common issue arises when curating training data or deploying models: two datasets from the same domain may exhibit differing distributions. While many techniques exist for detecting such distribution shifts, there is a lack of comprehensive methods that explain these differences in a human-understandable way rather than through opaque quantitative metrics. To bridge this gap, we propose a versatile framework of interpretable methods for comparing datasets. Using a variety of case studies, we demonstrate the effectiveness of our approach across diverse data modalities, including tabular data, text data, images, and time-series signals, in both low- and high-dimensional settings. These methods complement existing techniques by providing actionable and interpretable insights to better understand and address distribution shifts.

© JMLR 2025.

Topics

Machine Learning > Application Areas > Domain Generalization
Data Science & Analytics > Methods > Data Mining

Keywords

domain generalization, distribution shift, data modality, dataset comparison, interpretable method
