How much of my dataset did you use? Quantitative Data Usage Inference in Machine Learning

Yao Tong; Jiayuan Ye; Sajjad Zarifzadeh; Reza Shokri

ICLR 2025

Abstract

How much of my data was used to train a machine learning model? This is a critical question for data owners assessing the risk of unauthorized use of their data to train models. However, previous work mistakenly treats this as a binary problem, inferring whether all-or-none or any-or-none of the data was used, which is fragile when faced with real, non-binary data usage risks. To address this, we propose a fine-grained analysis called Dataset Usage Cardinality Inference (DUCI), which estimates the exact proportion of data used. Our algorithm, leveraging debiased membership guesses, matches the performance of the optimal MLE approach (with a maximum error < 0.1) at a significantly lower computational cost (e.g., $300\times$ less).
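To make the idea of debiased membership guesses concrete, here is a minimal sketch and not the authors' implementation. It assumes each record receives a binary membership guess from some membership inference attack, and that the attack's true positive rate `tpr` and false positive rate `fpr` are known (for example, estimated on a reference model). Under those assumptions, the expected fraction of positive guesses is `p * tpr + (1 - p) * fpr`, so inverting that relation yields an estimate of the usage proportion `p`. The function names and the simulation helper are hypothetical.

```python
import random


def estimate_usage_proportion(guesses, tpr, fpr):
    """Debias the mean of binary membership guesses into a usage proportion.

    Assumes E[mean(guesses)] = p * tpr + (1 - p) * fpr, where p is the
    fraction of the dataset that was used for training (an assumption of
    this sketch, not a statement of the paper's exact estimator).
    """
    if not guesses:
        raise ValueError("guesses must be non-empty")
    if tpr <= fpr:
        raise ValueError("the attack must satisfy tpr > fpr")
    raw_rate = sum(guesses) / len(guesses)
    p_hat = (raw_rate - fpr) / (tpr - fpr)
    # Clip to [0, 1], since a proportion outside that range is meaningless.
    return min(1.0, max(0.0, p_hat))


def simulate_guesses(n, p, tpr, fpr, seed=0):
    """Hypothetical helper: simulate noisy guesses for n records,
    each used in training independently with probability p."""
    rng = random.Random(seed)
    guesses = []
    for _ in range(n):
        used = rng.random() < p
        positive_rate = tpr if used else fpr
        guesses.append(1 if rng.random() < positive_rate else 0)
    return guesses


if __name__ == "__main__":
    simulated = simulate_guesses(n=20000, p=0.3, tpr=0.7, fpr=0.2)
    print(estimate_usage_proportion(simulated, tpr=0.7, fpr=0.2))
```

Without the debiasing step, the raw fraction of positive guesses would systematically misstate the proportion whenever the attack has a nonzero false positive rate or an imperfect true positive rate.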
