Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?

Yuechen Xie; Jie Song; Huiqiong Wang; Mingli Song

2025 CVPR CVPR 2025

Training Data Provenance Verification: Did Your Model Use Synthetic Data from My Generative Model for Training?

Abstract

High-quality open-source text-to-image models have lowered the threshold for obtaining photorealistic images significantly, but also face potential risks of misuse. Specifically, suspects may use synthetic data generated by these generative models to train models for specific tasks without permission, when lacking real data resources especially. Protecting these generative models is crucial for the well-being of their owners. In this work, we propose the first method to this important yet unresolved issue, called Training data Provenance Verification (TrainProVe). The rationale behind TrainProVe is grounded in the principle of generalization error bound, which suggests that, for two models with the same task, if the distance between their training data distributions is smaller, their generalization ability will be closer. We validate the efficacy of TrainProVe across four text-to-image models (Stable Diffusion v1.4, latent consistency model, PixArt-\alpha, and Stable Cascade). The results show that TrainProVe achieves a verification accuracy of over 99% in determining the provenance of suspicious model training data, surpassing all previous methods. Code is available at https://github.com/xieyc99/TrainProVe.

❓ The Questioner

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Security & Privacy

🧭 Keyword Pioneer — training data provenance

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuechen Xie , Jie Song , Huiqiong Wang , Mingli Song

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Responsible AI Machine Learning > Optimization & Theory > Statistical Learning Machine Learning > Application Areas > Domain Adaptation Deep Learning > Models > Generative Models Machine Learning > Learning Types > Transfer Learning Security & Privacy > Privacy Deep Learning > Learning Types > Transfer Learning Machine Learning > Optimization & Theory > Generalization

Keywords

transfer learning data provenance generalization error generalization error bound generative model synthetic datum text-to-image model model verification synthetic data detection training data provenance model attribution

Download PDF

Related papers

AnyCam: Learning to Recover Camera Poses and Intrinsics from Casual Videos 2025

SeriesBench: A Benchmark for Narrative-Driven Drama Series Understanding 2025

FADE: Frequency-Aware Diffusion Model Factorization for Video Editing 2025

Fast and Accurate Gigapixel Pathological Image Classification with Hierarchical Distillation Multi-Instance Learning 2025

Reversible Decoupling Network for Single Image Reflection Removal 2025