Unleashing the Potential of Consistency Learning for Detecting and Grounding Multi-Modal Media Manipulation

Yiheng Li; Yang Yang; Zichang Tan; Huan Liu; Weihua Chen; Xu Zhou; Zhen Lei

CVPR 2025

Abstract

To tackle the threat of fake news, the task of detecting and grounding multi-modal media manipulation (DGM4) has received increasing attention. However, most state-of-the-art methods fail to explore the fine-grained consistency within local content, which usually leads to an inadequate perception of detailed forgery and to unreliable results. In this paper, we propose a novel approach named Contextual-Semantic Consistency Learning (CSCL) to enhance the fine-grained perception of forgery for DGM4. Two branches are established, one for the image modality and one for the text modality. Each branch contains two cascaded decoders, the Contextual Consistency Decoder (CCD) and the Semantic Consistency Decoder (SCD), which capture within-modality contextual consistency and across-modality semantic consistency, respectively. Both CCD and SCD follow the same procedure for capturing fine-grained forgery details. Specifically, each module first constructs consistency features by leveraging additional supervision from the heterogeneous information of each token pair. Forgery-aware reasoning or aggregation is then applied to these consistency features to uncover forgery cues. Extensive experiments on DGM4 datasets show that CSCL achieves new state-of-the-art performance, especially in grounding manipulated content. Code and weights are available at https://github.com/liyih/CSCL.
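The idea of supervising consistency between token pairs can be illustrated with a minimal sketch. This is not the authors' implementation: the cosine-similarity score, the "same label means consistent" target, the squared-error loss, and all function names below are assumptions chosen only for illustration, and the sketch covers only the within-modality case.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

def consistency_matrix(tokens):
    """Pairwise similarity for every token pair (hypothetical consistency score)."""
    return [[cosine(u, v) for v in tokens] for u in tokens]

def consistency_targets(is_manipulated):
    """Assumed supervision: a pair is consistent (1.0) when both tokens share
    the same real/manipulated label, and inconsistent (0.0) otherwise."""
    return [[1.0 if a == b else 0.0 for b in is_manipulated]
            for a in is_manipulated]

def consistency_loss(pred, target):
    """Mean squared error after mapping cosine values from [-1, 1] to [0, 1]."""
    n = len(pred)
    total = 0.0
    for i in range(n):
        for j in range(n):
            p = (pred[i][j] + 1.0) / 2.0
            total += (p - target[i][j]) ** 2
    return total / (n * n)

# Toy example: two untouched tokens and one manipulated token.
tokens = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
labels = [False, False, True]

pred = consistency_matrix(tokens)
target = consistency_targets(labels)
loss = consistency_loss(pred, target)
```

In this toy setup the two untouched tokens are scored as fully consistent with each other, while every pair that mixes an untouched and a manipulated token contributes to the loss. A cross-modality variant would score image tokens against text tokens in the same way.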

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Anomaly Detection
Computer Vision > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Computer Vision
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

multimodal learning, multi-modal learning, semantic consistency, fine-grained detection, consistency learning, forgery detection, contextual consistency, media manipulation detection, media manipulation
