IDseq: Decoupled and Sequentially Detecting and Grounding Multi-Modal Media Manipulation

Runxin Liu; Tian Xie; Jiaming Li; Lingyun Yu; Hongtao Xie

AAAI 2025

Abstract

Detecting and grounding multi-modal media manipulation aims to categorize the type of manipulation and localize the manipulated region for image-text pairs in both modalities. Existing methods have not sufficiently explored the intrinsic properties of manipulated images, which contain both forgery and content features, and therefore use these features inefficiently. To address this problem, we propose an Image-Driven Decoupled Sequential Framework (IDseq), designed to decouple image features and integrate them rationally to accomplish the different sub-tasks effectively. Specifically, IDseq employs two specially designed disentangled losses to guide the disentangled learning of forgery and content features. To leverage these features efficiently, we propose a Decoupled Image Manipulation Decoder (DIMD) that processes the image tasks within a decoupled schema. We mitigate the exclusive competition between the image tasks by separating them into forgery-relevant and content-relevant components and training these components without gradient interaction. Additionally, for the text tasks we use content features enhanced by the proposed Manipulation Indicator Generator (MIG), which provide the maximal visual information as a reference while eliminating interference from unverified image data. Extensive experiments show the superiority of IDseq: it outperforms SOTA methods by 3.8% in mAP on fine-grained classification, by 8.7% in IoUmean on forged face grounding, and by 1.3% in F1 on manipulated text grounding, the most challenging sub-task.
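For readers unfamiliar with the grounding metric quoted above, the following is a minimal sketch of how a mean IoU over predicted and ground-truth face boxes is commonly computed. It is not taken from the paper: the `(x1, y1, x2, y2)` corner format, the function names `box_iou` and `mean_iou`, and the choice to score an empty evaluation set as 0.0 are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# Assumed box format (not specified in the abstract): (x1, y1, x2, y2) corners.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle, if any.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0  # assumption: an empty evaluation set scores 0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)
```

Under these assumptions, a perfect prediction contributes 1.0 to the average and a prediction that misses the ground-truth box entirely contributes 0.0.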

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Representation Learning; Computer Vision > Analysis > Object Detection; Deep Learning > Learning Types > Multi-Modal Learning

Keywords

multi-modal learning; feature disentanglement; disentangled representation; image-text pair; image grounding; forgery detection; disentangled learning; multimodal media manipulation; media manipulation; multi-modal manipulation
