Reasoning-Enhanced Domain-Adaptive Pretraining of Multimodal Large Language Models for Short Video Content Governance

Zixuan Wang; Yu Sun; Hongwei Wang; Baoyu Jing; Xiang Shen; Xin Dong; Zhuolin Hao; Hongyu Xiong; Yang Song

EMNLP 2025

Abstract

Short video platforms are evolving rapidly, making the identification of inappropriate content increasingly critical. Existing approaches typically train separate, small classification models for each type of issue, which requires extensive human-labeled data and lacks cross-issue generalization. We propose a reasoning-enhanced multimodal large language model (MLLM) pretraining paradigm for unified inappropriate content detection. To address the distribution gap between short video content and the original pretraining data of MLLMs, as well as the complex issue definitions, we introduce three targeted pretraining tasks: (1) Caption, to enhance the MLLM’s perception of video details; (2) Visual Question Answering (VQA), to deepen the MLLM’s understanding of issue definitions and annotation guidelines; (3) Chain-of-Thought (CoT), to enhance the MLLM’s reasoning capability. Experimental results show that our pretraining approach significantly improves the MLLM’s performance in both zero-shot and supervised fine-tuning (SFT) settings. In addition, our pretrained model demonstrates strong generalization capabilities to emergent, previously unseen issues.
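As a rough illustration of how the three pretraining tasks named in the abstract could be laid out as training records, the sketch below builds one Caption sample, several VQA samples, and one CoT sample for a single video. Everything in it is an assumption made for illustration: the `PretrainSample` record, the `build_samples` helper, the field names, and the prompt wording are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PretrainSample:
    """Hypothetical record for one pretraining example (not the paper's schema)."""
    task: str       # "caption", "vqa", or "cot"
    video_id: str   # identifier of the short video the sample refers to
    prompt: str     # text instruction given to the MLLM alongside the video
    target: str     # text the MLLM is trained to produce


def build_samples(
    video_id: str,
    caption: str,
    qa_pairs: List[Tuple[str, str]],
    issue: str,
    rationale: str,
    verdict: str,
) -> List[PretrainSample]:
    """Turn the annotations for one video into Caption, VQA, and CoT samples."""
    # Caption: the target is a detailed description of the video.
    samples = [
        PretrainSample("caption", video_id, "Describe the video in detail.", caption)
    ]
    # VQA: one sample per question/answer pair about the issue definition.
    for question, answer in qa_pairs:
        samples.append(PretrainSample("vqa", video_id, question, answer))
    # CoT: the target contains the reasoning followed by the final verdict.
    cot_prompt = f"Does this video violate the policy on '{issue}'? Reason step by step."
    cot_target = f"{rationale}\nVerdict: {verdict}"
    samples.append(PretrainSample("cot", video_id, cot_prompt, cot_target))
    return samples


if __name__ == "__main__":
    demo = build_samples(
        video_id="vid_001",
        caption="A person pours liquid from an unlabeled bottle into a glass.",
        qa_pairs=[("Is any product label visible?", "No.")],
        issue="unsafe products",
        rationale="No label is visible, so the product cannot be identified.",
        verdict="needs review",
    )
    for sample in demo:
        print(sample.task, "|", sample.prompt)
```

Keeping all three tasks in one record shape means a single data loader can mix them during pretraining; how the paper actually mixes or weights the tasks is not stated in the abstract.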

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Domain Adaptation
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Models > Large Language Models
Computer Vision > Core AI > Multimodal Learning
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Learning Types > Domain Adaptation
Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning; domain adaptation; visual question answering; chain-of-thought reasoning; multimodal learning; content moderation; video understanding; multimodal large language model; large language model; domain-adaptive pretraining; content detection
