GEXIA: Granularity Expansion and Iterative Approximation for Scalable Multi-Grained Video-Language Learning

Yicheng Wang; Zhikang Zhang; Jue Wang; David Fan; Zhenlin Xu; Linda Liu; Xiang Hao; Vimal Bhat; Xinyu Li

WACV 2025

Abstract

In various video-language learning tasks, the challenge of achieving cross-modality alignment with multi-grained data persists. We propose a method to tackle this challenge from two crucial perspectives: data and modeling. Given the absence of a multi-grained video-text pretraining dataset, we introduce a Granularity EXpansion (GEX) method with Integration and Compression operations to expand the granularity of a single-grained dataset. To better model multi-grained data, we introduce an Iterative Approximation Module (IAM), which embeds multi-grained videos and texts into a unified, low-dimensional semantic space while preserving essential information for cross-modal alignment. Furthermore, GEXIA is highly scalable, with no restrictions on the number of video-text granularities for alignment. We evaluate our work on three categories of video tasks across seven benchmark datasets, showcasing state-of-the-art or comparable performance. Remarkably, our model excels in tasks involving long-form video understanding, even though the pretraining dataset only contains short video clips.
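The abstract does not spell out how the GEX operations work. As a rough illustration only, one plausible reading of Integration is that consecutive short video-text pairs are merged into a single coarser pair. The sketch below follows that assumption; the `Pair` class, the `integrate` function, and the `group_size` parameter are hypothetical names introduced here, not taken from the paper, and Compression is not sketched because the abstract gives no detail on it.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Pair:
    """One video-text pair; `clips` lists the source clip ids it covers."""
    clips: list[str]
    text: str


def integrate(pairs: list[Pair], group_size: int) -> list[Pair]:
    """Merge each run of `group_size` consecutive pairs into one coarser pair.

    Assumed reading of "Integration" (not confirmed by the source): clip
    lists are concatenated and captions are joined with a space. A trailing
    run shorter than `group_size` is dropped.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    merged = []
    for start in range(0, len(pairs) - group_size + 1, group_size):
        group = pairs[start:start + group_size]
        merged.append(Pair(
            clips=[c for p in group for c in p.clips],
            text=" ".join(p.text for p in group),
        ))
    return merged


# Five single-clip pairs stand in for a short-clip pretraining dataset.
short = [Pair([f"clip{i}"], f"caption {i}.") for i in range(5)]
# Grouping by two yields a coarser granularity built from the same data.
longer = integrate(short, group_size=2)
```

Calling `integrate` with several group sizes would give several granularities from one single-grained dataset, which is the general idea the abstract describes; the actual grouping and text handling in GEXIA may differ.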

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Core Methods > Representation Learning
Machine Learning > Core Methods > Embedding Learning
Computer Vision > Processing > Video Understanding
Natural Language Processing > Understanding > Semantic Analysis
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

video understanding, semantic space, semantic embedding, cross-modal alignment, multi-grained learning, video-language learning, iterative approximation
