AAAI 2026

VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing methods primarily rely on parameter-efficient fine-tuning of pre-trained image-text models, and they suffer from limited interpretability and poor generalization due to inadequate temporal modeling. To address these limitations, we propose a simple yet effective video-to-text discretization framework. Our approach uses the frozen text encoder to build a visual codebook derived from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. The codebook lets us transform temporal visual features into discrete textual tokens via feature lookups, yielding interpretable video representations through explicit video modeling. To improve robustness against noisy or irrelevant frames, we then introduce a confidence-aware fusion module that dynamically weights keyframes by their semantic relevance, as measured by the codebook. Furthermore, we incorporate learnable text prompts so that the codebook is updated adaptively during training. Experiments on four datasets (HMDB-51, UCF-101, Something-Something-v2, and Kinetics-400) validate our approach, which achieves competitive improvements over state-of-the-art methods.
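The codebook lookup and confidence-aware fusion described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the nearest-neighbor (argmax) lookup, the softmax weighting over frames, and the `temperature` value are all assumptions made for illustration, and small hand-written vectors stand in for real CLIP text and frame embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to (approximately) unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def discretize_and_fuse(frame_feats, codebook, temperature=0.07):
    """Hypothetical sketch of video-to-text discretization.

    frame_feats: (T, D) per-frame visual embeddings.
    codebook:    (K, D) text embeddings, one per class label.
    Returns the chosen code index per frame, the per-frame fusion
    weights, and the fused video-level feature.
    """
    f = l2_normalize(frame_feats)
    c = l2_normalize(codebook)

    sims = f @ c.T                    # (T, K) cosine similarities
    token_ids = sims.argmax(axis=1)   # feature lookup: nearest code per frame
    confidence = sims.max(axis=1)     # similarity to the selected code

    # Assumed fusion rule: softmax over frames, so frames that match the
    # codebook poorly contribute less to the video representation.
    logits = confidence / temperature
    logits = logits - logits.max()    # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum()

    video_feat = weights @ c[token_ids]   # (D,) weighted sum of selected codes
    return token_ids, weights, video_feat


# Toy inputs: three "class label" codes and three frames in a 3-D space.
toy_codebook = np.eye(3)
toy_frames = np.array([[1.0, 0.0, 0.0],
                       [0.9, 0.1, 0.0],
                       [0.0, 0.0, 1.0]])
ids, w, v = discretize_and_fuse(toy_frames, toy_codebook)
```

In the paper's setting the codebook rows would come from the frozen text encoder applied to prompted class labels, with the prompts learned during training; the sketch treats the codebook as fixed.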
