VTD-CLIP: Video-to-Text Discretization via Prompting CLIP

Wencheng Zhu; Yuexin Wang; Hongxuan Li; Pengfei Zhu

AAAI 2026

Abstract

Vision-language models bridge visual and linguistic understanding and have proven powerful for video recognition tasks. Existing methods rely primarily on parameter-efficient fine-tuning of pre-trained image-text models, and they suffer from limited interpretability and poor generalization because of inadequate temporal modeling. To address these limitations, we propose a simple yet effective video-to-text discretization framework. Our approach uses the frozen text encoder to build a visual codebook derived from video class labels, exploiting the many-to-one contrastive alignment between visual and textual embeddings in multimodal pretraining. This codebook allows temporal visual features to be transformed into discrete textual tokens via feature lookups, yielding interpretable video representations through explicit video modeling. To improve robustness against noisy or irrelevant frames, we introduce a confidence-aware fusion module that dynamically weights keyframes by their semantic relevance, as measured by the codebook. Furthermore, we incorporate learnable text prompts to update the codebook adaptively during training. Experiments on four datasets (HMDB-51, UCF-101, Something-Something-v2, and Kinetics-400) validate our approach, which achieves competitive improvements over state-of-the-art methods.
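The abstract describes the pipeline only at a high level, so the following is a minimal sketch of the idea rather than the authors' implementation. It assumes that the codebook is a list of class-label text embeddings, that the "feature lookup" is a nearest-neighbor search under cosine similarity, and that the confidence-aware fusion is a softmax over each frame's best similarity score. The function names, the `temperature` parameter, and the toy vectors are all hypothetical; in practice the embeddings would come from a frozen CLIP image and text encoder.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def softmax(scores, temperature=1.0):
    """Numerically stable softmax with a temperature (assumed hyperparameter)."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def discretize_video(frame_feats, codebook, temperature=0.1):
    """Map each frame to its nearest codebook entry, then fuse by confidence.

    frame_feats: list of T frame embeddings (stand-ins for image-encoder output).
    codebook:    list of K class-label embeddings (stand-ins for text-encoder output).
    Returns (tokens, weights, video): per-frame code indices, per-frame fusion
    weights, and the confidence-weighted sum of the selected code vectors.
    """
    frames = [l2_normalize(f) for f in frame_feats]
    codes = [l2_normalize(c) for c in codebook]

    tokens, confidences = [], []
    for f in frames:
        # Cosine similarity between this frame and every codebook entry.
        sims = [sum(a * b for a, b in zip(f, c)) for c in codes]
        best = max(range(len(codes)), key=lambda k: sims[k])
        tokens.append(best)             # discrete "textual token" for the frame
        confidences.append(sims[best])  # how well the frame matches its token

    # Assumed fusion rule: frames that match the codebook poorly get low weight.
    weights = softmax(confidences, temperature)
    dim = len(codes[0])
    video = [sum(w * codes[t][d] for w, t in zip(weights, tokens))
             for d in range(dim)]
    return tokens, weights, video

# Toy example: two classes, three frames; the third frame is ambiguous.
toy_codebook = [[1.0, 0.0], [0.0, 1.0]]
toy_frames = [[0.9, 0.1], [1.0, 0.0], [0.5, 0.6]]
tokens, weights, video = discretize_video(toy_frames, toy_codebook)
```

In this toy run the first two frames map to class 0 and the ambiguous third frame maps to class 1 with a much smaller fusion weight, so the fused representation stays close to the class 0 embedding. The learnable text prompts mentioned in the abstract would change the codebook vectors during training and are not modeled here.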

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Contrastive Learning
Deep Learning > Architectures > Transformers

Keywords

contrastive learning, temporal modeling, video recognition, vision-language model, text discretization
