2026 AAAI AAAI 2026

OneLIP: Unlocking and Improving Long-Text Representations of CLIP via One-Stage Adaptation

Abstract

Abstract Contrastive Language-Image Pretraining (CLIP) has demonstrated impressive generalization on vision-language tasks by aligning images and short texts. However, its inherent 77-token length limits the capacity of capturing complex semantics in long captions. Existing long-text adaptations for CLIP typically rely on either multi-stage training or truncation-based alignment, both inevitably resulting in semantic degradation and cumbersome tuning. Therefore, we propose OneLIP, a unified framework that extends CLIP to understand long captions within a single training stage, eliminating the need for brittle truncation or multi-stage pipelines. OneLIP addresses semantic degradation by introducing two key innovations: (1) Token Refinement and Importance-guided Modeling (TRIM) module, which selects and refines informative tokens via SVD-based contribution scoring and cross-modal relevance modeling; (2) Per-sample Online Hard Negative Mining (PO-HNM) strategy dynamically maintains sample-specific negatives based on dual-consistency difficulty tracking, which is superior in long-text scenarios where key semantics are distributed in scattered positions. Extensive experiments on long-text image retrieval, short-text image retrieval, zero-shot classification, and text-to-image generation demonstrate OneLIP's robustness and versatility across diverse input lengths, offering a faithful solution for long-text representation learning of CLIP.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio