Conical Visual Concentration for Efficient Large Vision-Language Models

Long Xing; Qidong Huang; Xiaoyi Dong; Jiajie Lu; Pan Zhang; Yuhang Zang; Yuhang Cao; Conghui He; Jiaqi Wang; Feng Wu; Dahua Lin

CVPR 2025

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" implies, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as input image resolution increases, severely impacting efficiency. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably result in the loss of crucial image information. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers. Based on this observation, we propose ViCo, a conical-style visual concentration strategy for LVLMs that boosts their efficiency in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that ViCo can achieve over 40% training time reduction and a 55% reduction in inference FLOPs on leading LVLMs like LLaVA-NeXT, while maintaining comparable multi-modal performance. In addition, ViCo can serve as a plug-and-play strategy that accelerates inference at no extra training cost, with better performance and lower inference cost than counterparts.
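The stage-wise dropping described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the similarity measure or the ratios, so the cosine similarity against a single query vector, the keep ratio of 0.5, and the function names `drop_image_tokens` and `conical_schedule` are all assumptions made for illustration. The transformer layers that would run between stages are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def drop_image_tokens(image_tokens, query, keep_ratio):
    """Keep the fraction `keep_ratio` of image tokens most similar to `query`.

    Assumption: `query` stands in for whatever text-side representation the
    similarity is computed against; the abstract does not say which one.
    Surviving tokens keep their original order.
    """
    n_keep = max(1, math.ceil(len(image_tokens) * keep_ratio))
    scores = [cosine(tok, query) for tok in image_tokens]
    ranked = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(ranked[:n_keep])  # restore positional order
    return [image_tokens[i] for i in kept]


def conical_schedule(num_tokens, keep_ratios):
    """Image-token count entering each stage, followed by the final count."""
    counts = [num_tokens]
    for ratio in keep_ratios:
        counts.append(max(1, math.ceil(counts[-1] * ratio)))
    return counts


if __name__ == "__main__":
    # Hypothetical setup: 576 image tokens, three stage boundaries, half kept at each.
    print(conical_schedule(576, [0.5, 0.5, 0.5]))
    toy_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
    print(drop_image_tokens(toy_tokens, query=[1.0, 0.0], keep_ratio=0.5))
```

Because the token count shrinks at every stage boundary, the deeper stages process far fewer image tokens than the shallow ones, which is where the "conical" shape comes from; the shallow stages still see every image token, in line with the empirical observation stated in the abstract.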

Topics

Artificial Intelligence > Core AI > Foundation Models
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Application Areas > Efficient Computing
Deep Learning > Models > Large Language Models
Computer Vision > Core AI > Efficient Computing
Deep Learning > Optimization & Theory > Efficient Computing
Deep Learning > Learning Types > Multimodal Learning
Deep Learning > Models > Vision-Language Models

Keywords

multimodal learning, efficient computing, vision-language model, inference acceleration, visual token pruning, token reduction, vision language, large language model, computational efficiency optimization, conical concentration strategy
