TOP-RL: Task-Optimized Progressive Token Pruning with Reinforcement Learning for Vision Language Models

Hengyi Wang; Weiying Xie; Hui Jiang; Yaotao Wei; Kai Jiang; Mingxiang Cao; Chenhe Hao; Leyuan Fang

AAAI 2026

Abstract

In recent years, Large Vision-Language Models (LVLMs) have significantly advanced multimodal tasks. However, their inference requires intensive processing of numerous visual tokens and incurs substantial computational overhead. Existing methods typically compress visual tokens either at the input stage or in early model layers, ignoring variations across tasks and depths. To address these limitations, we introduce TOP-RL, a Task-Optimized Progressive token pruning framework based on Reinforcement Learning. TOP-RL formulates visual token pruning as a multi-stage Markov Decision Process (MDP). It employs an agent trained with dense and fine-grained reward signals to progressively generate differentiable binary masks. This enables TOP-RL to adaptively select crucial visual tokens tailored to each task, effectively balancing accuracy and computational efficiency. Extensive experiments on leading multimodal datasets and advanced LVLMs validate that TOP-RL effectively learns task-optimized pruning policies, significantly boosting inference efficiency while preserving robust performance. For instance, LLaVA-NeXT equipped with TOP-RL achieves a 1.9x speedup in inference time and a 9.3x reduction in FLOPs, with 96% performance preserved.
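The multi-stage structure described in the abstract can be illustrated with a toy sketch. It is not the authors' implementation: the abstract does not say how the agent scores tokens, how the masks are made differentiable, or how rewards are computed. The sketch assumes a hypothetical per-token importance score at each pruning stage, turns it into a keep probability with a logistic function, and samples a hard binary keep/drop action per token. All names (`keep_probability`, `prune_stage`, `progressive_prune`) are made up for illustration, and the differentiable relaxation and the RL training loop are omitted entirely.

```python
import math
import random


def keep_probability(score: float) -> float:
    """Map a hypothetical importance score to a keep probability (logistic)."""
    return 1.0 / (1.0 + math.exp(-score))


def prune_stage(mask, scores, rng):
    """One pruning step: sample a binary keep/drop action per token.

    Tokens that were already dropped (mask bit 0) stay dropped, which is
    what makes the pruning progressive across stages.
    """
    new_mask = []
    for alive, score in zip(mask, scores):
        if alive and rng.random() < keep_probability(score):
            new_mask.append(1)
        else:
            new_mask.append(0)
    return new_mask


def progressive_prune(stage_scores, num_tokens, seed=0):
    """Apply one pruning step per stage and return the mask after each stage.

    stage_scores: one list of per-token scores for every stage (assumed to
    come from some learned policy; here they are plain inputs).
    """
    rng = random.Random(seed)
    mask = [1] * num_tokens  # all visual tokens are kept initially
    history = []
    for scores in stage_scores:
        mask = prune_stage(mask, scores, rng)
        history.append(mask)
    return history


if __name__ == "__main__":
    # Two stages over four tokens; higher scores mean "more likely to keep".
    scores = [[50.0, 0.0, 0.0, 50.0], [50.0, 0.0, 0.0, 0.0]]
    for stage, mask in enumerate(progressive_prune(scores, num_tokens=4), 1):
        print(f"stage {stage}: kept {sum(mask)} of {len(mask)} tokens")
```

Because each stage can only turn mask bits off, the number of visual tokens processed by later layers never grows, which is where the computational savings would come from in a scheme of this kind. How many tokens survive at each depth is governed by the scores, which in this sketch stand in for whatever the trained agent would produce.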

Topics

Artificial Intelligence > Core AI > Model Compression
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Optimization & Theory > Optimization

Keywords

model compression, reinforcement learning, multimodal learning, vision language model, token pruning
