LENS: Learning to Segment Anything with Unified Reinforced Reasoning

Lianghui Zhu; Bin Ouyang; Yuxuan Zhang; Tianheng Cheng; Rui Hu; Haocheng Shen; Longjin Ran; Xiaoxin Chen; Li Yu; Wenyu Liu; Xinggang Wang

AAAI 2026


Abstract

Text-prompted image segmentation enables fine-grained visual understanding and is critical for applications such as human-computer interaction and robotics. However, existing supervised fine-tuning methods typically omit explicit chain-of-thought (CoT) reasoning at test time, which limits their ability to generalize to unseen prompts and domains. To address this issue, we introduce LENS, a scalable reinforcement-learning (RL) framework that jointly optimizes the reasoning process and the segmentation in an end-to-end manner. We propose unified reinforcement-learning rewards that span sentence-, box-, and segment-level cues, encouraging the model to generate informative CoT rationales while refining mask quality. Built on a publicly available 3-billion-parameter vision–language model, Qwen2.5-VL-3B-Instruct, LENS achieves an average cIoU of 81.2% on the RefCOCO, RefCOCO+, and RefCOCOg benchmarks, outperforming GLaMM, a strong fine-tuned method, by up to 5.6%. These results demonstrate that RL-driven CoT reasoning significantly enhances text-prompted segmentation and offers a practical path toward more generalizable Segment Anything models (SAM).
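For readers unfamiliar with the cIoU figure quoted above, the following is a minimal sketch of cumulative IoU as it is commonly defined for referring-segmentation benchmarks: total intersection pixels over the whole evaluation set divided by total union pixels. The function name `cumulative_iou` and the toy masks are illustrative assumptions, not taken from the LENS code, and the authors' evaluation script may differ in details.

```python
import numpy as np

def cumulative_iou(pred_masks, gt_masks):
    """Cumulative IoU: sum of intersections over sum of unions across samples.

    Hypothetical helper for illustration only; not the authors' implementation.
    """
    total_inter = 0
    total_union = 0
    for pred, gt in zip(pred_masks, gt_masks):
        pred = np.asarray(pred, dtype=bool)
        gt = np.asarray(gt, dtype=bool)
        # Accumulate pixel counts over the dataset instead of averaging per-image IoU.
        total_inter += np.logical_and(pred, gt).sum()
        total_union += np.logical_or(pred, gt).sum()
    # Guard against an empty evaluation set or all-empty masks.
    return float(total_inter / total_union) if total_union > 0 else 0.0

# Toy example with two 2x2 masks (made-up data).
preds = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
gts = [[[1, 0], [0, 0]], [[1, 1], [1, 1]]]
score = cumulative_iou(preds, gts)
```

Because pixel counts are pooled before dividing, larger objects weigh more heavily in cIoU than in a per-image mean IoU; in the toy example the pooled ratio is 5/6, whereas averaging the two per-image IoUs would give 0.75.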



Topics

Artificial Intelligence > Core AI > Agent Systems
Artificial Intelligence > Core AI > Interpretability
Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning, chain-of-thought reasoning, vision-language model, segment anything model, text-prompted segmentation

