Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

Yuheng Zha; Kun Zhou; Yujia Wu; Yushu Wang; Jie Feng; Zhi Xu; Shibo Hao; Zhengzhong Liu; Eric P. Xing; Zhiting Hu

2026 AAAI AAAI 2026

Vision-G1: Towards General Reasoning Vision-Language Models via Reinforcement Learning

Abstract

Abstract Recent vision-language models (VLMs) show strong reasoning capabilities through training with reinforcement learning from verifiable rewards (RLVR). Despite their impressive capabilities, current VLMs focus on a limited range of reasoning tasks, such as mathematical and logical reasoning, due to the lack of readily available verifiable reward data in broader domains. As a result, these models struggle to generalize their reasoning abilities to the wide variety of challenges encountered in real-world environments. To address this limitation, we collect and assemble a comprehensive RL-ready visual reasoning training dataset encompassing 46 datasets across 13 dimensions of 5 domains, covering a wide range of realistic scenarios such as infographic reasoning, mathematical reasoning, spatial reasoning, and general science reasoning. Based on this dataset, we propose an influence function-based data filtering strategy and a multi-round data curriculum method to iteratively strengthen general visual reasoning abilities. Using this approach, we train a general reasoning VLM, namely Vision-G1. Our 7B model achieves state-of-the-art performance across nine visual reasoning benchmarks, surpassing previous similar-sized VLMs and even GPT-4o and Gemini-1.5 Flash.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yuheng Zha , Kun Zhou , Yujia Wu , Yushu Wang , Jie Feng , Zhi Xu , Shibo Hao , Zhengzhong Liu , Eric P. Xing , Zhiting Hu

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Learning Types > Reinforcement Learning

Keywords

reinforcement learning visual reasoning vision-language model multimodal reasoning verifiable reward

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026