OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding

Tianrun Xu; Guanyu Chen; Ye Li; Yuxin Xi; Zeyu Mu; Ruichen Wang; Tianren Zhang; Haichuan Gao; Feng Chen

2025 ICCV ICCV 2025

OURO: A Self-Bootstrapped Framework for Enhancing Multimodal Scene Understanding

Abstract

Multimodal large models have made significant progress, yet fine-grained understanding of complex scenes remains a challenge. High-quality, large-scale vision-language datasets are essential for addressing this issue. However, existing methods often rely on labor-intensive manual annotations or closed-source models with optimal performance, making large-scale data collection costly. To overcome these limitations, we propose a self-bootstrapped training pipeline that leverages the model's own multimodal capabilities to recursively refine its understanding. By decomposing existing multimodal data into localized sub-regions and generating hierarchical scene descriptions and multi-faceted question-answer pairs, we construct a dataset based on 1.4M image-task instances. We further utilize this dataset to train the base model, significantly enhancing its ability to interpret complex visual scenes and perform various vision-related tasks. Our OURO model, fine-tuned on Qwen2-VL-7B-Instruct using LoRA, achieves substantial improvements over both the base model and similarly-sized counterparts across multiple multimodal benchmarks. Our self-bootstrapped training pipeline offers a novel paradigm for the continuous improvement of multimodal models. Code and datasets are available at https://github.com/tinnel123666888/OURO.git.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tianrun Xu , Guanyu Chen , Ye Li , Yuxin Xi , Zeyu Mu , Ruichen Wang , Tianren Zhang , Haichuan Gao , Feng Chen

Topics

Artificial Intelligence > Core AI > Multimodal Learning Artificial Intelligence > Learning Paradigms > Transfer Learning Computer Vision > Analysis > Scene Understanding Artificial Intelligence > Core AI > Large Language Models Deep Learning > Models > Large Language Models

Keywords

transfer learning scene understanding self-supervised learning multimodal learning vision-language model visual language large language model

Download PDF

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025