Can Large Vision-Language Models Correct Semantic Grounding Errors By Themselves?

Yuan-Hong Liao; Rafid Mahmood; Sanja Fidler; David Acuna

CVPR 2025

Abstract

Improving semantic grounding in Vision-Language Models (VLMs) often involves collecting domain-specific training data, refining the network architectures, or modifying the training recipes. In this work, we venture into an orthogonal direction and explore self-correction in VLMs focusing on semantic grounding. We find that VLMs can correct their own semantic grounding mistakes when properly prompted and framed for the task, without any fine-tuning or even access to oracle feedback. We also introduce a self-correction framework in an iterative setting which consistently improves performance across all models investigated. Overall, we show that iterative self-correction consistently improves VLM performance in semantic grounding by up to 8.4 accuracy points across all models investigated, without requiring fine-tuning, additional architectural changes, or external data. Our exploration of self-correction also reveals that, even after several rounds of feedback, strong models like GPT-4V and GPT-4o retain limited capability in leveraging oracle feedback, suggesting promising directions for further research.
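The iterative setting described in the abstract can be pictured as a loop in which the model predicts, checks its own prediction, and predicts again when the check fails. The following is a minimal sketch under stated assumptions, not the authors' implementation: `self_correct`, `predict`, `verify`, and `max_rounds` are hypothetical names introduced here, the feedback is assumed to be a binary accept/reject signal, and the model calls are replaced by toy stand-ins so the example runs without any VLM.

```python
from typing import Callable, List


def self_correct(
    predict: Callable[[str, List[str]], str],
    verify: Callable[[str, str], bool],
    query: str,
    max_rounds: int = 3,
) -> str:
    """Re-ask a model until its own verifier accepts the answer.

    `predict` and `verify` are hypothetical stand-ins for two prompts
    sent to the same model; no external labels or fine-tuning are used.
    """
    rejected: List[str] = []            # answers the verifier has turned down
    answer = predict(query, rejected)   # initial prediction, no feedback yet
    for _ in range(max_rounds):
        if verify(query, answer):       # assumed binary self-generated feedback
            return answer
        rejected.append(answer)
        answer = predict(query, rejected)  # new prediction given the feedback
    return answer                       # round budget exhausted


# Toy stand-ins for the model calls (assumptions for illustration only).
def toy_predict(query: str, rejected: List[str]) -> str:
    candidates = ["cat", "dog", "horse"]
    for candidate in candidates:
        if candidate not in rejected:
            return candidate
    return candidates[-1]


def toy_verify(query: str, answer: str) -> bool:
    return answer == "dog"


if __name__ == "__main__":
    print(self_correct(toy_predict, toy_verify, "What is in the marked region?"))
```

With these toy stand-ins the first prediction is rejected and the second is accepted. In a real setting both callables would wrap prompts to a VLM, and the quality of the verification step would bound how much the loop can help.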

Topics

Artificial Intelligence > Core AI > AI Safety
Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Reasoning
Computer Vision > Core AI > Computer Vision

Keywords

iterative refinement, vision-language model, error correction, semantic grounding, iterative self-correction, zero-shot correction
