Learning to Localize Objects Improves Spatial Reasoning in Visual-LLMs

Kanchana Ranasinghe; Satya Narayan Shukla; Omid Poursaeed; Michael S. Ryoo; Tsung-Yu Lin

CVPR 2024

Abstract

Integration of Large Language Models (LLMs) into visual domain tasks, resulting in visual-LLMs (V-LLMs), has enabled exceptional performance in vision-language tasks, particularly for visual question answering (VQA). However, existing V-LLMs (e.g. BLIP-2, LLaVA) demonstrate weak spatial reasoning and localization awareness. Despite generating highly descriptive and elaborate textual answers, these models fail at simple tasks like distinguishing a left vs. right location. In this work, we explore how image-space coordinate-based instruction fine-tuning objectives could inject spatial awareness into V-LLMs. We discover optimal coordinate representations, data-efficient instruction fine-tuning objectives, and pseudo-data generation strategies that lead to improved spatial awareness in V-LLMs. Additionally, our resulting model improves VQA across image and video domains, reduces undesired hallucination, and generates better contextual object descriptions. Experiments across 5 vision-language tasks involving 14 different datasets establish the clear performance improvements achieved by our proposed framework.

Authors

Kanchana Ranasinghe, Satya Narayan Shukla, Omid Poursaeed, Michael S. Ryoo, Tsung-Yu Lin

Topics

Artificial Intelligence > Core AI > Multimodal Learning

Artificial Intelligence > Core AI > Large Language Models

Natural Language Processing > Applications > Visual Question Answering

Deep Learning > Learning Types > Multi-Modal Learning

Deep Learning > Learning Types > Transfer Learning

Keywords

visual question answering, visual grounding, object localization, instruction fine-tuning, visual-language model, spatial reasoning, visual language model
