LIRA: Inferring Segmentation in Large Multi-modal Models with Local Interleaved Region Assistance

Zhang Li; Biao Yang; Qiang Liu; Shuo Zhang; Zhiyin Ma; Liang Yin; Linger Deng; Yabo Sun; Yuliang Liu; Xiang Bai

ICCV 2025

Abstract

While large multi-modal models (LMMs) demonstrate promising capabilities in segmentation and comprehension, they still struggle with two limitations: inaccurate segmentation and hallucinated comprehension. These challenges stem primarily from weak visual comprehension and a lack of fine-grained perception. To alleviate these limitations, we propose LIRA, a framework that capitalizes on the complementary relationship between visual comprehension and segmentation via two key components: (1) the Semantic-Enhanced Feature Extractor (SEFE), which improves object attribute inference by fusing semantic and pixel-level features, leading to more accurate segmentation; and (2) Interleaved Local Visual Coupling (ILVC), which autoregressively generates local descriptions after extracting local features based on segmentation masks, offering fine-grained supervision to mitigate hallucinations. Furthermore, we find that the precision of object segmentation is positively correlated with the latent related semantics of the <seg> token. To quantify this relationship and the model's potential semantic inference ability, we introduce the Attributes Evaluation (AttrEval) dataset. Our experiments show that LIRA achieves state-of-the-art performance in both segmentation and comprehension tasks. Code will be available at https://github.com/echo840/LIRA.
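
The abstract says that ILVC extracts local features based on segmentation masks before generating local descriptions, but it does not state which operator is used. As a rough illustration only, the sketch below assumes masked average pooling over a toy feature map. The function name `masked_average_pool`, the H x W x C list layout, and the zero-vector fallback for an empty mask are all assumptions made for this example and are not taken from the LIRA code.

```python
from typing import List


def masked_average_pool(
    features: List[List[List[float]]], mask: List[List[int]]
) -> List[float]:
    """Average the feature vectors at positions where the mask is non-zero.

    features: H x W x C nested list (hypothetical layout for this sketch).
    mask:     H x W binary segmentation mask.
    Returns a C-dimensional local feature vector.
    """
    height = len(features)
    width = len(features[0])
    channels = len(features[0][0])
    if len(mask) != height or any(len(row) != width for row in mask):
        raise ValueError("mask and feature map must share the same spatial size")

    total = [0.0] * channels
    count = 0
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                count += 1
                for c in range(channels):
                    total[c] += features[y][x][c]

    # Assumption: an empty mask yields a zero vector rather than an error.
    if count == 0:
        return [0.0] * channels
    return [value / count for value in total]


# Toy 2 x 2 feature map with 2 channels; the mask selects the right column.
feature_map = [
    [[1.0, 0.0], [3.0, 2.0]],
    [[5.0, 4.0], [7.0, 6.0]],
]
object_mask = [
    [0, 1],
    [0, 1],
]
local_feature = masked_average_pool(feature_map, object_mask)
print(local_feature)
```

In a setup like the one the abstract describes, a pooled vector of this kind would presumably be passed back to the language model as the conditioning signal for the local description of the masked region; how LIRA actually does this is not specified here.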

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Processing > Image Segmentation
Machine Learning > Learning Types > Multi-Modal Learning

Keywords

semantic segmentation, image segmentation, multimodal learning, hallucination mitigation, visual comprehension, large multi-modal model, fine-grained perception

Related papers

MA-CIR: A Multimodal Arithmetic Benchmark for Composed Image Retrieval 2025

SimMLM: A Simple Framework for Multi-modal Learning with Missing Modality 2025

MonSTeR: a Unified Model for Motion, Scene, Text Retrieval 2025

ASGS: Single-Domain Generalizable Open-Set Object Detection via Adaptive Subgraph Searching 2025

Robust Dataset Condensation using Supervised Contrastive Learning 2025