PixelLM: Pixel Reasoning with Large Multimodal Model

Zhongwei Ren; Zhicheng Huang; Yunchao Wei; Yao Zhao; Dongmei Fu; Jiashi Feng; Xiaojie Jin

2024 CVPR CVPR 2024

PixelLM: Pixel Reasoning with Large Multimodal Model

Abstract

While large multimodal models (LMMs) have achieved remarkable progress generating pixel-level masks for image reasoning tasks involving multiple open-world targets remains a challenge. To bridge this gap we introduce PixelLM an effective and efficient LMM for pixel-level reasoning and understanding. Central to PixelLM are a novel lightweight pixel decoder and a comprehensive segmentation codebook. The decoder efficiently produces masks from the hidden embeddings of the codebook tokens which encode detailed target-relevant information. With this design PixelLM harmonizes with the structure of popular LMMs and avoids the need for additional costly segmentation models. Furthermore we propose a token fusion method to enhance the model's ability to differentiate between multiple targets leading to substantially improved mask quality. To advance research in this area we construct MUSE a high-quality multi-target reasoning segmentation benchmark. PixelLM excels across various pixel-level image reasoning and understanding tasks outperforming well-established methods in multiple benchmarks including MUSE and multi-referring segmentation. Comprehensive ablations confirm the efficacy of each proposed component. All code models and datasets will be publicly available.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐣 Hot Topic Early Bird — image understanding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Zhongwei Ren , Zhicheng Huang , Yunchao Wei , Yao Zhao , Dongmei Fu , Jiashi Feng , Xiaojie Jin

Topics

Artificial Intelligence > Core AI > Multimodal Learning Machine Learning > Core Methods > Representation Learning Deep Learning > Architectures > Transformers Deep Learning > Models > Generative Models Computer Vision > Processing > Semantic Segmentation

Keywords

semantic segmentation image segmentation large multimodal model image understanding mask generation visual understanding pixel-level reasoning

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024