Generalized Decoding for Pixel, Image, and Language

Xueyan Zou; Zi-Yi Dou; Jianwei Yang; Zhe Gan; Linjie Li; Chunyuan Li; Xiyang Dai; Harkirat Behl; Jianfeng Wang; Lu Yuan; Nanyun Peng; Lijuan Wang; Yong Jae Lee; Jianfeng Gao

CVPR 2023

Abstract

We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixed set of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition. Code, demo, video and visualization are available at: https://x-decoder-vl.github.io.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Computer Vision > Generation > Image Captioning; Computer Vision > Processing > Image Segmentation; Natural Language Processing > Generation > Text Generation; Computer Vision > Processing > Semantic Segmentation

Keywords

zero-shot learning; semantic segmentation; image segmentation; referring expression; open-vocabulary segmentation; vision-language model; semantic queries; referring segmentation
