Exploring the Effect of Primitives for Compositional Generalization in Vision-and-Language

Chuanhao Li; Zhen Li; Chenchen Jing; Yunde Jia; Yuwei Wu

CVPR 2023

Abstract

Compositionality is one of the fundamental properties of human cognition (Fodor & Pylyshyn, 1988). Compositional generalization is critical for simulating the compositional capability of humans, and has received much attention in the vision-and-language (V&L) community. To improve compositional generalization, it is essential to understand the effect of primitives, including words, image regions, and video frames. In this paper, we explore the effect of primitives for compositional generalization in V&L. Specifically, we present a framework based on self-supervised learning that equips V&L methods with two characteristics: semantic equivariance and semantic invariance. With these two characteristics, the methods understand primitives by perceiving the effect of primitive changes on sample semantics and ground truth. Experimental results on two tasks, temporal video grounding and visual question answering, demonstrate the effectiveness of our framework.

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Processing > Video Understanding
Natural Language Processing > Generation > Language Modeling
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

visual question answering, self-supervised learning, compositional generalization, video grounding, semantic equivariance
