Improving Zero-Shot Phrase Grounding via Reasoning on External Knowledge and Spatial Relations

Zhan Shi; Yilin Shen; Hongxia Jin; Xiaodan Zhu

AAAI 2022

Abstract

Phrase grounding is a multi-modal problem that localizes a particular noun phrase in an image referred to by a text query. In the challenging zero-shot phrase grounding setting, existing state-of-the-art grounding models have limited capacity to handle unseen phrases. Humans, however, can ground novel types of objects in images with little effort, benefiting significantly from commonsense reasoning. In this paper, we design a novel phrase grounding architecture that builds multi-modal knowledge graphs using external knowledge and then performs graph reasoning and spatial relation reasoning to localize the referred noun phrases. We perform extensive experiments on different zero-shot grounding splits sub-sampled from the Flickr30K Entities and Visual Genome datasets, demonstrating that the proposed framework is orthogonal to backbone image encoders and outperforms the baselines by 2 to 3% in accuracy, a significant improvement under the standard evaluation metrics.

Topics

Deep Learning > Architectures > Graph Neural Networks; Knowledge & Reasoning > Reasoning > Causal Inference; Artificial Intelligence > Core AI > Reasoning; Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

zero-shot learning; multi-modal learning; knowledge graph; spatial reasoning; commonsense reasoning; spatial relation; phrase grounding
