ERNIE-ViL: Knowledge Enhanced Vision-Language Representations through Scene Graphs

Fei Yu; Jiji Tang; Weichong Yin; Yu Sun; Hao Tian; Hua Wu; Haifeng Wang

AAAI 2021

Abstract

We propose a knowledge-enhanced approach, ERNIE-ViL, which incorporates structured knowledge obtained from scene graphs to learn joint vision-language representations. ERNIE-ViL aims to build detailed semantic connections (objects, attributes of objects, and relationships between objects) across vision and language, which are essential to vision-language cross-modal tasks. Using scene graphs of visual scenes, ERNIE-ViL constructs Scene Graph Prediction tasks in the pre-training phase: Object Prediction, Attribute Prediction, and Relationship Prediction. Specifically, these tasks are implemented by predicting nodes of different types in the scene graph parsed from the sentence. As a result, ERNIE-ViL learns joint representations that characterize the alignment of detailed semantics across vision and language. After pre-training on large-scale image-text aligned datasets, we validate the effectiveness of ERNIE-ViL on 5 cross-modal downstream tasks. ERNIE-ViL achieves state-of-the-art performance on all of these tasks and ranks first on the VCR leaderboard with an absolute improvement of 3.7%.
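The idea of predicting scene graph nodes by type can be illustrated with a toy sketch. It is not the authors' implementation: the scene graph below is written by hand rather than produced by a parser, the names `SceneGraph`, `node_tokens`, and `mask_nodes` are hypothetical, and every node of the chosen type is masked for determinism, whereas a real pre-training setup would presumably sample a subset and predict the masked tokens from both the text and the image.

```python
from dataclasses import dataclass

MASK = "[MASK]"  # placeholder token; an assumption for this sketch


@dataclass
class SceneGraph:
    """Hand-written stand-in for a scene graph parsed from a sentence."""
    objects: list          # object words, e.g. "dog"
    attributes: list       # (object, attribute) pairs
    relationships: list    # (subject, relation, object) triples


def node_tokens(graph, node_type):
    """Return the set of words that act as nodes of the given type."""
    if node_type == "object":
        return set(graph.objects)
    if node_type == "attribute":
        return {attr for _, attr in graph.attributes}
    if node_type == "relationship":
        return {rel for _, rel, _ in graph.relationships}
    raise ValueError(f"unknown node type: {node_type}")


def mask_nodes(tokens, graph, node_type):
    """Mask every token that is a node of `node_type`.

    Returns the masked token list and a dict mapping each masked
    position to the original token (the prediction target).
    """
    targets = node_tokens(graph, node_type)
    masked = [MASK if tok in targets else tok for tok in tokens]
    labels = {i: tok for i, tok in enumerate(tokens) if tok in targets}
    return masked, labels


tokens = ["a", "brown", "dog", "on", "the", "grass"]
graph = SceneGraph(
    objects=["dog", "grass"],
    attributes=[("dog", "brown")],
    relationships=[("dog", "on", "grass")],
)

for node_type in ("object", "attribute", "relationship"):
    masked, labels = mask_nodes(tokens, graph, node_type)
    print(node_type, masked, labels)
```

Each node type yields a different set of prediction targets from the same sentence, which is what distinguishes the three tasks from uniform random token masking.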

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Techniques > Pretraining
Knowledge & Reasoning > Representation > Knowledge Graphs
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Models > Vision-Language Models

Keywords

scene graph; vision language model; knowledge enhancement; visual commonsense reasoning; vision-language representation learning; cross modal; scene graph prediction; knowledge enhanced
