Improving Image Captioning via Predicting Structured Concepts

Ting Wang; Weidong Chen; Yuanhe Tian; Yan Song; Zhendong Mao

EMNLP 2023

Abstract

Because bridging the semantic gap between images and texts is difficult in image captioning, conventional studies have treated semantic concepts as a bridge between the two modalities and improved captioning performance accordingly. Although these studies obtained promising results on concept prediction, they normally ignore the relationships among concepts, which rely not only on objects in the image but also on word dependencies in the text, and which therefore offer considerable potential for generating better descriptions. In this paper, we propose a structured concept predictor (SCP) to predict concepts and their structures, and we integrate them into captioning, so as to enhance the contribution of visual signals via concepts and to use their relations to distinguish cross-modal semantics for better description generation. In particular, we design weighted graph convolutional networks (W-GCN) to depict concept relations driven by word dependencies and to learn differentiated contributions from these concepts for the subsequent decoding process. Our approach therefore captures potential relations among concepts and learns different concepts discriminatively, which effectively facilitates image captioning with information inherited across modalities. Extensive experiments demonstrate the effectiveness of our approach as well as of each proposed module.
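As a rough illustration of the weighted graph convolution idea mentioned in the abstract, the sketch below propagates concept features over a graph whose edges carry real-valued weights. It is not the authors' W-GCN: the function names, the added self-loops, the row normalization, the ReLU activation, and the toy edge weights are all assumptions made for this example, and the abstract does not specify how the dependency-driven weights are actually computed.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product for small dense matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def weighted_gcn_layer(features: Matrix, edge_weights: Matrix, projection: Matrix) -> Matrix:
    """One hypothetical weighted graph convolution step.

    features:     one row of features per concept node
    edge_weights: non-negative weight for every pair of concepts
                  (assumed here to come from word dependencies)
    projection:   learnable linear map applied after aggregation
    """
    n = len(features)
    # Add self-loops so each concept keeps part of its own representation.
    adj = [[edge_weights[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    # Row-normalize so each node takes a weighted average of its neighbours.
    norm = [[w / sum(row) for w in row] for row in adj]
    aggregated = matmul(norm, features)
    projected = matmul(aggregated, projection)
    # ReLU non-linearity.
    return [[max(0.0, v) for v in row] for row in projected]


# Toy example: three concepts, the first two linked by a dependency edge.
concept_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dependency_weights = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = weighted_gcn_layer(concept_features, dependency_weights, identity)
```

In this toy setting the two linked concepts end up sharing features while the isolated concept is left unchanged, which is the kind of relation-aware mixing a weighted graph layer is meant to provide.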

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Architectures > Graph Neural Networks; Computer Vision > Generation > Image Captioning; Deep Learning > Learning Types > Multi-Modal Learning; Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

image captioning; multi-modal learning; concept prediction; graph convolutional network; graph neural network; visual semantics; structured concept predictor; weighted graph convolutional network; structured concept prediction
