A-Cap: Anticipation Captioning With Commonsense Knowledge

Duc Minh Vo; Quoc-An Luong; Akihiro Sugimoto; Hideki Nakayama

2023 CVPR CVPR 2023

A-Cap: Anticipation Captioning With Commonsense Knowledge

Abstract

Humans possess the capacity to reason about the future based on a sparse collection of visual cues acquired over time. In order to emulate this ability, we introduce a novel task called Anticipation Captioning, which generates a caption for an unseen oracle image using a sparsely temporally-ordered set of images. To tackle this new task, we propose a model called A-CAP, which incorporates commonsense knowledge into a pre-trained vision-language model, allowing it to anticipate the caption. Through both qualitative and quantitative evaluations on a customized visual storytelling dataset, A-CAP outperforms other image captioning methods and establishes a strong baseline for anticipation captioning. We also address the challenges inherent in this task.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — anticipation captioning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Duc Minh Vo , Quoc-An Luong , Akihiro Sugimoto , Hideki Nakayama

Topics

Machine Learning > Learning Types > Zero-Shot Learning Computer Vision > Generation > Image Captioning Computer Vision > Processing > Video Understanding Deep Learning > Learning Types > Multi-Modal Learning Deep Learning > Models > Vision-Language Models

Keywords

temporal reasoning commonsense knowledge image captioning visual storytelling vision-language model anticipation captioning

Download PDF

Related papers

CORA: Adapting CLIP for Open-Vocabulary Detection With Region Prompting and Anchor Pre-Matching 2023

3DAvatarGAN: Bridging Domains for Personalized Editable Avatars 2023

Physics-Driven Diffusion Models for Impact Sound Synthesis From Videos 2023

Transductive Few-Shot Learning With Prototype-Based Label Propagation by Iterative Graph Refinement 2023

EXIF As Language: Learning Cross-Modal Associations Between Images and Camera Metadata 2023