Query-based Image Captioning from Multi-context 360-degree Images

Koki Maeda; Shuhei Kurita; Taiki Miyanishi; Naoaki Okazaki

EMNLP 2023

Abstract

A 360-degree image captures the entire scene without the limitations of a camera’s field of view, which makes it difficult to describe all of its contexts in a single caption. We propose a novel task called Query-based Image Captioning (QuIC) for 360-degree images, where a query (words or short phrases) specifies the context to describe. This task is more challenging than conventional image captioning, which describes the salient objects in an image, because it requires fine-grained scene understanding to select content consistent with the user’s intent as expressed by the query. We construct a dataset for the new task that comprises 3,940 360-degree images and 18,459 manually annotated pairs of queries and captions. Experiments demonstrate that image captioning models fine-tuned further on our dataset generate more diverse and controllable captions from the multiple contexts of 360-degree images.

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Scene Understanding
Computer Vision > Generation > Image Captioning
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

scene understanding, multimodal learning, image captioning, multi-modal learning, 360-degree image, query-based captioning, fine-grained scene understanding
