Understanding Guided Image Captioning Performance across Domains

Edwin G. Ng; Bo Pang; Piyush Sharma; Radu Soricut

EMNLP 2021

Abstract

Image captioning models generally lack the capability to take user interest into account, and usually default to global descriptions that try to balance readability, informativeness, and information overload. We present a Transformer-based model that produces captions focused on specific objects, concepts, or actions in an image when these are provided to the model as guiding text. Further, we evaluate the quality of these guided captions when the model is trained on Conceptual Captions, which contains 3.3M image-level captions, compared to Visual Genome, which contains 3.6M object-level captions. Counter-intuitively, we find that guided captions produced by the model trained on Conceptual Captions generalize better on out-of-domain data. Our human-evaluation results indicate that attempting in-the-wild guided image captioning requires access to large, unrestricted-domain training datasets, and that increased style diversity (even without increasing the number of unique tokens) is a key factor for improved performance.

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Application Areas > Domain Generalization
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Text Classification

Keywords

transformer architecture, domain generalization, transfer learning, domain adaptation, image captioning, out-of-domain generalization, out-of-domain data, visual genome, transformer model, guided image captioning, conceptual caption, guided captioning, object-level caption
