EMNLP 2022
Do Decoding Algorithms Capture Discourse Structure in Multi-Modal Tasks? A Case Study of Image Paragraph Generation
Abstract
This paper describes insights into how different inference algorithms structure discourse in image paragraphs. We train a multi-modal transformer and compare 11 variations of decoding algorithms. We propose to evaluate image paragraphs not only with standard automatic metrics, but also with a more extensive, “under the hood” analysis of the discourse formed by sentences. Our results show that while decoding algorithms can be unfaithful to the reference texts, they still generate grounded descriptions, but they also lack understanding of the discourse structure and differ from humans in terms of attentional structure over images.
Topics
Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Self-Supervised Learning
Deep Learning > Architectures > Transformers
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Text Generation
Computer Vision > Core AI > Multimodal Learning
Deep Learning > Models > Transformers
Deep Learning > Learning Types > Multi-Modal Learning