2022
NAACL
All You May Need for VQA are Image Captions
Abstract
Visual Question Answering (VQA) has benefited from increasingly sophisticated models, but has not enjoyed the same level of engagement in terms of data creation. In this paper, we propose a method that automatically derives VQA examples at volume, by leveraging the abundance of existing image-caption annotations combined with neural models for textual question generation. We show that the resulting data is of high quality. VQA models trained on our data improve state-of-the-art zero-shot accuracy by double digits and achieve a level of robustness that is lacking in the same model trained on human-annotated VQA data.
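The data-derivation idea in the abstract (caption in, question-answer pairs out) can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not specify how answers are selected or how the neural question generator is built, so `candidate_answers`, `cloze_question`, `derive_vqa_examples`, and the `vocabulary` argument are all hypothetical names, and a rule-based cloze template stands in for the neural question generation model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class VQAExample:
    image_id: str   # the image the caption describes
    question: str   # question derived from the caption text
    answer: str     # answer span taken from the caption


def candidate_answers(caption: str, vocabulary: List[str]) -> List[str]:
    """Hypothetical answer picker: keep vocabulary words found in the caption."""
    tokens = caption.lower().rstrip(".").split()
    return [word for word in vocabulary if word in tokens]


def cloze_question(caption: str, answer: str) -> str:
    """Toy stand-in for a neural question generator: blank out the answer."""
    words = caption.rstrip(".").split()
    masked = ["____" if w.lower() == answer else w for w in words]
    return "Fill in the blank: " + " ".join(masked) + "?"


def derive_vqa_examples(
    image_id: str,
    caption: str,
    vocabulary: List[str],
    generate: Callable[[str, str], str] = cloze_question,
) -> List[VQAExample]:
    """Turn one image-caption pair into (image, question, answer) triples."""
    return [
        VQAExample(image_id, generate(caption, answer), answer)
        for answer in candidate_answers(caption, vocabulary)
    ]


examples = derive_vqa_examples(
    "img_001",
    "A brown dog catches a frisbee.",
    ["dog", "frisbee", "cat"],
)
for example in examples:
    print(example)
```

The `generate` parameter is the point where a learned text-to-question model would presumably be plugged in; everything else only needs the caption text, which is why existing caption corpora can be reused without new human annotation.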
Topics
Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Text Generation
Natural Language Processing > Applications > Question Answering
Deep Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Applications > Visual Question Answering