Altogether: Image Captioning via Re-aligning Alt-text

Hu Xu; Po-Yao Huang; Xiaoqing Tan; Ching-Feng Yeh; Jacob Kahn; Christine Jou; Gargi Ghosh; Omer Levy; Luke Zettlemoyer; Wen-tau Yih; Shang-Wen Li; Saining Xie; Christoph Feichtenhofer

EMNLP 2024

Abstract

This paper focuses on creating synthetic data to improve the quality of image captions. Existing works typically have two shortcomings. First, they caption images from scratch, ignoring existing alt-text metadata; second, they lack transparency when the captioner's training data (e.g., GPT) is unknown. In this paper, we study a principled approach, Altogether, based on the key idea of editing and re-aligning the existing alt-texts associated with images. To generate training data, we perform human annotation in which annotators start with the existing alt-text and re-align it to the image content over multiple rounds, thereby constructing captions with rich visual concepts. This differs from prior work, which carries out human annotation as a one-time description task based solely on images and annotator knowledge. We train a captioner on this data that generalizes the process of re-aligning alt-texts at scale. Our results show that the Altogether approach leads to richer image captions that also improve text-to-image generation and zero-shot image classification.

Topics

Machine Learning > Application Areas > Data Augmentation
Computer Vision > Generation > Image Captioning
Natural Language Processing > Generation > Text Generation
Deep Learning > Learning Types > Self-Supervised Learning
Artificial Intelligence > Core AI > Multi-Modal Learning
Deep Learning > Learning Types > Synthetic Data Generation

Keywords

zero-shot learning, image captioning, text-to-image generation, synthetic data generation, visual concept, zero-shot classification
