TIME: Text and Image Mutual-Translation Adversarial Networks

Bingchen Liu; Kunpeng Song; Yizhe Zhu; Gerard de Melo; Ahmed Elgammal

AAAI 2021

Abstract

Focusing on text-to-image (T2I) generation, we propose Text and Image Mutual-Translation Adversarial Networks (TIME), a lightweight but effective model that jointly learns a T2I generator G and an image captioning discriminator D under the Generative Adversarial Network framework. While previous methods tackle the T2I problem as a uni-directional task and use pre-trained language models to enforce image-text consistency, TIME requires neither extra modules nor pre-training. We show that the performance of G can be boosted substantially by training it jointly with D as a language model. Specifically, we adopt Transformers to model the cross-modal connections between the image features and word embeddings, and design an annealing conditional hinge loss that dynamically balances the adversarial learning. In our experiments, TIME achieves state-of-the-art (SOTA) performance on the CUB dataset (Inception Score of 4.91 and Fréchet Inception Distance of 14.3), and shows promising performance on the MS-COCO dataset on image captioning and downstream vision-language tasks.
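The abstract names an annealing conditional hinge loss but does not give its form. As a rough illustration only, the sketch below combines the standard GAN hinge loss for a discriminator with a conditional (image-text matching) term whose weight follows a schedule. The linear schedule, the function names, and the way the two terms are combined are all assumptions made for this example; they are not taken from the paper.

```python
def hinge_d_loss(real_scores, fake_scores):
    """Standard GAN hinge loss for a discriminator.

    Real samples are pushed above +1, fake samples below -1.
    """
    real_term = sum(max(0.0, 1.0 - s) for s in real_scores) / len(real_scores)
    fake_term = sum(max(0.0, 1.0 + s) for s in fake_scores) / len(fake_scores)
    return real_term + fake_term


def anneal_weight(step, total_steps):
    """Hypothetical linear schedule from 0 to 1 (an assumption, not the paper's)."""
    return min(1.0, max(0.0, step / total_steps))


def annealed_conditional_d_loss(uncond_real, uncond_fake,
                                cond_real, cond_fake,
                                step, total_steps):
    """Unconditional hinge loss plus an annealed conditional hinge term.

    cond_real / cond_fake are assumed to be discriminator scores for
    matched and mismatched (or generated) image-text pairs.
    """
    w = anneal_weight(step, total_steps)
    return hinge_d_loss(uncond_real, uncond_fake) + w * hinge_d_loss(cond_real, cond_fake)


if __name__ == "__main__":
    loss = annealed_conditional_d_loss(
        uncond_real=[2.0, 0.5], uncond_fake=[-2.0, 0.0],
        cond_real=[2.0, 0.5], cond_fake=[-2.0, 0.0],
        step=50, total_steps=100,
    )
    print(loss)
```

Under this assumed schedule the conditional term contributes nothing at step 0 and counts fully by the final step, which is one simple way a loss could shift its balance during adversarial training.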

Topics

Machine Learning > Learning Types > Adversarial Learning
Computer Vision > Generation > Image Generation
Machine Learning > Learning Types > Multi-Modal Learning
Deep Learning > Models > Transformers
Deep Learning > Learning Types > Generative Models

Keywords

multimodal learning, image captioning, text-to-image generation, generative adversarial network, image-text consistency
