Visual Prediction Improves Zero-Shot Cross-Modal Machine Translation

Tosho Hirasawa; Emanuele Bugliarello; Desmond Elliott; Mamoru Komachi

EMNLP 2023

Abstract

Multimodal machine translation (MMT) systems have been successfully developed in recent years for a few language pairs. However, training such models usually requires tuples of a source language text, target language text, and images. Obtaining these data involves expensive human annotations, making it difficult to develop models for unseen text-only language pairs. In this work, we propose the task of zero-shot cross-modal machine translation aiming to transfer multimodal knowledge from an existing multimodal parallel corpus into a new translation direction. We also introduce a novel MMT model with a visual prediction network to learn visual features grounded on multimodal parallel data and provide pseudo-features for text-only language pairs. With this training paradigm, our MMT model outperforms its text-only counterpart. In our extensive analyses, we show that (i) the selection of visual features is important, and (ii) training on image-aware translations and being grounded on a similar language pair are mandatory.
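
The idea of a visual prediction network that supplies pseudo-features when no image is available can be illustrated with a toy sketch. This is not the paper's architecture: the abstract does not specify the predictor, the loss, or the feature type, so the linear predictor, the squared-error objective, and every name below (`predict_visual`, `train_predictor`, `visual_input`, `true_map`) are assumptions made purely for illustration.

```python
import random


def predict_visual(weights, text_vec):
    """Linear visual prediction: one output value per row of `weights`."""
    return [sum(w * x for w, x in zip(row, text_vec)) for row in weights]


def train_predictor(pairs, text_dim, vis_dim, lr=0.05, epochs=200):
    """Fit the predictor by SGD on squared error against real image features.

    Assumption: a linear map and a squared-error loss stand in for whatever
    predictor and objective the actual model uses.
    """
    weights = [[0.0] * text_dim for _ in range(vis_dim)]
    for _ in range(epochs):
        for text_vec, image_vec in pairs:
            pred = predict_visual(weights, text_vec)
            for i in range(vis_dim):
                err = pred[i] - image_vec[i]
                for j in range(text_dim):
                    weights[i][j] -= lr * err * text_vec[j]
    return weights


def visual_input(weights, text_vec, image_vec=None):
    """Use the real image feature if present, else a predicted pseudo-feature."""
    if image_vec is not None:
        return image_vec
    return predict_visual(weights, text_vec)


rng = random.Random(0)

# Toy "multimodal parallel data": text vectors paired with image features
# generated from a hidden linear map (hypothetical, noise-free).
true_map = [[0.5, -1.0, 0.25], [1.5, 0.0, -0.75]]
texts = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(20)]
multimodal_pairs = [(t, predict_visual(true_map, t)) for t in texts]

weights = train_predictor(multimodal_pairs, text_dim=3, vis_dim=2)

# Toy "text-only language pair": no image, so a pseudo-feature is predicted.
text_only_vec = [0.2, -0.4, 0.6]
pseudo_feature = visual_input(weights, text_only_vec)
```

In this sketch the predictor is trained only where real image features exist, and the same `visual_input` call serves both settings, so a downstream translation model would always receive a visual vector of the same shape.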

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Artificial Intelligence > Learning Paradigms > Zero-Shot Learning
Natural Language Processing > Generation > Machine Translation
Computer Vision > Core AI > Multimodal Learning

Keywords

zero-shot learning, machine translation, multimodal learning, cross-modal learning, visual grounding, cross-modal transfer, visual feature, visual prediction, multimodal machine translation, image-aware translation
