2021 IJCAI IJCAI 2021

Improving Context-Aware Neural Machine Translation with Source-side Monolingual Documents

Abstract

Document context-aware machine translation remains challenging due to the lack of large-scale document parallel corpora. To make full use of source-side monolingual documents for context-aware NMT, we propose a Pre-training approach with Global Context (PGC). In particular, we first propose a novel self-supervised pre-training task, which contains two training objectives: (1) reconstructing the original sentence from a corrupted version; (2) generating a gap sentence from its left and right neighbouring sentences. Then we design a universal model for PGC which consists of a global context encoder, a sentence encoder and a decoder, with similar architecture to typical context-aware NMT models. We evaluate the effectiveness and generality of our pre-trained PGC model by adapting it to various downstream context-aware NMT models. Detailed experimentation on four different translation tasks demonstrates that our PGC approach significantly improves the translation performance of context-aware NMT. For example, based on the state-of-the-art SAN model, we achieve an averaged improvement of 1.85 BLEU scores and 1.59 Meteor scores on the four translation tasks.

🌉 Interdisciplinary Bridge — Machine Learning and Natural Language Processing
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio