Learning Deep Transformer Models for Machine Translation

Qiang Wang; Bei Li; Tong Xiao; Jingbo Zhu; Changliang Li; Derek F. Wong; Lidia S. Chao

ACL 2019

Abstract

Transformer is the state-of-the-art model in recent machine translation evaluations. Two strands of research are promising to improve models of this kind: the first uses wide networks (a.k.a. Transformer-Big) and has been the de facto standard for development of the Transformer system, and the other uses deeper language representation but faces the difficulty arising from learning deep networks. Here, we continue the line of research on the latter. We claim that a truly deep Transformer model can surpass the Transformer-Big counterpart by 1) proper use of layer normalization and 2) a novel way of passing the combination of previous layers to the next. On WMT’16 English-German and NIST OpenMT’12 Chinese-English tasks, our deep system (30/25-layer encoder) outperforms the shallow Transformer-Big/Base baseline (6-layer encoder) by 0.4-2.4 BLEU points. As another bonus, the deep model is 1.6X smaller in size and 3X faster in training than Transformer-Big.
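The two ingredients named in the abstract can be illustrated with a toy sketch. It is not the authors' implementation. It assumes that "proper use of layer normalization" means normalizing the sublayer input rather than the residual sum (pre-norm versus post-norm), and that "passing the combination of previous layers" can be approximated by a weighted sum of all earlier layer outputs. The function names, the uniform weights, and `toy_sublayer` are hypothetical placeholders; a real model would use learned weights and attention/feed-forward sublayers.

```python
import math

def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def post_norm_step(x, sublayer):
    """Post-norm residual: normalize after adding the sublayer output."""
    return layer_norm([a + b for a, b in zip(x, sublayer(x))])

def pre_norm_step(x, sublayer):
    """Pre-norm residual: normalize the sublayer input; the identity path is untouched."""
    return [a + b for a, b in zip(x, sublayer(layer_norm(x)))]

def combine_layers(outputs, weights):
    """Weighted sum of all previous layer outputs, used as input to the next layer."""
    dim = len(outputs[0])
    return [sum(w * y[i] for w, y in zip(weights, outputs)) for i in range(dim)]

def toy_sublayer(x):
    """Hypothetical stand-in for attention/feed-forward: scales the input by 0.5."""
    return [0.5 * v for v in x]

x0 = [1.0, 2.0, 3.0, 4.0]
history = [x0]  # history[0] is the embedding; later entries are layer outputs
for _ in range(3):
    # Assumption: uniform weights for illustration; a trained model would learn them.
    weights = [1.0 / len(history)] * len(history)
    layer_input = combine_layers(history, weights)
    history.append(pre_norm_step(layer_input, toy_sublayer))
```

In the pre-norm variant the input reaches the output through an unnormalized identity path, which is one commonly cited reason deep stacks of such layers are easier to train.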

Topics

Machine Learning > Optimization & Theory > Neural Network Optimization
Deep Learning > Architectures > Transformers
Natural Language Processing > Applications > Machine Translation
Deep Learning > Learning Types > Representation Learning

Keywords

machine translation, deep learning, neural network optimization, layer normalization, deep network, deep transformer, neural network, encoder, decoder

Related papers

What do phone embeddings learn about Phonology? 2019

Unsupervised Morphological Segmentation for Low-Resource Polysynthetic Languages 2019

Understanding Undesirable Word Embedding Associations 2019

Inferential Machine Comprehension: Answering Questions by Recursively Deducing the Evidence Chain from Text 2019

Domain Adaptation of Neural Machine Translation by Lexicon Induction 2019