Efficient Machine Translation with Model Pruning and Quantization

Maximiliana Behnke; Nikolay Bogoychev; Alham Fikri Aji; Kenneth Heafield; Graeme Nail; Qianqian Zhu; Svetlana Tchistiakova; Jelmer van der Linde; Pinzhen Chen; Sidharth Kashyap; Roman Grundkiewicz

EMNLP 2021

Abstract

We participated in all tracks of the WMT 2021 efficient machine translation task: single-core CPU, multi-core CPU, and GPU hardware with throughput and latency conditions. Our submissions combine several efficiency strategies: knowledge distillation, a simpler simple recurrent unit (SSRU) decoder with one or two layers, lexical shortlists, smaller numerical formats, and pruning. For the CPU track, we used quantized 8-bit models. For the GPU track, we experimented with FP16 and 8-bit integers in Tensor Cores. Some of our submissions optimize for size via 4-bit log quantization and omitting a lexical shortlist. We have extended pruning to more parts of the network, emphasizing component- and block-level pruning that, unlike coefficient-wise pruning, actually improves speed.
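To make two of the abstract's ideas concrete, the following is a minimal sketch, not the authors' implementation. It assumes a symmetric per-tensor scheme for 8-bit quantization and uses column removal as a simple stand-in for component- and block-level pruning. All function names and the toy matrix `W` are hypothetical. The point it illustrates is that coefficient-wise pruning leaves the matrix shape unchanged, so a dense multiply does the same work, whereas removing whole columns yields a genuinely smaller matrix.

```python
# Illustrative sketch only: function names and the toy matrix are assumptions,
# not taken from the paper or from any particular toolkit.

def quantize_int8(weights):
    """Symmetric per-tensor quantization: w is approximated by scale * q, q in [-127, 127]."""
    max_abs = max(abs(w) for row in weights for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = [[max(-127, min(127, round(w / scale))) for w in row] for row in weights]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [[scale * v for v in row] for row in q]


def prune_coefficients(weights, threshold):
    """Coefficient-wise pruning: zero out small entries; the shape is unchanged."""
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_columns(weights, keep):
    """Structured pruning: physically remove all but the `keep` largest-norm columns."""
    n_cols = len(weights[0])
    norms = [sum(row[j] ** 2 for row in weights) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: norms[j], reverse=True)
    kept = sorted(ranked[:keep])  # preserve the original column order
    return [[row[j] for j in kept] for row in weights], kept


def shape(matrix):
    return (len(matrix), len(matrix[0]))


W = [
    [0.50, -0.01, 0.30, 0.02],
    [-0.40, 0.03, 0.25, -0.01],
    [0.10, 0.02, -0.60, 0.00],
]

q, scale = quantize_int8(W)
W_hat = dequantize(q, scale)

sparse = prune_coefficients(W, threshold=0.05)
small, kept_columns = prune_columns(W, keep=2)

print("int8 codes:", q)
print("coefficient-pruned shape:", shape(sparse))
print("column-pruned shape:", shape(small), "kept columns:", kept_columns)
```

In a real model, the layer that consumes the pruned output would need its matching rows removed as well. That step is omitted here to keep the sketch short.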

Topics

Natural Language Processing > Applications > Machine Translation
Machine Learning > Application Areas > Model Compression
Natural Language Processing > Generation > Machine Translation
Machine Learning > Learning Types > Knowledge Distillation
Deep Learning > Models > Transformers
Deep Learning > Optimization & Theory > Efficient Computing

Keywords

transformer architecture, model quantization, knowledge distillation, machine translation, model pruning, throughput optimization, inference optimization, latency optimization, efficient translation, transformer model, efficient machine translation
