TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task

Kaixin Wu; Bojie Hu; Qi Ju

2021 EMNLP EMNLP 2021

TenTrans High-Performance Inference Toolkit for WMT2021 Efficiency Task

Abstract

AbstractThe paper describes the TenTrans’s submissions to the WMT 2021 Efficiency Shared Task. We explore training a variety of smaller compact transformer models using the teacher-student setup. Our model is trained by our self-developed open-source multilingual training platform TenTrans-Py. We also release an open-source high-performance inference toolkit for transformer models and the code is written in C++ completely. All additional optimizations are built on top of the inference engine including attention caching, kernel fusion, early-stop, and several other optimizations. In our submissions, the fastest system can translate more than 22,000 tokens per second with a single Tesla P4 while maintaining 38.36 BLEU on En-De newstest2019. Our trained models and more details are available in TenTrans-Decoding competition examples.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🧭 Keyword Pioneer — compact transformer

🐣 Hot Topic Early Bird — inference optimization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Kaixin Wu , Bojie Hu , Qi Ju

Topics

Machine Learning > Application Areas > Efficient Computing Deep Learning > Architectures > Transformers Natural Language Processing > Applications > Machine Translation Machine Learning > Application Areas > Model Compression Natural Language Processing > Generation > Machine Translation Deep Learning > Models > Transformers Deep Learning > Optimization & Theory > Efficient Computing

Keywords

model compression knowledge distillation machine translation neural machine translation kernel fusion inference optimization transformer model compact transformer attention caching

Download PDF

Related papers

Continual Learning in Multilingual NMT via Language-Specific Embeddings 2021

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents 2021

Efficient Multi-Task Auxiliary Learning: Selecting Auxiliary Data by Feature Similarity 2021

Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings 2021

Semantics-Preserved Data Augmentation for Aspect-Based Sentiment Analysis 2021