mDAPT: Multilingual Domain Adaptive Pretraining in a Single Model

Rasmus Kær Jørgensen; Mareike Hartmann; Xiang Dai; Desmond Elliott

EMNLP 2021

Abstract

Domain adaptive pretraining, i.e., the continued unsupervised pretraining of a language model on domain-specific text, improves the modelling of text for downstream tasks within the domain. Numerous real-world applications are based on domain-specific text, e.g., working with financial or biomedical documents, and these applications often need to support multiple languages. However, large-scale domain-specific multilingual pretraining data for such scenarios can be difficult to obtain, due to regulations, legislation, or simply a lack of language- and domain-specific text. One solution is to train a single multilingual model, taking advantage of the data available in as many languages as possible. In this work, we explore the benefits of domain adaptive pretraining with a focus on adapting to multiple languages within a specific domain. We propose different techniques to compose pretraining corpora that enable a language model to become both domain-specific and multilingual. Evaluation on nine domain-specific datasets (for biomedical named entity recognition and financial sentence classification) covering seven different languages shows that a single multilingual domain-specific model can outperform the general multilingual model and perform close to its monolingual counterpart. This finding holds across two different pretraining methods: adapter-based pretraining and full model pretraining.
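To make the idea of composing a pretraining corpus concrete, the following is a minimal sketch, not the authors' procedure. It assumes one plausible composition rule: take the domain-specific documents available in each language and, where a language falls short of a target size, top it up with general-domain text in that language. The function name `compose_corpus`, the `target_per_language` parameter, and the toy documents are all hypothetical and chosen only for illustration.

```python
import random


def compose_corpus(domain_docs, general_docs, target_per_language, seed=0):
    """Build one multilingual corpus from per-language document lists.

    domain_docs / general_docs map a language code to a list of documents.
    Assumption (not from the paper): languages with too little in-domain
    text are topped up with general-domain text in the same language.
    """
    rng = random.Random(seed)
    corpus = []
    for lang in sorted(domain_docs):
        # Prefer in-domain text, up to the target size.
        in_domain = list(domain_docs[lang])
        rng.shuffle(in_domain)
        chosen = in_domain[:target_per_language]

        # Fill any shortfall with general-domain text, if available.
        shortfall = target_per_language - len(chosen)
        if shortfall > 0:
            pool = list(general_docs.get(lang, []))
            rng.shuffle(pool)
            chosen += pool[:shortfall]

        corpus.extend((lang, doc) for doc in chosen)

    # Interleave languages so training batches are not monolingual.
    rng.shuffle(corpus)
    return corpus


# Toy data: English has enough financial text, Danish does not.
domain = {"en": ["en fin 1", "en fin 2", "en fin 3"], "da": ["da fin 1"]}
general = {"da": ["da wiki 1", "da wiki 2", "da wiki 3"]}
corpus = compose_corpus(domain, general, target_per_language=3)
```

In this toy run each language contributes three documents; the Danish share consists of its single in-domain document plus two general-domain ones. The resulting list of `(language, document)` pairs could then be fed to any continued-pretraining loop, whether it updates the full model or only adapter parameters.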

Topics

Machine Learning > Application Areas > Domain Adaptation
Natural Language Processing > Resources & Methods > Multilingual NLP
Natural Language Processing > Applications > Named Entity Recognition
Deep Learning > Learning Types > Transfer Learning
Machine Learning > Learning Types > Multi-Lingual Learning

Keywords

text classification; named entity recognition; language model; sentence classification; multilingual model; multilingual domain adaptive pretraining; adapter-based pretraining; domain adaptive pretraining
