Universal Sentence Representation Learning with Conditional Masked Language Model

Ziyi Yang; Yinfei Yang; Daniel Cer; Jax Law; Eric Darve

EMNLP 2021

Abstract

This paper presents a novel training method, Conditional Masked Language Modeling (CMLM), to effectively learn sentence representations on large-scale unlabeled corpora. CMLM integrates sentence representation learning into MLM training by conditioning on the encoded vectors of adjacent sentences. Our English CMLM model achieves state-of-the-art performance on SentEval, even outperforming models learned using supervised signals. As a fully unsupervised learning method, CMLM can be conveniently extended to a broad range of languages and domains. We find that a multilingual CMLM model co-trained with bitext retrieval (BR) and natural language inference (NLI) tasks outperforms the previous state-of-the-art multilingual models by a large margin, e.g., a 10% improvement over baseline models on cross-lingual semantic search. We explore the same-language bias of the learned representations and propose a simple, post-training, model-agnostic approach that removes language-identifying information from the representations while still retaining sentence semantics.
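To make the conditioning idea concrete, here is a minimal sketch of how an input for conditional masked language modeling might be assembled. The abstract only states that MLM is conditioned on the encoded vectors of adjacent sentences. The rest is an assumption made for illustration: the mean pooling, the single prepended vector, the masking rate, and the names `MASK_ID`, `mask_tokens`, and `build_conditional_input` do not come from the paper. The encoder and the masked-token predictor themselves are omitted.

```python
import numpy as np

MASK_ID = 0  # hypothetical id reserved for the [MASK] token


def mask_tokens(token_ids, mask_prob, rng):
    """Replace a random subset of tokens with MASK_ID; return masked ids and the mask."""
    token_ids = np.asarray(token_ids)
    is_masked = rng.random(token_ids.shape) < mask_prob
    if not is_masked.any():
        # Guarantee at least one prediction target.
        is_masked[rng.integers(len(token_ids))] = True
    masked_ids = np.where(is_masked, MASK_ID, token_ids)
    return masked_ids, is_masked


def build_conditional_input(first_ids, second_ids, embeddings, mask_prob=0.3, seed=0):
    """Assemble the input for predicting masked tokens of the second sentence,
    conditioned on a vector that encodes the adjacent (first) sentence."""
    rng = np.random.default_rng(seed)
    # Sentence vector of the adjacent sentence (mean pooling is an assumption).
    sentence_vec = embeddings[np.asarray(first_ids)].mean(axis=0)
    masked_ids, is_masked = mask_tokens(second_ids, mask_prob, rng)
    # Prepend the sentence vector so a masked-token predictor can attend to it.
    inputs = np.vstack([sentence_vec[None, :], embeddings[masked_ids]])
    return inputs, masked_ids, is_masked


# Toy usage: a 10-token vocabulary with 4-dimensional embeddings.
toy_embeddings = np.random.default_rng(1).normal(size=(10, 4))
inputs, masked_ids, is_masked = build_conditional_input(
    [1, 2, 3], [4, 5, 6, 7], toy_embeddings
)
```

In this sketch the only route by which information about the first sentence reaches the masked-token predictor is the prepended vector, which is what would push that vector to summarize the sentence.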

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Text Representation
Natural Language Processing > Resources & Methods > Transfer Learning
Natural Language Processing > Resources & Methods > Language Modeling
Deep Learning > Learning Types > Self-Supervised Learning
Deep Learning > Models > Transformers
Deep Learning > Learning Types > Representation Learning
Artificial Intelligence > Core AI > Natural Language Processing

Keywords

natural language inference, cross-lingual transfer, masked language model, masked language modeling, sentence representation, multilingual model, cross-lingual retrieval, bitext retrieval, cross-lingual semantic search
