Wine is not v i n. On the Compatibility of Tokenizations across Languages

Antonis Maronikolakis; Philipp Dufter; Hinrich Schütze

2021 EMNLP EMNLP 2021

Wine is not v i n. On the Compatibility of Tokenizations across Languages

Abstract

AbstractThe size of the vocabulary is a central design choice in large pretrained language models, with respect to both performance and memory requirements. Typically, subword tokenization algorithms such as byte pair encoding and WordPiece are used. In this work, we investigate the compatibility of tokenizations for multilingual static and contextualized embedding spaces and propose a measure that reflects the compatibility of tokenizations across languages. Our goal is to prevent incompatible tokenizations, e.g., “wine” (word-level) in English vs. “v i n” (character-level) in French, which make it hard to learn good multilingual semantic representations. We show that our compatibility measure allows the system designer to create vocabularies across languages that are compatible – a desideratum that so far has been neglected in multilingual models.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

📈 Trend Setter — Foundation Models

🧭 Keyword Pioneer — multilingual tokenization

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio

Authors

Antonis Maronikolakis , Philipp Dufter , Hinrich Schütze

Topics

Machine Learning > Core Methods > Embedding Learning Natural Language Processing > Resources & Methods > Large Language Models Natural Language Processing > Resources & Methods > Multilingual NLP Artificial Intelligence > Core AI > Language Deep Learning > Models > Foundation Models Deep Learning > Learning Types > Multi-Lingual Learning

Keywords

subword tokenization multilingual embedding byte pair encoding multilingual tokenization

Download PDF

Related papers

Continual Learning in Multilingual NMT via Language-Specific Embeddings 2021

MultiDoc2Dial: Modeling Dialogues Grounded in Multiple Documents 2021

Efficient Multi-Task Auxiliary Learning: Selecting Auxiliary Data by Feature Similarity 2021

Neural Machine Translation with Heterogeneous Topic Knowledge Embeddings 2021

Semantics-Preserved Data Augmentation for Aspect-Based Sentiment Analysis 2021