2019
EMNLP
EMNLP 2019
Investigating the Effectiveness of BPE: The Power of Shorter Sequences
Abstract
AbstractByte-Pair Encoding (BPE) is an unsupervised sub-word tokenization technique, commonly used in neural machine translation and other NLP tasks. Its effectiveness makes it a de facto standard, but the reasons for this are not well understood. We link BPE to the broader family of dictionary-based compression algorithms and compare it with other members of this family. Our experiments across datasets, language pairs, translation models, and vocabulary size show that - given a fixed vocabulary size budget - the fewer tokens an algorithm needs to cover the test set, the better the translation (as measured by BLEU).
🌉
Interdisciplinary Bridge
— Deep Learning and Natural Language Processing
🧭
Keyword Pioneer
— compression algorithm
🐝
Cross-Pollinator
— Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio