Distributional Properties of Subword Regularization

Marco Cognetta; Vilém Zouhar; Naoaki Okazaki

EMNLP 2024


Abstract

Subword regularization, used widely in NLP, improves model performance by reducing the dependency on exact tokenizations, augmenting the training corpus, and exposing the model to more unique contexts during training. BPE and MaxMatch, two popular subword tokenization schemes, have stochastic dropout regularization variants. However, there has not been an analysis of the distributions formed by them. We show that these stochastic variants are heavily biased towards a small set of tokenizations per word. If the benefits of subword regularization are as mentioned, we hypothesize that biasedness artificially limits the effectiveness of these schemes. Thus, we propose an algorithm to uniformly sample tokenizations that we use as a drop-in replacement for the stochastic aspects of existing tokenizers, and find that it improves machine translation quality.
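The abstract does not describe how the uniform sampler works, so the following is only an illustrative sketch of one way to draw a tokenization of a word uniformly at random. It is not the authors' algorithm. It assumes a tokenization is any segmentation of the word into tokens from a fixed vocabulary; the function names and the toy vocabulary are hypothetical. The sketch counts the segmentations of every suffix by dynamic programming, then samples each token boundary with probability proportional to the number of completions it leaves open, so that every full tokenization has probability 1 / (total count).

```python
import random


def count_suffix_tokenizations(word, vocab):
    """Return counts where counts[i] is the number of ways to split word[i:] into vocab tokens."""
    n = len(word)
    counts = [0] * (n + 1)
    counts[n] = 1  # the empty suffix has exactly one (empty) segmentation
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, n + 1):
            if word[i:j] in vocab:
                counts[i] += counts[j]
    return counts


def sample_uniform_tokenization(word, vocab, rng=random):
    """Sample one segmentation of `word` uniformly from all segmentations over `vocab`."""
    counts = count_suffix_tokenizations(word, vocab)
    if counts[0] == 0:
        raise ValueError(f"{word!r} cannot be segmented with this vocabulary")
    tokens, i, n = [], 0, len(word)
    while i < n:
        # Candidate end positions for the next token that still allow a full segmentation.
        ends = [j for j in range(i + 1, n + 1) if word[i:j] in vocab and counts[j] > 0]
        # Weighting by counts[j] makes the product of step probabilities equal 1 / counts[0].
        j = rng.choices(ends, weights=[counts[k] for k in ends], k=1)[0]
        tokens.append(word[i:j])
        i = j
    return tokens
```

With the toy vocabulary `{"a", "b", "c", "ab", "bc", "abc"}`, the word `"abc"` has four segmentations, and `sample_uniform_tokenization("abc", vocab)` returns each of them with probability 1/4. A dropout-based scheme, by contrast, induces whatever distribution its merge or matching procedure happens to produce.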



Topics

Machine Learning > Optimization & Theory > Theory
Natural Language Processing > Resources & Methods > Text Representation
Mathematics & Optimization > Mathematics > Statistics
Machine Learning > Learning Types > Representation Learning
Deep Learning > Learning Types > Representation Learning
Natural Language Processing > Applications > Text Processing

Keywords

machine translation, uniform sampling, byte-pair encoding, subword regularization, BPE tokenization, tokenization sampling, distribution analysis, stochastic dropout
