An Embarrassingly Simple Method to Mitigate Undesirable Properties of Pretrained Language Model Tokenizers

Valentin Hofmann; Hinrich Schuetze; Janet Pierrehumbert

ACL 2022

Abstract

We introduce FLOTA (Few Longest Token Approximation), a simple yet effective method to improve the tokenization of pretrained language models (PLMs). FLOTA uses the vocabulary of a standard tokenizer but tries to preserve the morphological structure of words during tokenization. We evaluate FLOTA on morphological gold segmentations as well as a text classification task, using BERT, GPT-2, and XLNet as example PLMs. FLOTA leads to performance gains, makes inference more efficient, and enhances the robustness of PLMs with respect to whitespace noise.
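The abstract does not spell out the procedure, so the following is only a minimal sketch of one plausible reading of "Few Longest Token Approximation": repeatedly pick the longest substring of a word that is in the tokenizer vocabulary, keep at most `k` such pieces, and return them in surface order. The function names, the toy vocabulary, and the parameter `k` are hypothetical and chosen for illustration; the sketch also ignores tokenizer-specific details such as continuation prefixes (for example BERT's `##`).

```python
def longest_vocab_substring(word, vocab):
    """Return (start, piece) for the longest substring of `word` found in
    `vocab`, preferring the leftmost match on ties; None if nothing matches."""
    n = len(word)
    for length in range(n, 0, -1):              # try longest pieces first
        for start in range(0, n - length + 1):  # leftmost first on ties
            piece = word[start:start + length]
            # "\0" marks characters already consumed by an earlier match.
            if "\0" not in piece and piece in vocab:
                return start, piece
    return None


def flota_like_tokenize(word, vocab, k=3):
    """Keep up to `k` longest in-vocabulary substrings of `word`.

    Assumption: this mirrors the idea named in the abstract, not the
    authors' exact implementation.
    """
    found = []
    remaining = word
    for _ in range(k):
        hit = longest_vocab_substring(remaining, vocab)
        if hit is None:
            break
        start, piece = hit
        found.append((start, piece))
        # Mask the matched span so later searches cannot overlap it.
        remaining = (
            remaining[:start] + "\0" * len(piece) + remaining[start + len(piece):]
        )
    # Sort by start position to restore the order in which pieces occur.
    return [piece for _, piece in sorted(found)]


# Hypothetical toy vocabulary, not taken from any real tokenizer.
toy_vocab = {"un", "desir", "able", "und", "es", "ira", "ble", "d", "e"}

print(flota_like_tokenize("undesirable", toy_vocab, k=3))  # ['un', 'desir', 'able']
print(flota_like_tokenize("undesirable", toy_vocab, k=2))  # ['desir', 'able']
```

With `k=2` the prefix `un` is dropped, which illustrates the "few" in the name: under this reading, a word is approximated by a small number of long pieces rather than covered exhaustively.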
