EACL 2026
ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure
Abstract
Recent work has explored pruning merges from BPE subword tokenisers, using corpus data as a signal for which merges to prune. We argue that because a BPE tokeniser contains a rich data structure on top of its vocabulary set, this structure can itself be used as a guide to modify its merges such that segmentations become more desirable. We apply this argument to one of those pruning algorithms, BPE-knockout, by introducing a new reification step that suggests new merges by inspecting the effects left by pruning. By alternating both processes iteratively until convergence, we get a new BPE tokeniser, ReBPE, which outperforms the original BPE-knockout algorithm on morphological alignment in all 14 languages tested, by over 11% F1 on average.
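To make the idea of a merge list as a data structure concrete, the following is a minimal sketch of how pruning a single merge changes a BPE segmentation. It is an illustration under stated assumptions, not the method of the paper: the toy merge list, the helper names `segment` and `knock_out`, and the choice of which merge to prune are all hypothetical, and the simplified in-order merge application stands in for a full BPE implementation. The reification step is not sketched, because the abstract does not specify how new merges are suggested.

```python
from typing import List, Tuple

Merge = Tuple[str, str]  # (left token, right token), hypothetical representation


def segment(word: str, merges: List[Merge]) -> List[str]:
    """Segment a word by applying each merge, in priority order, left to right.

    Simplified BPE inference: assumes the merge list is ordered the way it
    was learned, so earlier merges never depend on later ones.
    """
    tokens = list(word)  # start from single characters
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            # Join an adjacent (left, right) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def knock_out(merges: List[Merge], pruned: Merge) -> List[Merge]:
    """Return a copy of the merge list with one merge removed (assumed pruning step)."""
    return [m for m in merges if m != pruned]


# Toy merge list (an assumption for illustration, not taken from the paper).
merges = [("e", "r"), ("l", "o"), ("lo", "w"), ("low", "er")]

print(segment("lower", merges))                               # ['lower']
print(segment("lower", knock_out(merges, ("low", "er"))))     # ['low', 'er']
```

In this toy case, removing the merge that joins "low" and "er" leaves a segmentation that coincides with the morpheme boundary. Which merges to prune is hard-coded here; the actual selection criterion is not described in the abstract.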