EACL 2026

ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure

Abstract

Recent work has explored pruning merges from BPE subword tokenisers, using corpus data as a signal for which merges to prune. We argue that because a BPE tokeniser contains a rich data structure on top of its vocabulary set, this structure can itself be used as a guide to modify its merges such that segmentations become more desirable. We apply this argument to one of those pruning algorithms, BPE-knockout, by introducing a new reification step that suggests new merges by inspecting the effects left by pruning. By alternating the two processes until convergence, we obtain a new BPE tokeniser, ReBPE, which outperforms the original BPE-knockout algorithm on morphological alignment in all 14 languages tested, by over 11% F1 on average.
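To make the idea of "a data structure on top of the vocabulary set" concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes a BPE tokeniser is represented as an ordered list of merges, and it models pruning as naively deleting one merge. The names `segment` and `knock_out`, the toy merge list, and the example word are all hypothetical. The actual BPE-knockout procedure and the reification step are not specified in the abstract and are not reproduced here.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def segment(word: str, merges: List[Merge]) -> List[str]:
    """Segment a word by applying merges in priority order (earliest first)."""
    tokens = list(word)  # start from single characters
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(tokens):
            # Join every adjacent (left, right) pair into one token.
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


def knock_out(merges: List[Merge], target: Merge) -> List[Merge]:
    """Naive pruning (an assumption for illustration): drop one merge."""
    return [m for m in merges if m != target]


# Hypothetical toy merge list; order encodes priority.
toy_merges: List[Merge] = [("d", "o"), ("g", "s"), ("do", "g"), ("do", "gs")]

before = segment("dogs", toy_merges)
after = segment("dogs", knock_out(toy_merges, ("g", "s")))
print(before)
print(after)
```

In this toy case, the merge `("g", "s")` crosses the morpheme boundary in "dog|s". Removing it lets the lower-priority merge `("do", "g")` fire instead, so the segmentation changes from one opaque token to a split at that boundary. This is the kind of effect on segmentations that the merge structure makes visible.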
