ReBPE: Iteratively Improving the Internal Structure of a Structured Tokeniser by Mining its Internal Structure

Thomas Bauwens; Miryam de Lhoneux

EACL 2026

Abstract

Recent work has explored pruning merges from BPE subword tokenisers, using corpus data as a signal for which merges to prune. We argue that because a BPE tokeniser contains a rich data structure on top of its vocabulary set, this structure can itself be used as a guide to modify its merges such that segmentations become more desirable. We apply this argument to one of those pruning algorithms, BPE-knockout, by introducing a new reification step that suggests new merges by inspecting the effects left by pruning. By alternating both processes iteratively until convergence, we get a new BPE tokeniser, ReBPE, which outperforms the original BPE-knockout algorithm on morphological alignment in all 14 languages tested, by over 11% F1 on average.
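
To make the idea of a BPE tokeniser as "a data structure on top of its vocabulary" concrete, the sketch below treats the tokeniser as an ordered list of merges and shows how removing a single merge changes a segmentation. It is a toy illustration only: the merge list, the word, and the helper names `bpe_segment` and `knock_out` are hypothetical, the pruning here simply deletes a merge rather than following the BPE-knockout procedure, and the reification step is not sketched because the abstract does not specify it.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def bpe_segment(word: str, merges: List[Merge]) -> List[str]:
    """Segment a word by repeatedly applying the highest-priority merge.

    Earlier entries in `merges` have higher priority (standard BPE ordering).
    """
    rank = {merge: i for i, merge in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        # Collect every adjacent pair that corresponds to a known merge.
        candidates = [
            (rank[(left, right)], i)
            for i, (left, right) in enumerate(zip(tokens, tokens[1:]))
            if (left, right) in rank
        ]
        if not candidates:
            break
        # Apply the best-ranked merge (leftmost occurrence on ties).
        _, i = min(candidates)
        tokens[i:i + 2] = [tokens[i] + tokens[i + 1]]
    return tokens


def knock_out(merges: List[Merge], merge: Merge) -> List[Merge]:
    """Toy pruning (assumption): drop one merge and keep the rest in order."""
    return [m for m in merges if m != merge]


# Hypothetical toy merge list; the last merge glues a stem to its suffix.
toy_merges = [("l", "o"), ("lo", "w"), ("e", "r"), ("low", "er")]

before = bpe_segment("lower", toy_merges)
after = bpe_segment("lower", knock_out(toy_merges, ("low", "er")))
print(before, after)
```

In this toy case, removing the merge that crosses the boundary between stem and suffix leaves the word split at that boundary, which is the kind of effect on segmentations that merge pruning aims for.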

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Optimization & Theory > Optimization

Keywords

subword tokenization, morphological alignment, bpe tokeniser, tokeniser pruning
