Graphemes vs. phonemes: battling it out in character-based language models

Bastian Bunzeck; Daniel Duran; Leonie Schade; Sina Zarrieß

CoNLL 2024

Abstract

We present grapheme-llama and phoneme-llama, character-based language models trained for the 2024 BabyLM challenge. Through these models, we explore an under-researched approach to downsizing: replacing subword-based tokenization with character-level tokenization, drastically reducing the vocabulary size. The grapheme model is trained on a standard BabyLM dataset, while the phoneme model uses a phoneme-converted version of this dataset. Results show that grapheme-based models perform better overall, achieving scores comparable to subword-based models on grammatical benchmarks. Despite lower performance, phoneme models also demonstrate promising grammatical learning. We argue that our results challenge conventional wisdom on language modeling techniques and open up novel research questions with character- and phoneme-based models as objects of inquiry.
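To make the vocabulary reduction concrete, the following is a minimal sketch of character-level tokenization, not the authors' implementation. The class name `CharTokenizer`, the `<unk>` symbol, and the two-sentence toy corpus are all hypothetical and chosen only for illustration. The vocabulary holds one entry per distinct symbol seen in the corpus, so it stays at a few dozen entries where a subword vocabulary typically has thousands.

```python
from typing import Dict, List


class CharTokenizer:
    """Toy character-level tokenizer (hypothetical, for illustration only)."""

    def __init__(self, corpus: List[str], unk: str = "<unk>") -> None:
        # One vocabulary entry per distinct symbol in the corpus.
        symbols = sorted({ch for line in corpus for ch in line})
        # Index 0 is reserved for symbols never seen during construction.
        self.itos: List[str] = [unk] + symbols
        self.stoi: Dict[str, int] = {s: i for i, s in enumerate(self.itos)}

    def encode(self, text: str) -> List[int]:
        # Each character maps to exactly one id; unseen characters map to 0.
        return [self.stoi.get(ch, 0) for ch in text]

    def decode(self, ids: List[int]) -> str:
        return "".join(self.itos[i] for i in ids)

    @property
    def vocab_size(self) -> int:
        return len(self.itos)


# Assumed toy corpus; the real models are trained on the BabyLM data.
corpus = ["the cat sat", "the dog ran"]
tok = CharTokenizer(corpus)
ids = tok.encode("the cat")
roundtrip = tok.decode(ids)
```

The same sketch would apply to phoneme-converted text only under the assumption that every phoneme is written as a single symbol. The trade-off is sequence length: each input yields one token per character, so sequences are longer than under subword tokenization.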

Topics

Machine Learning > Core Methods > Representation Learning

Natural Language Processing > Resources & Methods > Text Representation

Keywords

vocabulary size, character-based language model
