EMNLP 2025

BanglaByT5: Byte-Level Modelling for Bangla

Abstract

Large language models (LLMs) have achieved remarkable success across various natural language processing tasks. However, most LLMs use traditional tokenizers like BPE and SentencePiece, which fail to capture the finer nuances of a morphologically rich language like Bangla (Bengali). In this work, we introduce BanglaByT5, the first byte-level encoder-decoder model explicitly tailored for Bangla. Built upon a small variant of Google’s ByT5 architecture, BanglaByT5 is pre-trained on a 14GB curated corpus combining high-quality literary and newspaper articles. Through zero-shot and supervised evaluations across generative and classification tasks, BanglaByT5 demonstrates competitive performance, surpassing several multilingual and larger models. Our findings highlight BanglaByT5’s potential as a lightweight yet powerful tool for Bangla NLP, particularly in resource-constrained or scalable environments. BanglaByT5 is publicly available for download from https://huggingface.co/Vacaspati/BanglaByT5.
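To make the byte-level idea concrete, the following is a minimal sketch of how a ByT5-style tokenizer could turn Bangla text into model inputs without any learned vocabulary. It assumes the special-token layout of the original ByT5 tokenizer (0 = pad, 1 = end-of-sequence, 2 = unknown, with byte values shifted by 3); whether BanglaByT5 uses exactly this layout is an assumption, and the helper names `encode_bytes` and `decode_bytes` are hypothetical.

```python
# Assumed ByT5-style layout: ids 0..2 are reserved, byte value b maps to id b + 3.
PAD_ID, EOS_ID, UNK_ID = 0, 1, 2
BYTE_OFFSET = 3


def encode_bytes(text: str) -> list[int]:
    """Map text to ids: one id per UTF-8 byte, followed by an end-of-sequence id."""
    return [b + BYTE_OFFSET for b in text.encode("utf-8")] + [EOS_ID]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping special ids and anything outside the byte range."""
    raw = bytes(
        i - BYTE_OFFSET for i in ids if BYTE_OFFSET <= i < BYTE_OFFSET + 256
    )
    return raw.decode("utf-8", errors="ignore")


word = "বাংলা"  # "Bangla", written as 5 Unicode code points
ids = encode_bytes(word)

# Each Bangla code point occupies 3 bytes in UTF-8, so the sequence is
# 5 * 3 byte ids plus 1 end-of-sequence id.
print(len(word), len(ids))
print(decode_bytes(ids) == word)
```

Because every string reduces to the same 256 byte values, no word or conjunct can fall outside the vocabulary; the cost is a longer input sequence, roughly three ids per Bangla character.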
