BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Sadia Alam; Md Farhan Ishmam; Navid Hasin Alvee; Md Shahnewaz Siddique; Md Azam Hossain; Abu Raihan Mostofa Kamal

2025 COLING COLING 2025

BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis

Abstract

AbstractThe widespread availability of code-mixed data in digital spaces can provide valuable insights into low-resource languages like Bengali, which have limited annotated corpora. Sentiment analysis, a pivotal text classification task, has been explored across multiple languages, yet code-mixed Bengali remains underrepresented with no large-scale, diverse benchmark. Code-mixed text is particularly challenging as it requires the understanding of multiple languages and their interaction in the same text. We address this limitation by introducing BnSentMix, a sentiment analysis dataset on code-mixed Bengali comprising 20,000 samples with 4 sentiment labels, sourced from Facebook, YouTube, and e-commerce sites. By aggregating multiple sources, we ensure linguistic diversity reflecting realistic code-mixed scenarios. We implement a novel automated text filtering pipeline using fine-tuned language models to detect code-mixed samples and expand code-mixed text corpora. We further propose baselines using machine learning, neural networks, and transformer-based language models. The availability of a diverse dataset is a critical step towards democratizing NLP and ultimately contributing to a better understanding of code-mixed languages.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy, Speech & Audio

Authors

Sadia Alam , Md Farhan Ishmam , Navid Hasin Alvee , Md Shahnewaz Siddique , Md Azam Hossain , Abu Raihan Mostofa Kamal

Topics

Machine Learning > Learning Types > Transfer Learning Natural Language Processing > Applications > Sentiment Analysis Deep Learning > Models > Transformers

Keywords

sentiment analysis text classification sentiment classification code-mixed text low-resource language code-mixed language bengali language transformer model

Download PDF

Related papers

Navigating Dialectal Bias and Ethical Complexities in Levantine Arabic Hate Speech Detection 2025

TaCIE: Enhancing Instruction Comprehension in Large Language Models through Task-Centred Instruction Evolution 2025

Positive Text Reframing under Multi-strategy Optimization 2025

RAM2C: A Liberal Arts Educational Chatbot based on Retrieval-augmented Multi-role Multi-expert Collaboration 2025

Two-stage Incomplete Utterance Rewriting on Editing Operation 2025