EACL 2026

XLMR-Urdu at AbjadGenEval Shared Task: A Data-Centric Transformer-Based Approach for AI-Generated Urdu Text Detection

Abstract

The rapid advancement of large language models (LLMs) has led to a substantial increase in automatically generated textual content, raising concerns regarding misinformation, plagiarism, and authorship verification. These challenges are particularly pronounced for low-resource languages such as Urdu, where limited annotated data and complex linguistic properties hinder robust detection. In this paper, we present a transformer-based approach for binary classification of human-written versus AI-generated Urdu text, developed for Task 2 of the AbjadGenEval shared task. Beyond model fine-tuning, we adopt a data-centric perspective, emphasizing dataset diagnostics, document-level inference, and calibration strategies. Our system achieves strong performance on the official test set, with an F1-score of 88.68% and balanced accuracy of 88.71%. Through empirical analysis, we demonstrate that dataset characteristics and generator-specific artifacts play a dominant role in model generalization, highlighting critical directions for future research in low-resource AI-generated text detection.
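For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how an F1-score and balanced accuracy can be computed for a binary human-versus-AI classifier. It assumes that label 1 denotes AI-generated text and that F1 is taken for that positive class; the shared task may average differently. The function name `binary_metrics` and the toy labels are hypothetical and are not taken from the system or its data.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Return positive-class F1 and balanced accuracy for binary labels.

    Assumption: `positive` marks the AI-generated class (hypothetical convention).
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard against empty denominators so degenerate inputs do not raise.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0        # true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0   # true negative rate

    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    balanced_accuracy = (recall + specificity) / 2
    return {"f1": f1, "balanced_accuracy": balanced_accuracy}


# Toy labels for illustration only (not results from the paper).
toy_true = [1, 1, 1, 1, 0, 0, 0, 0]
toy_pred = [1, 1, 1, 0, 0, 0, 0, 1]
toy_scores = binary_metrics(toy_true, toy_pred)
print(toy_scores)
```

Balanced accuracy averages the recall of the two classes, so it is not inflated when one class dominates the test set, which is one reason it is often reported alongside F1.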
