Papers
BabyBabelLM: A Multilingual Benchmark of Developmentally Plausible Training Data
Jaap Jumelet, Abdellah Fourtassi, Akari Haga et al.
Back-of-the-Book Index Automation for Arabic Documents
Nawal Haidar, Ahmad Kashmar, Fadi Zaraket
Balanced Accuracy: The Right Metric for Evaluating LLM Judges - Explained through Youden’s J statistic
Stephane Collot, Colin Fraser, Justin Zhao et al.
Balancing Fluency and Adherence: Hybrid Fallback Term Injection in Low-Resource Terminology Translation
Kurt Abela, Marc Tanti, Claudia Borg
BanglaIPA: Towards Robust Text-to-IPA Transcription with Contextual Rewriting in Bengali
Jakir Hasan, Shrestha Datta, Md Saiful Islam et al.
BanglaLlama: LLaMA for Bangla Language
Abdullah Khan Zehady, Shubhashis Roy Dipta, Naymul Islam et al.
BanglaSummEval: Reference-Free Factual Consistency Evaluation for Bangla Summarization
Ahmed Rafid, Rumman Adib, Fariya Ahmed et al.
BanSuite: A Unified Toolkit and Software Platform for Low-Resource NLP in Bangla
Md. Abu Sayed, Faisal Ahamed Khan, Jannatul Ferdous Tuli et al.
Barriers to Discrete Reasoning with Transformers: A Survey Across Depth, Exactness, and Bandwidth
Michelle Yuan, Weiyi Sun, Amir H. Rezaeian et al.
BayesFlow: A Probability Inference Framework for Meta-Agent Assisted Workflow Generation
Bo Yuan, Yun Zhou, Zhichao Xu et al.
Becoming Experienced Judges: Selective Test-Time Learning for Evaluators
Seungyeon Jwa, Daechul Ahn, Reokyoung Kim et al.
BeDiscovER: The Benchmark of Discourse Understanding in the Era of Reasoning Language Models
Chuyuan Li, Giuseppe Carenini
BeeParser at MWE-2026 PARSEME 2.0 Subtask 1: Can Cross-Lingual Interactions Improve MWE Identification?
Ahmet Erdem, Oguzhan Karaarslan
Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs
Sewon Kim, Jiwon Kim, SeungWoo Shin et al.
Benchmarking and Mitigating the Impact of Noisy User Prompts in Medical VLMs via Cross-Modal Reflection
Zhiyu Xue, Reza Abbasi-Asl, Ramtin Pedarsani
Benchmarking Direct Preference Optimization for Medical Large Vision–Language Models
Dain Kim, Jiwoo Lee, Jaehoon Yun et al.
Benchmarking Hate Speech Detection in Azerbaijani with Turkish Cross-Lingual Transfer and Transformer Models
Tural Alizada, Haim Dubossarsky
Benchmarking Offensive Language Detection in Persian and Pashto
Zahra Bokaei, Bonnie Webber, Walid Magdy
Benchmarking Temporal Reasoning and Alignment Across Chinese Dynasties
Zhenglin Wang, Jialong Wu, Pengfei Li et al.
Benchmarking the Energy Savings with Speculative Decoding Strategies
Rohit Dutta, Paramita Koley, Soham Poddar et al.
BERT, are you paying attention? Attention regularization with human-annotated rationales
Elize Herrewijnen, Dong Nguyen, Floris Bex et al.
Best-of-L: Cross-Lingual Reward Modeling for Mathematical Reasoning
Sara Rajaee, Rochelle Choenni, Ekaterina Shutova et al.
Better as Generators Than Classifiers: Leveraging LLMs and Synthetic Data for Low-Resource Multilingual Classification
Branislav Pecher, Jan Cegin, Robert Belanec et al.
Better Call CLAUSE: A Discrepancy Benchmark for Auditing LLMs Legal Reasoning Capabilities
Manan Roy Choudhury, Adithya Chandramouli, Mannan Anand et al.