Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Mahdi Mostajabdaveh; Timothy Tin Long Yu; Samarendra Chandan Bindu Dash; Rindra Ramamonjison; Jabo Serge Byusa; Giuseppe Carenini; Zirui Zhou; Yong Zhang

2025 AAAI AAAI 2025

Evaluating LLM Reasoning in the Operations Research Domain with ORQA

Abstract

Abstract In this paper, we introduce and apply Operations Research Question Answering (ORQA), a new benchmark, to assess the generalization capabilities of Large Language Models (LLMs) in the specialized technical domain of Operations Research (OR). This benchmark is designed to evaluate whether LLMs can emulate the knowledge and reasoning skills of OR experts when given diverse and complex optimization problems. The dataset, crafted by OR experts, presents real-world optimization problems that require multistep reasoning to build their mathematical models. Our evaluations of various open-source LLMs, such as LLaMA 3.1, DeepSeek, and Mixtral reveal their modest performance, indicating a gap in their aptitude to generalize to specialized technical domains. This work contributes to the ongoing discourse on LLMs’ generalization capabilities, providing insights for future research in this area. The dataset and evaluation code are publicly available.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Machine Learning and Natural Language Processing

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Mahdi Mostajabdaveh , Timothy Tin Long Yu , Samarendra Chandan Bindu Dash , Rindra Ramamonjison , Jabo Serge Byusa , Giuseppe Carenini , Zirui Zhou , Yong Zhang

Topics

Machine Learning > Optimization & Theory > Optimization Natural Language Processing > Applications > Question Answering Natural Language Processing > Resources & Methods > Large Language Models Artificial Intelligence > Core AI > Reasoning Deep Learning > Models > Large Language Models

Keywords

benchmark evaluation question answering large language model operations research

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025