Evaluating Multi-Hop Reasoning in Large Language Models: A Chemistry-Centric Benchmark

Mohammad Khodadad; Ali Shiraee Kasmaee; Mahdi Astaraki; Nicholas Sherck; Hamidreza Mahyar; Soheila Samiee

EACL 2026

Abstract

We introduce ChemComp, the first chemistry-focused benchmark for evaluating compositional multi-hop reasoning in large language models (LLMs). Our automated pipeline constructs benchmarks from proprietary or public data by integrating generative reasoning models, chemical named-entity recognition, and external knowledge bases to build knowledge graphs. Applied to recent chemistry literature, this approach minimizes overlap with LLM pretraining data. The resulting dataset comprises 1,188 multi-hop questions, refined through domain-expert feedback and robust evaluation protocols.

Using ChemComp, we systematically compare LLM performance with and without retrieval augmentation, including an idealized gold-context scenario. Our results show that even state-of-the-art models struggle with compositional reasoning: retrieval significantly improves accuracy, yet reasoning errors persist even under perfect retrieval. These findings highlight the limitations of current LLMs and the critical role of retrieval-augmented methods in scientific reasoning. Furthermore, our pipeline is generalizable with fine-tuning, enabling the creation of challenging multi-hop reasoning benchmarks across domains and proprietary datasets.
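To make "compositional multi-hop" concrete, the following is a minimal sketch of how a two-hop question could be composed from knowledge-graph triples. It is not the authors' pipeline: the triples, the relation names, the question template, and the function names (`build_graph`, `two_hop_paths`, `compose_question`) are all illustrative assumptions, and the abstract does not specify how ChemComp questions are actually phrased or generated.

```python
from collections import defaultdict

# Toy (head, relation, tail) triples. Illustrative only, not taken from ChemComp.
TRIPLES = [
    ("aspirin", "precursor", "salicylic acid"),
    ("salicylic acid", "functional group", "carboxylic acid"),
    ("ethanol", "functional group", "hydroxyl"),
]


def build_graph(triples):
    """Index triples by head entity: head -> [(relation, tail), ...]."""
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return graph


def two_hop_paths(graph):
    """Enumerate paths head -rel1-> bridge -rel2-> answer."""
    paths = []
    for head in list(graph):
        for rel1, bridge in graph[head]:
            # .get avoids creating empty entries for entities with no outgoing edges
            for rel2, answer in graph.get(bridge, []):
                paths.append((head, rel1, bridge, rel2, answer))
    return paths


def compose_question(path):
    """Hide the bridge entity so answering requires chaining both hops."""
    head, rel1, _bridge, rel2, answer = path
    question = f"What is the {rel2} of the {rel1} of {head}?"
    return question, answer


if __name__ == "__main__":
    for path in two_hop_paths(build_graph(TRIPLES)):
        question, answer = compose_question(path)
        print(question, "->", answer)
```

Because the bridge entity (here, salicylic acid) never appears in the question, a model must either recall or retrieve the first hop before it can resolve the second. That is the property the closed-book, retrieval-augmented, and gold-context comparisons in the abstract are designed to probe.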

Topics

Natural Language Processing > Applications > Question Answering
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

knowledge graph, multi-hop reasoning, retrieval augmentation, large language model
