ToolSandbox: A Stateful, Conversational, Interactive Evaluation Benchmark for LLM Tool Use Capabilities

Jiarui Lu; Thomas Holleis; Yizhe Zhang; Bernhard Aumayer; Feng Nan; Haoping Bai; Shuang Ma; Shen Ma; Mengyu Li; Guoli Yin; Zirui Wang; Ruoming Pang

NAACL 2025

Abstract

Recent advances in large language models (LLMs) have sparked growing research interest in tool-assisted LLMs that solve real-world challenges, which calls for comprehensive evaluation of tool-use capabilities. Previous work focused on evaluating over stateless web services (RESTful APIs), based on either a single-turn user prompt or an off-policy dialog trajectory. ToolSandbox instead includes stateful tool execution, implicit state dependencies between tools, a built-in user simulator supporting on-policy conversational evaluation, and a dynamic evaluation strategy for intermediate and final milestones over an arbitrary trajectory. We show that open-source and proprietary models have a significant performance gap, and that complex tasks such as State Dependency, Canonicalization, and Insufficient Information, as defined in ToolSandbox, are challenging even for the most capable SOTA LLMs, providing new insights into tool-use LLM capabilities. Datasets and evaluation scripts of ToolSandbox are released at <placeholder>.
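To make "stateful tool execution", "implicit state dependencies", and "milestones" concrete, here is a minimal Python sketch. It is not the ToolSandbox API: the `Sandbox` class, the `set_cellular` and `send_message` tools, and the `milestones_reached` helper are all hypothetical names chosen for illustration, and the in-order milestone check is a simplified stand-in for the evaluation strategy the abstract describes.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Snapshot = Dict[str, object]


@dataclass
class Sandbox:
    """Hypothetical world state shared by all tools (illustrative only)."""
    state: Snapshot = field(default_factory=lambda: {"cellular": False, "sent": []})
    history: List[Snapshot] = field(default_factory=list)

    def _snapshot(self) -> None:
        # Record a copy of the state after every tool call.
        self.history.append(
            {"cellular": self.state["cellular"], "sent": list(self.state["sent"])}
        )

    def set_cellular(self, on: bool) -> str:
        self.state["cellular"] = on
        self._snapshot()
        return "ok"

    def send_message(self, to: str, text: str) -> str:
        # Implicit state dependency: this tool only works if another
        # tool has already switched cellular service on.
        if not self.state["cellular"]:
            self._snapshot()
            return "error: cellular service is off"
        self.state["sent"].append((to, text))
        self._snapshot()
        return "ok"


def milestones_reached(
    history: List[Snapshot], milestones: List[Callable[[Snapshot], bool]]
) -> int:
    """Count how many milestones are satisfied, in order, along a trajectory."""
    reached = 0
    for snap in history:
        while reached < len(milestones) and milestones[reached](snap):
            reached += 1
    return reached


sandbox = Sandbox()
first_try = sandbox.send_message("Ada", "hi")   # fails: dependency not met
sandbox.set_cellular(True)                      # agent repairs the state
second_try = sandbox.send_message("Ada", "hi")  # now succeeds

milestones = [
    lambda s: s["cellular"] is True,   # intermediate milestone
    lambda s: len(s["sent"]) == 1,     # final milestone
]
score = milestones_reached(sandbox.history, milestones)
```

In this sketch the trajectory is scored on the states it passes through rather than on an exact sequence of tool calls, so the failed first attempt does not prevent both milestones from being reached.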

Topics

Artificial Intelligence > Core AI > Agent Systems
Natural Language Processing > Resources & Methods > Large Language Models

Keywords

tool use, agent system, large language model, conversational evaluation, stateful execution
