BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

Xiuwei Shang; Guoqiang Chen; Shaoyin Cheng; Benlong Wu; Li Hu; Gangyang Li; Weiming Zhang; Nenghai Yu

2025 IJCAI IJCAI 2025

BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

Abstract

Binary analysis is crucial for software security, offering insights into compiled programs without source code. As large language models (LLMs) excel in language tasks, their potential for complex decoding binary data structures is growing. However, the lack of standardized benchmarks hinders their evaluation and progress in this domain. To bridge this gap, we introduce BinMetric, a first comprehensive benchmark designed specifically to evaluate LLMs performance on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, including decompilation, code summarization, etc., which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. The findings indicate that while LLMs show strong potential, challenges still exist, particularly in the areas of precise binary lifting and assembly synthesis. In summary, BinMetric makes a significant step forward in measuring binary analysis capabilities of LLMs, establishing a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — code decompilation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Speech & Audio

Authors

Xiuwei Shang , Guoqiang Chen , Shaoyin Cheng , Benlong Wu , Li Hu , Gangyang Li , Weiming Zhang , Nenghai Yu

Topics

Artificial Intelligence > Core AI > Interpretability Natural Language Processing > Resources & Methods > Large Language Models Computer Science > Applications > Cybersecurity Artificial Intelligence > Core AI > Large Language Models

Keywords

benchmark evaluation binary analysis code decompilation software security large language model

Download PDF

Related papers

Learning Advanced Self-Attention for Linear Transformers in the Singular Value Domain 2025

Responsibility Anticipation and Attribution in LTLf 2025

Argument-based Multi-Issue Negotiation 2025

Online Resource Sharing: Better Robust Guarantees via Randomized Strategies 2025

Equitable Mechanism Design for Facility Location 2025