BinMetric: A Comprehensive Binary Code Analysis Benchmark for Large Language Models

Xiuwei Shang, Guoqiang Chen, Shaoyin Cheng, Benlong Wu, Li Hu, Gangyang Li, Weiming Zhang, Nenghai Yu

Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence
Main Track. Pages 7715-7723. https://doi.org/10.24963/ijcai.2025/858

Binary analysis is crucial for software security, offering insights into compiled programs without source code. As large language models (LLMs) excel in language tasks, their potential for decoding complex binary data structures is growing. However, the lack of standardized benchmarks hinders their evaluation and progress in this domain. To bridge this gap, we introduce BinMetric, the first comprehensive benchmark designed specifically to evaluate LLM performance on binary analysis tasks. BinMetric comprises 1,000 questions derived from 20 real-world open-source projects across 6 practical binary analysis tasks, such as decompilation and code summarization, which reflect actual reverse engineering scenarios. Our empirical study on this benchmark investigates various state-of-the-art LLMs, revealing their strengths and limitations. The findings indicate that while LLMs show strong potential, challenges remain, particularly in precise binary lifting and assembly synthesis. In summary, BinMetric takes a significant step forward in measuring the binary analysis capabilities of LLMs and establishes a new benchmark leaderboard, and our study offers valuable insights for advancing LLMs in software security.
Keywords:
Multidisciplinary Topics and Applications: MTA: Software engineering
Machine Learning: ML: Benchmarks
Multidisciplinary Topics and Applications: MTA: Security and privacy
Natural Language Processing: NLP: Language models