VulnBench: A Comprehensive Benchmark for Transformer-Based Vulnerability Detection

Jake Norton; David Eyers; Veronica Liesaputra

AAAI 2026

Abstract

Reproducible benchmarking of tools that automatically detect vulnerabilities in source code remains challenging because of inconsistent implementations, varying data preprocessing, and methodological flaws that compromise fair model comparison. A recent study found that 9 in 10 vulnerability detection studies use inappropriate evaluation approaches, with models achieving high scores through spurious correlations rather than actual vulnerability detection. We present VulnBench, an extensible, open-source benchmarking tool that enables fair comparison across models and datasets. Our systematic evaluation of CodeBERT, GraphCodeBERT, CodeT5 (encoder-only and full), and NatGen across eight datasets of mostly C/C++ source code reveals that proper threshold optimization can improve F1-scores by up to 54%. It also shows that F1-scores vary widely across datasets, reflecting large differences in difficulty among vulnerability datasets. By standardising evaluation protocols, VulnBench enables researchers to distinguish genuine model improvements from methodological artifacts and reduces the duplicated effort spent on reproducing results.
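The abstract does not say how the threshold optimization is carried out. As a rough illustration of the general idea only, the sketch below sweeps a decision threshold over validation-set probabilities and keeps the value with the highest F1-score, instead of using a fixed 0.5 cutoff. The function names (`f1_at_threshold`, `best_threshold`), the grid size, and the toy data are all assumptions made for this example and are not taken from VulnBench.

```python
from typing import Sequence, Tuple


def f1_at_threshold(probs: Sequence[float], labels: Sequence[int], threshold: float) -> float:
    """F1-score of the positive (vulnerable) class when predicting 1 iff prob >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = 1 if p >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1
        elif pred == 1 and y == 0:
            fp += 1
        elif pred == 0 and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    # With no true positives, false positives, or false negatives, F1 is defined as 0 here.
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs: Sequence[float], labels: Sequence[int], steps: int = 99) -> Tuple[float, float]:
    """Grid-search thresholds strictly between 0 and 1; return (threshold, F1).

    Starts from the conventional 0.5 cutoff and only moves away from it
    when another grid point gives a strictly higher F1.
    """
    best_t, best_f1 = 0.5, f1_at_threshold(probs, labels, 0.5)
    for i in range(1, steps + 1):
        t = i / (steps + 1)
        f1 = f1_at_threshold(probs, labels, t)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical validation scores from a poorly calibrated classifier:
# every probability is below 0.5, so the default cutoff predicts no positives.
val_probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
val_labels = [0, 0, 1, 0, 1, 1]
default_f1 = f1_at_threshold(val_probs, val_labels, 0.5)
tuned_threshold, tuned_f1 = best_threshold(val_probs, val_labels)
```

In a fair comparison the threshold would be chosen on a validation split and then applied unchanged to the test split, so that the tuned score does not leak test information.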

Authors

Jake Norton, David Eyers, Veronica Liesaputra

Topics

Machine Learning > Core Methods > Classification
Deep Learning > Architectures > Transformers

Keywords

vulnerability detection, code security, fair evaluation
