FuzzAug: Data Augmentation by Coverage-guided Fuzzing for Neural Test Generation

Yifeng He; Jicheng Wang; Yuyang Rong; Hao Chen

EMNLP 2025

Abstract

Testing is essential to modern software engineering for building reliable software. Given the high costs of manually creating test cases, automated test case generation, particularly methods utilizing large language models, has become increasingly popular. These neural approaches generate semantically meaningful tests that are more maintainable compared with traditional automated testing methods such as fuzzing. However, the diversity and volume of unit tests in current datasets are limited, especially for newer but important languages. In this paper, we present a novel data augmentation technique, *FuzzAug*, that brings the benefits of fuzzing to large language models by incorporating valid testing semantics and providing diverse coverage-guided inputs. Doubling the size of training datasets, FuzzAug improves performance over the baselines significantly. This technique demonstrates the potential of introducing prior knowledge from dynamic software analysis to improve neural test generation, offering significant enhancements in this task. Our code is open-sourced at https://github.com/SecurityLab-UCD/FuzzAug.
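The general idea of turning coverage-guided inputs into extra unit-test data can be illustrated with a toy sketch. This is not the authors' implementation: the function under test (`classify`), the integer mutation strategy, the line-coverage tracer, and the test template are all hypothetical choices made for illustration, assuming only that an input is kept when it exercises code no earlier input reached.

```python
import random
import sys


def classify(n: int) -> str:
    """Toy function under test (hypothetical, not from the paper)."""
    if n < 0:
        return "negative"
    if n == 0:
        return "zero"
    if n % 2 == 0:
        return "even"
    return "odd"


def run_with_coverage(func, arg):
    """Run func(arg); return (result, set of line numbers executed inside func)."""
    lines = set()
    code = func.__code__

    def tracer(frame, event, _arg):
        # Only trace frames that belong to the function under test.
        if frame.f_code is code:
            if event == "line":
                lines.add(frame.f_lineno)
            return tracer
        return None

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(arg)
    finally:
        sys.settrace(previous)
    return result, lines


def fuzz_inputs(func, seeds, rounds=200, rng_seed=0):
    """Keep only inputs that reach lines no earlier input reached.

    Assumes `seeds` is non-empty and that inputs are integers.
    """
    rng = random.Random(rng_seed)
    seen_lines = set()
    corpus = []  # list of (input, observed output)

    def consider(value):
        result, lines = run_with_coverage(func, value)
        if not lines <= seen_lines:  # at least one new line was executed
            seen_lines.update(lines)
            corpus.append((value, result))

    for seed in seeds:
        consider(seed)
    for _ in range(rounds):
        parent, _ = rng.choice(corpus)
        consider(parent + rng.randint(-10, 10))  # simple integer mutation
    return corpus


def render_tests(func_name, corpus):
    """Fill a (hypothetical) unit-test template with each kept input."""
    return [
        f"def test_{func_name}_{i}():\n"
        f"    assert {func_name}({value!r}) == {result!r}\n"
        for i, (value, result) in enumerate(corpus)
    ]


if __name__ == "__main__":
    for test_source in render_tests("classify", fuzz_inputs(classify, seeds=[5])):
        print(test_source)
```

Each rendered string is a small, syntactically valid test whose input was selected because it added coverage. In an augmentation setting, such samples would be appended to an existing training set of unit tests. Which inputs are kept depends on the random seed and the mutation range.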

Topics

Artificial Intelligence > Core AI > Model Compression; Machine Learning > Optimization & Theory > Optimization; Machine Learning > Application Areas > Data Augmentation; Computer Science > Applications > Software Engineering; Machine Learning > Learning Types > Data Augmentation

Keywords

data augmentation; software testing; fuzz testing; test case generation; unit testing; large language model; coverage-guided fuzzing; neural test generation
