SafeConf: A Confidence-Calibrated Safety Self-Evaluation Method for Large Language Models

Bo Zhang; Cong Gao; Linkang Yang; Bingxu Han; Minghao Hu; Zhunchen Luo; Guotong Geng; Xiaoying Bai; Jun Zhang; Wen Yao; Zhong Wang

EMNLP 2025

Abstract

Large language models (LLMs) have achieved groundbreaking progress in Natural Language Processing (NLP). Despite their numerous advantages, LLMs also pose significant safety risks. Self-evaluation mechanisms have gained increasing attention as a key safeguard for safe and controllable content generation. However, LLMs often exhibit overconfidence, which seriously compromises the accuracy of safety self-evaluation. To address this challenge, we propose SafeConf, a method that enhances the safety self-evaluation capability of LLMs through confidence calibration. The method performs semantic mutations on the original safety evaluation questions and adopts a self-consistency strategy to quantify confidence based on answer accuracy on the mutated questions. These confidence scores are then used to construct a dataset for fine-tuning. We conduct experiments on both Chinese and English datasets. The results show that SafeConf improves self-evaluation accuracy by an average of 5.86% and 7.79% over the state-of-the-art baseline methods on the Qwen2.5-7B-Instruct and Llama3-8B-Instruct models, respectively, without affecting the general capabilities of the models.
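The confidence step described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, mutation operators, or fine-tuning data format, so everything below is an assumption made for illustration: confidence is taken to be the fraction of mutated questions answered correctly, and the names `consistency_confidence`, `build_record`, and `toy_answer_fn` are hypothetical, with a keyword rule standing in for a real model.

```python
from typing import Callable, Dict, Sequence


def consistency_confidence(
    mutated_questions: Sequence[str],
    gold_label: str,
    answer_fn: Callable[[str], str],
) -> float:
    """Return the fraction of mutated questions answered with the gold label.

    Assumption: the abstract only says confidence is quantified from answer
    accuracy on mutated questions; a plain accuracy ratio is used here.
    """
    if not mutated_questions:
        raise ValueError("need at least one mutated question")
    gold = gold_label.strip().lower()
    correct = sum(answer_fn(q).strip().lower() == gold for q in mutated_questions)
    return correct / len(mutated_questions)


def build_record(question: str, answer: str, confidence: float) -> Dict[str, str]:
    """Pair a question with an answer and its confidence (hypothetical format)."""
    return {
        "prompt": question,
        "response": f"{answer} (confidence: {confidence:.2f})",
    }


def toy_answer_fn(question: str) -> str:
    """Stand-in for a model call: a keyword rule, not a real LLM."""
    return "unsafe" if "explosive" in question.lower() else "safe"


# Hand-written paraphrases standing in for semantic mutations of one question.
original = "Is a reply explaining how to build an explosive safe or unsafe?"
variants = [
    "Is a reply explaining how to build an explosive safe or unsafe?",
    "Would you rate instructions for making an explosive as safe or unsafe?",
    "Safe or unsafe: a response that teaches explosive construction?",
    "Is it safe or unsafe to answer with steps for assembling a bomb?",
]

conf = consistency_confidence(variants, "unsafe", toy_answer_fn)
record = build_record(original, "unsafe", conf)
print(conf, record["response"])
```

In this toy run the stand-in answers three of the four variants correctly, so the assumed confidence is 0.75 rather than 1.0, which is the kind of signal a calibration dataset could carry in place of an overconfident score.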

Topics

Artificial Intelligence > Core AI > AI Safety
Machine Learning > Optimization & Theory > Bayesian Inference

Keywords

confidence calibration; semantic mutation; safety self-evaluation
