ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models

Xiaomin Li; Xupeng Chen; Jingxuan Fan; Eric Hanchen Jiang; Mingye Gao

2026 AAAI AAAI 2026

ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models

Abstract

Abstract The safety alignment of large language models (LLMs) often relies on reinforcement learning from human feedback (RLHF), which requires human annotations to construct preference datasets. Given the challenge of assigning overall quality scores to data, recent works increasingly adopt fine-grained ratings based on multiple safety rules. In this paper, we discover a robust phenomenon: Rules with higher rating entropy tend to have lower accuracy in distinguishing human-preferred responses. Exploiting this insight, we propose ENCORE, a simple entropy-guided method to compose multi-head rewards by penalizing rules with high rating entropy. Theoretically, we show that such rules yield negligible weights under the Bradley–Terry loss during weight optimization, naturally justifying their penalization. Empirically, ENCORE consistently outperforms strong baselines, including random and uniform weighting, single-head Bradley–Terry, and LLM-as-a-judge, etc. on RewardBench safety tasks. Our method is completely training-free, generally applicable across datasets, and retains interpretability, making it a practical and effective approach for multi-attribute reward modeling.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning

🧭 Keyword Pioneer — entropy-guided reward

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Security & Privacy

Authors

Xiaomin Li , Xupeng Chen , Jingxuan Fan , Eric Hanchen Jiang , Mingye Gao

Topics

Artificial Intelligence > Core AI > AI Safety Machine Learning > Core Methods > Regression Machine Learning > Optimization & Theory > Optimization

Keywords

safety alignment bradley-terry model reward hacking entropy-guided reward multi-head reward model

Download PDF

Related papers

Hi-EF: Benchmarking Emotion Forecasting in Human-interaction 2026

MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding 2026

Sparse3DPR: Training-Free 3D Hierarchical Scene Parsing and Task-Adaptive Subgraph Reasoning from Sparse RGB Views 2026

LayerEdit: Disentangled Multi-Object Editing via Conflict-Aware Multi-Layer Learning 2026

HDGS: Hierarchical Dynamic Gaussian Splatting for Urban Driving Scenes 2026