Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Tiejin Chen; Xiaoou Liu; Vishnu Nandam; Kuan-Ru Liou; Hua Wei

2026 EACL EACL 2026

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment

Abstract

AbstractPreference-based alignment like Reinforcement Learning from Human Feedback (RLHF) learns from pairwise preferences, yet the labels are often noisy and inconsistent. Existing uncertainty-aware approaches weight preferences, but ignore a more fundamental factor: the reliability of the answers being compared. To address the problem, we propose Conformal Feedback Alignment (CFA), a framework that grounds preference weighting in the statistical guarantees of Conformal Prediction (CP). CFA quantifies answer-level reliability by constructing conformal prediction sets with controllable coverage and aggregates these reliabilities into principled weights for both DPO- and PPO-style training. Experiments across different datasets show that CFA improves alignment robustness and data efficiency, highlighting that modeling answer-side uncertainty complements preference-level weighting and yields more robust, data-efficient alignment.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Reinforcement Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Tiejin Chen , Xiaoou Liu , Vishnu Nandam , Kuan-Ru Liou , Hua Wei

Topics

Artificial Intelligence > Core AI > AI Safety Machine Learning > Optimization & Theory > Bayesian Inference Reinforcement Learning > Methods > Deep RL

Keywords

conformal prediction uncertainty quantification preference learning language model alignment reinforcement learning from human feedback

Download PDF

Related papers

Investigating Gender Stereotypes in Large Language Models via Social Determinants of Health 2026

A Benchmark for Audio Reasoning Capabilities of Multimodal Large Language Models 2026

InfiGUIAgent: A Multimodal Generalist GUI Agent with Native Reasoning and Reflection 2026

Generative Personality Simulation via Theory-Informed Structured Interview 2026

Word Surprisal Correlates with Sentential Contradiction in LLMs 2026