Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety

Denis Janiak; Julia Moska; Dawid Motyka; Karolina Seweryn; Paweł Walkowiak; Bartosz Żuk; Arkadiusz Janz

EACL 2026

Abstract

Large language models (LLMs) require careful alignment to balance competing objectives: factuality, safety, conciseness, proactivity, and diversity. Existing studies focus on individual techniques or specific dimensions and lack a holistic assessment of the inherent trade-offs. We propose a unified evaluation framework that compares LLM alignment methods (PPO, DPO, ORPO, KTO) across these five axes, using both in-distribution and out-of-distribution datasets. Leveraging a specialized LLM-as-Judge prompt, validated through human studies, we reveal that DPO and KTO excel in factual accuracy, PPO and DPO lead in safety, and PPO best balances conciseness with proactivity. Our findings provide insights into the trade-offs of common alignment methods, guiding the development of more balanced and reliable LLMs.

Topics

Artificial Intelligence > Core AI > AI Safety
Artificial Intelligence > Core AI > Responsible AI

Keywords

factual accuracy; preference optimization; language model; safety evaluation; alignment method
