HARM: Learning Hate-Aware Reward Model for Evaluating Natural Language Explanations of Offensive Content

Lorenzo Puppi Vecchi; Alceu De Souza Britto Jr.; Emerson Cabrera Paraiso; Rafael M. O. Cruz

EACL 2026

Abstract

Explaining why content is hateful using natural language is crucial for fostering transparency in automated content moderation systems. However, evaluating the quality of such explanations remains an open challenge. General-purpose reward models (RMs), commonly used for scoring natural language outputs, are typically optimized for broad notions of safety. We argue that this optimization penalizes explanations in which references to stereotypes or offensive content are essential to explanatory fidelity. To address this gap, we introduce SBIC-Explain, a human-validated dataset of 370,788 LLM-generated natural language explanations (NLEs) for offensive content, spanning three levels of human-annotated contextual richness: Tier 1: text-only, Tier 2: + classification-aware, and Tier 3: + semantics-informed. We hypothesize that as human-annotated context increases, explanations should exhibit higher perceived explanatory fidelity. Yet we find that existing RMs systematically assign lower scores to more contextually rich (and often more offensive) explanations, revealing a misalignment between model preferences and explanatory fidelity in this setting. We propose HARM (Hate-Aware Reward Model), an RM that integrates interpretable signals to better align reward scores with the needs of hate speech explanation. HARM outperforms general-purpose baselines, improving NLE pairwise preference. Available at: https://github.com/Lorenzo815/HARM.
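The abstract does not spell out how pairwise preference is measured, so the following is only a minimal sketch under stated assumptions: each pair holds reward scores for two explanations of the same post, the explanation built from the richer tier is treated as the preferred one, and a pair counts as correct when the reward model scores the preferred explanation strictly higher. The function name `pairwise_preference_accuracy` and the toy scores are hypothetical and do not come from the paper or its repository.

```python
from typing import Sequence


def pairwise_preference_accuracy(
    preferred_scores: Sequence[float],
    rejected_scores: Sequence[float],
) -> float:
    """Fraction of pairs where the preferred explanation gets the higher reward.

    Assumption (not from the paper): ties count as errors.
    """
    if len(preferred_scores) != len(rejected_scores):
        raise ValueError("score lists must have the same length")
    if len(preferred_scores) == 0:
        raise ValueError("at least one pair is required")
    wins = sum(1 for p, r in zip(preferred_scores, rejected_scores) if p > r)
    return wins / len(preferred_scores)


# Toy, made-up reward scores for four posts (illustrative only).
tier3_scores = [0.2, 0.9, 0.4, 0.7]  # richer-context explanations (assumed preferred)
tier1_scores = [0.5, 0.3, 0.4, 0.1]  # text-only explanations

accuracy = pairwise_preference_accuracy(tier3_scores, tier1_scores)
print(accuracy)  # 0.5: two of the four pairs rank the richer explanation higher
```

Under this reading, a reward model that systematically penalizes richer, more offensive explanations would score well below 1.0 on such pairs, which is the kind of misalignment the abstract describes.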

Topics

Artificial Intelligence > Core AI > Interpretability
Natural Language Processing > Applications > Text Classification

Keywords

reward model, hate speech detection, natural language explanation, offensive content, explanatory fidelity
