Moral Self-correction is Not An Innate Capability in Language Models

Guangliang Liu; Zimo Qi; Xitong Zhang; Lu Cheng; Kristen Johnson

2025 IJCNLP IJCNLP 2025

Moral Self-correction is Not An Innate Capability in Language Models

Abstract

AbstractAlthough there has been growing interest in the self-correction capability of Large Language Models (LLMs), there are varying conclusions about its effectiveness.Prior research has largely concentrated on intrinsic self-correction, extrinsic self-correction, particularly the interplay between internal knowledge and external feedback, remains underexplored. In this paper, we aim to comprehensively investigate the underlying mechanism of moral self-correction by addressing a fundamental question: is moral self-correction an innate capability of LLMs? Specifically, we conduct: (1) a behavioral analysis of LLMs’ moral sensitivity based on a self-distinguishing task; and (2) a mechanistic analysis of the hidden states to examine how key components of self-correction, such as Chain-of-Thought (CoT) and external feedback, interact to facilitate moral self-correction. Drawing on empirical evidence from both behavioral and mechanistic analyses, we demonstrate that moral self-correction is not an inherent capability of LLMs, as they are neither morally sensitive nor able to effectively incorporate external feedback during the self-correction process.

🧭 Keyword Pioneer — external feedback

🐝 Cross-Pollinator — Artificial Intelligence, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Natural Language Processing