Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness

Xiaoyu Tan; Shaojie Shi; Xihe Qiu; Chao Qu; Zhenting Qi; Yinghui Xu; Yuan Qi

2023 EMNLP EMNLP 2023

Self-Criticism: Aligning Large Language Models with their Understanding of Helpfulness, Honesty, and Harmlessness

Abstract

AbstractRecently, there has been a notable surge in the significance of large language models (LLMs) that engage in conversational-style interactions, such as ChatGPT and Claude, as they contribute significantly to the progress of artificial general intelligence (AGI). Typically, these models undergo a two-phase fine-tuning process: instruction fine-tuning (IF) and reinforcement learning from human feedback (RLHF). These methods aim to align the LLMs to be helpful, honest, and harmless (HHH). However, RLHF, which incorporates independent reward models trained on high-quality human feedback datasets, incurs high costs in terms of hardware resources and human efforts. Therefore, we explore the possibility of aligning LLMs with their own understanding of HHH through IF and in-context learning (ICL). In this study, we propose a novel framework called Self-Criticism, which allows LLMs to align themselves with HHH based on the definition they learned from a large-scale text corpus. We begin by employing IF on a given instruction set and learning HHH discrimination through few-shot ICL. Subsequently, the LLMs evaluate their own generated responses and learn to produce “better” responses based on self-judgment. Finally, the model is retrained based on the self-generated responses to distill the whole process. By analyzing our proposed method, we also find interesting connections between Self-Criticism and goal-conditioned reinforcement learning, and pseudo-labeling. Experimental results demonstrate that this method achieves nearly identical performance to RLHF in terms of both human evaluation and evaluation by other LLMs, with only a minimal alignment tax.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Machine Learning and Natural Language Processing

🐣 Hot Topic Early Bird — instruction fine-tuning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Xiaoyu Tan , Shaojie Shi , Xihe Qiu , Chao Qu , Zhenting Qi , Yinghui Xu , Yuan Qi

Topics

Artificial Intelligence > Core AI > AI Safety Machine Learning > Learning Types > Self-Supervised Learning Natural Language Processing > Resources & Methods > Large Language Models

Keywords

in-context learning reinforcement learning from human feedback instruction fine-tuning large language model

Download PDF

Related papers

Exploring Linguistic Probes for Morphological Generalization 2023

NameGuess: Column Name Expansion for Tabular Data 2023

Vision-Enhanced Semantic Entity Recognition in Document Images via Visually-Asymmetric Consistency Learning 2023

Improving Conversational Recommendation Systems via Bias Analysis and Language-Model-Enhanced Data Augmentation 2023

On the Calibration of Large Language Models and Alignment 2023