Course-Correction: Safety Alignment Using Synthetic Preferences

Rongwu Xu; Yishuo Cai; Zhenhong Zhou; Renjie Gu; Haiqin Weng; Liu Yan; Tianwei Zhang; Wei Xu; Han Qiu

2024 EMNLP EMNLP 2024

Course-Correction: Safety Alignment Using Synthetic Preferences

Abstract

AbstractThe risk of harmful contents generated by large language models (LLMs) becomes a critical concern. This paper systematically evaluates and enhances LLMs’ capability to perform course-correction, , the model can steer away from generating harmful content autonomously. First, we introduce the C2-Eval benchmark for quantitative assessment and analyze 10 popular LLMs, revealing varying proficiency of current safety-tuned LLMs in course-correction.To improve, we propose fine-tuning LLMs with preference learning, emphasizing the preference for timely course-correction. Using an automated pipeline, we create C2-Syn, a synthetic C2-Syn with 750K pairwise preferences, to teach models the concept of timely course-correction through data-driven learning.Experiments on Llama2-Chat 7B and Qwen2 7B show that our method effectively enhances course-correction skills without affecting general performance. Additionally, it effectively improves LLMs’ safety, particularly in resisting jailbreak attacks.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Rongwu Xu , Yishuo Cai , Zhenhong Zhou , Renjie Gu , Haiqin Weng , Liu Yan , Tianwei Zhang , Wei Xu , Han Qiu

Topics

Artificial Intelligence > Core AI > AI Safety Artificial Intelligence > Core AI > Responsible AI Deep Learning > Learning Types > Fine-Tuning

Keywords

preference learning harmful content safety alignment synthetic datum jailbreak attack large language model

Download PDF

Related papers

EmbodiedBERT: Cognitively Informed Metaphor Detection Incorporating Sensorimotor Information 2024

Mitigating Matthew Effect: Multi-Hypergraph Boosted Multi-Interest Self-Supervised Learning for Conversational Recommendation 2024

Learning to Extract Structured Entities Using Language Models 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis 2024

CSSL: Contrastive Self-Supervised Learning for Dependency Parsing on Relatively Free Word Ordered and Morphologically Rich Low Resource Languages 2024