InferAligner: Inference-Time Alignment for Harmlessness through Cross-Model Guidance

Pengyu Wang; Dong Zhang; Linyang Li; Chenkun Tan; Xinghao Wang; Mozhi Zhang; Ke Ren; Botian Jiang; Xipeng Qiu

EMNLP 2024

Abstract

As large language models (LLMs) rapidly evolve, they are increasingly customized through fine-tuning to suit the specific needs of various applications. A critical aspect of this advancement is the alignment process, which ensures that these models perform tasks in ways that align with human values and expectations. Current alignment methods, such as direct preference optimization (DPO) and reinforcement learning from human feedback (RLHF), focus primarily on alignment during the training phase. However, these methods often involve complex and resource-intensive training processes, which makes them difficult to implement. We therefore propose InferAligner, a simple yet effective method for harmlessness alignment during the inference phase. InferAligner decouples harmlessness from helpfulness. During the training phase, it focuses solely on enhancing the target model’s capabilities on downstream tasks. During the inference phase, it uses safety steering vectors extracted from an aligned model to guide the target model towards harmlessness alignment. Experimental results show that our method can be applied effectively to domain-specific models in finance, medicine, and mathematics, as well as to multimodal large language models (MLLMs) such as LLaVA. It significantly reduces the attack success rate (ASR) of both harmful instructions and jailbreak instructions, while leaving performance on downstream tasks almost unchanged.
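The inference-time idea described in the abstract can be illustrated with a small sketch. The abstract does not say how the safety steering vectors are extracted or applied, so everything below is an assumption made for illustration: the vector is taken as the difference between mean activations on harmful and harmless prompts in the aligned model, it is added to the target model's hidden state only when a simple projection gate fires, and ASR is computed as the fraction of attack prompts judged successful. All function names and parameters (`alpha`, `threshold`) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence

Vector = List[float]


def mean_vector(rows: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized activation vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def safety_steering_vector(harmful_acts: Sequence[Sequence[float]],
                           harmless_acts: Sequence[Sequence[float]]) -> Vector:
    """Assumed extraction: mean(harmful) - mean(harmless) in the aligned model."""
    mu_harmful = mean_vector(harmful_acts)
    mu_harmless = mean_vector(harmless_acts)
    return [a - b for a, b in zip(mu_harmful, mu_harmless)]


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    """Plain dot product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def steer(hidden: Sequence[float], ssv: Sequence[float],
          alpha: float = 1.0, threshold: float = 0.0) -> Vector:
    """Shift a target-model hidden state along ssv (assumed additive update).

    The gate is an assumption: the state is left untouched unless its
    projection onto ssv exceeds `threshold`, so benign inputs pass through.
    """
    if dot(hidden, ssv) <= threshold:
        return list(hidden)
    return [h + alpha * s for h, s in zip(hidden, ssv)]


def attack_success_rate(judgements: Sequence[bool]) -> float:
    """Fraction of attack prompts judged to have elicited a harmful response."""
    return sum(judgements) / len(judgements)


if __name__ == "__main__":
    # Toy 2-dimensional activations, invented purely for illustration.
    ssv = safety_steering_vector([[1.0, 0.0], [3.0, 0.0]],
                                 [[0.0, 1.0], [0.0, 1.0]])
    print(ssv)
    print(steer([1.0, 0.0], ssv, alpha=0.5))  # gate fires, state is shifted
    print(steer([0.0, 1.0], ssv, alpha=0.5))  # gate does not fire, unchanged
    print(attack_success_rate([True, False, False, False]))
```

In a real model the activations would come from intermediate layers of the aligned and target models, and the choice of layers, the scaling `alpha`, and the gating rule would all need to be taken from the paper rather than from this sketch.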

Topics

Artificial Intelligence > Core AI > AI Safety
Artificial Intelligence > Core AI > Responsible AI
Deep Learning > Techniques > Model Architecture
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Knowledge Distillation

Keywords

direct preference optimization; model safety; reinforcement learning from human feedback; inference-time alignment; steering vector; attack success rate; large language model; harmlessness alignment; safety steering vector; cross-model guidance
