SLM as Guardian: Pioneering AI Safety with Small Language Model

Ohjoon Kwon; DongHyeon Jeon; Nayoung Choi; Gyu-Hwung Cho; Hwiyeol Jo; Changbong Kim; Hyunwoo Lee; Inho Kang; Sun Kim; Taiwoo Park

EMNLP 2024

Abstract

Most prior safety research on large language models (LLMs) has focused on enhancing the alignment of LLMs to better suit the safety requirements of their use cases. However, internalizing such safeguard features into larger models brought challenges of higher training cost and unintended degradation of helpfulness. In this paper, we leverage a smaller LLM for both harmful query detection and safeguard response generation. We introduce our safety requirements and the taxonomy of harmfulness categories, and then propose a multi-task learning mechanism that fuses the two tasks into a single model. We demonstrate the effectiveness of our approach, which performs on par with or better than publicly available LLMs on both harmful query detection and safeguard response generation.

Topics

Artificial Intelligence > Core AI > Agent Systems

Artificial Intelligence > Core AI > AI Safety

Artificial Intelligence > Core AI > Responsible AI

Deep Learning > Learning Types > Multi-Task Learning

Keywords

multi-task learning, AI safety, harmful content detection, small language model, harmful query detection, safeguard response, safeguard response generation
