Perception-Guided Jailbreak Against Text-to-Image Models

Yihao Huang; Le Liang; Tianlin Li; Xiaojun Jia; Run Wang; Weikai Miao; Geguang Pu; Yang Liu

2025 AAAI AAAI 2025

Perception-Guided Jailbreak Against Text-to-Image Models

Abstract

Abstract In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, security concerns have emerged due to their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, we propose identifying a safe phrase that is similar in human perception yet inconsistent in text semantics with the target unsafe word and using it as a substitution. The experiments conducted on six open-source models and commercial online services with thousands of prompts have verified the effectiveness of PGJ.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Yihao Huang , Le Liang , Tianlin Li , Xiaojun Jia , Run Wang , Weikai Miao , Geguang Pu , Yang Liu

Topics

Artificial Intelligence > Core AI > Responsible AI Machine Learning > Learning Types > Adversarial Learning Computer Vision > Generation > Image Generation Artificial Intelligence > Core AI > Adversarial Learning Deep Learning > Learning Types > Adversarial Learning

Keywords

prompt engineering adversarial attack text-to-image model jailbreak attack adversarial prompt large language model unsafe content

Download PDF

Related papers

BEV-TSR: Text-Scene Retrieval in BEV Space for Autonomous Driving 2025

APIRL: Deep Reinforcement Learning for REST API Fuzzing 2025

Anywhere: A Multi-Agent Framework for User-Guided, Reliable, and Diverse Foreground-Conditioned Image Generation 2025

3CAD: A Large-Scale Real-World 3C Product Dataset for Unsupervised Anomaly Detection 2025

Collaborative Learning for 3D Hand-Object Reconstruction and Compositional Action Recognition from Egocentric RGB Videos Using Superquadrics 2025