Safety

317 directly classified papers

Papers

Making Harmful Behaviors Unlearnable for Large Language Models ACL 2024

SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models ACL 2024

All Languages Matter: On the Multilingual Safety of LLMs ACL 2024

Subtle Signatures, Strong Shields: Advancing Robust and Imperceptible Watermarking in Large Language Models ACL 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models ACL 2024

A Chinese Dataset for Evaluating the Safeguards in Large Language Models ACL 2024

UNIWIZ: A Unified Large Language Model Orchestrated Wizard for Safe Knowledge Grounded Conversations ACL 2024

Realistic Evaluation of Toxicity in Large Language Models ACL 2024

On the Hallucination in Simultaneous Machine Translation ACL 2024

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! ACL 2024

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety ACL 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs ACL 2024

PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks EMNLP 2024

Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing EMNLP 2024

Athena: Safe Autonomous Agents with Verbal Contrastive Learning EMNLP 2024

Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution EMNLP 2024

Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models EMNLP 2024

ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average EMNLP 2024

WebOlympus: An Open Platform for Web Agents on Live Websites EMNLP 2024

Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations EMNLP 2024

Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning NeurIPS 2024

Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion? CVPR 2024

Backdoor Defense via Test-Time Detecting and Repairing CVPR 2024

Defending Jailbreak Prompts via In-Context Adversarial Game EMNLP 2024

Jailbreaking LLMs with Arabic Transliteration and Arabizi EMNLP 2024