Research Explorer
Artificial Intelligence › Core AI › Safety
317 directly classified papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 2
Papers
Making Harmful Behaviors Unlearnable for Large Language Models
ACL 2024
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
ACL 2024
All Languages Matter: On the Multilingual Safety of LLMs
ACL 2024
Subtle Signatures, Strong Shields: Advancing Robust and Imperceptible Watermarking in Large Language Models
ACL 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
ACL 2024
A Chinese Dataset for Evaluating the Safeguards in Large Language Models
ACL 2024
UNIWIZ: A Unified Large Language Model Orchestrated Wizard for Safe Knowledge Grounded Conversations
ACL 2024
Realistic Evaluation of Toxicity in Large Language Models
ACL 2024
On the Hallucination in Simultaneous Machine Translation
ACL 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
ACL 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
ACL 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
ACL 2024
PKAD: Pretrained Knowledge is All You Need to Detect and Mitigate Textual Backdoor Attacks
EMNLP 2024
Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing
EMNLP 2024
Athena: Safe Autonomous Agents with Verbal Contrastive Learning
EMNLP 2024
Survival of the Safest: Towards Secure Prompt Optimization through Interleaved Multi-Objective Evolution
EMNLP 2024
Don’t be my Doctor! Recognizing Healthcare Advice in Large Language Models
EMNLP 2024
ULMR: Unlearning Large Language Models via Negative Response and Model Parameter Average
EMNLP 2024
WebOlympus: An Open Platform for Web Agents on Live Websites
EMNLP 2024
Safety Arithmetic: A Framework for Test-time Safety Alignment of Language Models by Steering Parameters and Activations
EMNLP 2024
Test Where Decisions Matter: Importance-driven Testing for Deep Reinforcement Learning
NeurIPS 2024
Can Protective Perturbation Safeguard Personal Data from Being Exploited by Stable Diffusion?
CVPR 2024
Backdoor Defense via Test-Time Detecting and Repairing
CVPR 2024
Defending Jailbreak Prompts via In-Context Adversarial Game
EMNLP 2024
Jailbreaking LLMs with Arabic Transliteration and Arabizi
EMNLP 2024