Safety
317 directly classified papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 2
Papers
GuardBench: A Large-Scale Benchmark for Guardrail Models
EMNLP 2024
Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification
EMNLP 2024
From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking
EMNLP 2024
CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference
EMNLP 2024
Distract Large Language Models for Automatic Jailbreak Attack
EMNLP 2024
MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance
EMNLP 2024
BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting
EMNLP 2024
Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction
EMNLP 2024
MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models
NIPS 2024
Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models
NIPS 2024
Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models
EMNLP 2024
Red Teaming Language Models for Processing Contradictory Dialogues
EMNLP 2024
Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights
EMNLP 2024
Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis
EMNLP 2024
Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering
EMNLP 2024
ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings
EMNLP 2024
VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch
AAAI 2024
Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition
NIPS 2024
Mission Impossible: A Statistical Perspective on Jailbreaking LLMs
NIPS 2024
Simplifying Constraint Inference with Inverse Reinforcement Learning
NIPS 2024
Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration
AAAI 2024
The Art of Saying No: Contextual Noncompliance in Language Models
NIPS 2024
Unelicitable Backdoors via Cryptographic Transformer Circuits
NIPS 2024
JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models
NIPS 2024
ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation
NIPS 2024