Research Explorer
Safety
317 directly classified papers
Papers per year
2016: 1
2017: 1
2018: 4
2019: 8
2020: 11
2021: 21
2022: 29
2023: 36
2024: 87
2025: 117
2026: 2
Papers
Pure-Past Action Masking
AAAI 2024
Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders
ACL 2024
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification
ACL 2024
Reward Certification for Policy Smoothed Reinforcement Learning
AAAI 2024
CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion
ACL 2024
Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions
ACL 2024
Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing
AAAI 2024
Athena: Safe Autonomous Agents with Verbal Contrastive Learning
EMNLP 2024
Making Harmful Behaviors Unlearnable for Large Language Models
ACL 2024
SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models
ACL 2024
DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation
AAAI 2024
All Languages Matter: On the Multilingual Safety of LLMs
ACL 2024
Subtle Signatures, Strong Shields: Advancing Robust and Imperceptible Watermarking in Large Language Models
ACL 2024
Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation
AAAI 2024
SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models
ACL 2024
A Chinese Dataset for Evaluating the Safeguards in Large Language Models
ACL 2024
Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming
AAAI 2024
UNIWIZ: A Unified Large Language Model Orchestrated Wizard for Safe Knowledge Grounded Conversations
ACL 2024
Realistic Evaluation of Toxicity in Large Language Models
ACL 2024
Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy
AAAI 2024
On the Hallucination in Simultaneous Machine Translation
ACL 2024
Emulated Disalignment: Safety Alignment for Large Language Models May Backfire!
ACL 2024
Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise
AAAI 2024
PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety
ACL 2024
ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs
ACL 2024