Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation
EACL 2026
When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models
EACL 2026
Medical Summarization in Practice: Design, Deployment, and Analysis of a Clinical Summarization System for a German Hospital
EACL 2026
Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents
EACL 2026
Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models
EACL 2026
BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain
WACV 2026
UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks
WACV 2026
Teams of LLM Agents can Exploit Zero-Day Vulnerabilities
EACL 2026
Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models
EACL 2026
Attacker’s Noise Can Manipulate Your Audio-based LLM in the Real World
EACL 2026
CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection
EACL 2026
Unraveling LLM Jailbreaks Through Safety Knowledge Neurons
EACL 2026
Layer-wise Swapping for Generalizable Multilingual Safety
EACL 2026
Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space
EACL 2026
Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models
EACL 2026
ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity
EACL 2026
FaithLM: Towards Faithful Explanations for Large Language Models
EACL 2026
Attribution-Guided Multi-Object Hallucination and Bias Detection in Vision-Language Models
EACL 2026
ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models
EACL 2026
Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions
EACL 2026
Beyond Names: How Grammatical Gender Markers Bias LLM-based Educational Recommendations
EACL 2026
RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models
EACL 2026
Safeguarding Language Models via Self-Destruct Trapdoor
EACL 2026
How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities
EACL 2026
Detection of Adversarial Prompts with Model Predictive Entropy
EACL 2026