AI Safety

2972 directly classified papers

Papers

LLM BiasScope: A Real-Time Bias Analysis Platform for Comparative LLM Evaluation EACL 2026

When Prompt Optimization Becomes Jailbreaking: Adaptive Red-Teaming of Large Language Models EACL 2026

Medical Summarization in Practice: Design, Deployment, and Analysis of a Clinical Summarization System for a German Hospital EACL 2026

Assertion-Conditioned Compliance: A Provenance-Aware Vulnerability in Multi-Turn Tool-Calling Agents EACL 2026

Unleashing the Unseen: Harnessing Benign Datasets for Jailbreaking Large Language Models EACL 2026

BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain WACV 2026

UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks WACV 2026

Teams of LLM Agents can Exploit Zero-Day Vulnerabilities EACL 2026

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models EACL 2026

Attacker’s Noise Can Manipulate Your Audio-based LLM in the Real World EACL 2026

CRADLE Bench: A Clinician-Annotated Benchmark for Multi-Faceted Mental Health Crisis and Safety Risk Detection EACL 2026

Unraveling LLM Jailbreaks Through Safety Knowledge Neurons EACL 2026

Layer-wise Swapping for Generalizable Multilingual Safety EACL 2026

Safe-Unsafe Concept Separation Emerges from a Single Direction in Language Models Activation Space EACL 2026

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models EACL 2026

ToxiPrompt: A Two-Stage Red-Teaming Approach for Balancing Adversarial Prompt Diversity and Response Toxicity EACL 2026

FaithLM: Towards Faithful Explanations for Large Language Models EACL 2026

Attribution-Guided Multi-Object Hallucination and Bias Detection in Vision-Language Models EACL 2026

ARREST: Adversarial Resilient Regulation Enhancing Safety and Truth in Large Language Models EACL 2026

Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions EACL 2026

Beyond Names: How Grammatical Gender Markers Bias LLM-based Educational Recommendations EACL 2026

RefusalBench: Generative Evaluation of Selective Refusal in Grounded Language Models EACL 2026

Safeguarding Language Models via Self-Destruct Trapdoor EACL 2026

How Robust Are Router-LLMs? Analysis of the Fragility of LLM Routing Capabilities EACL 2026

Detection of Adversarial Prompts with Model Predictive Entropy EACL 2026