AI Safety

2972 directly classified papers

Papers

Do Large Language Models Reflect Demographic Pluralism in Safety? EACL 2026

Persistent Personas? Role-Playing, Instruction Following, and Safety in Extended Interactions EACL 2026

Antisocial Behavior Prediction: A Survey and Practical Guide EACL 2026

Repairing Regex Vulnerabilities via Localization-Guided Instructions EACL 2026

Jailbreaking Safeguarded Text-to-Image Models via Large Language Models EACL 2026

Breach in the Shield: Unveiling the Vulnerabilities of Large Language Models EACL 2026

BAFLE-DCT: Bypassing Adversarial Filters via Frequency-Selective Embedding in the DCT Domain WACV 2026

UltraClean: A Simple Framework to Train Robust Neural Networks against Backdoor Attacks WACV 2026

Understanding Jailbreak Success: A Study of Latent Space Dynamics in Large Language Models EACL 2026

When Can We Trust LLMs in Mental Health? Large-Scale Benchmarks for Reliable LLM Evaluation EACL 2026

Detection of Adversarial Prompts with Model Predictive Entropy EACL 2026

A Simple and Efficient Learning-Style Prompting for LLM Jailbreaking EACL 2026

Process Evaluation for Agentic Systems EACL 2026

Code-Switching as a Safety Failure Mode in Large Language Models: An Empirical Study of Roman Urdu across English, Mixed, and Transliteration-Only Inputs EACL 2026

Position: Biomedical NLP Demands Specialization, Not Generalization EACL 2026

Rethinking the Evaluation of Alignment Methods: Insights into Diversity, Generalisation, and Safety EACL 2026

Conformal Feedback Alignment: Quantifying Answer-Level Reliability for Robust LLM Alignment EACL 2026

MedRiskEval: Medical Risk Evaluation Benchmark of Language Models, On the Importance of User Perspectives in Healthcare Settings EACL 2026

Being Kind Isn’t Always Being Safe: Diagnosing Affective Hallucination in LLMs EACL 2026

VortexPIA: Indirect Prompt Injection Attack against LLMs for Efficient Extraction of User Privacy EACL 2026

The Unintended Trade-off of AI Alignment: Balancing Hallucination Mitigation and Safety in LLMs EACL 2026

Open-Domain Safety Policy Construction EACL 2026

Safeguarding Language Models via Self-Destruct Trapdoor EACL 2026

PATCH: Mitigating PII Leakage in Language Models with Privacy-Aware Targeted Circuit PatcHing EACL 2026

DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection EACL 2026