Safety

317 directly classified papers

Papers

Hyperbolic Safety-Aware Vision-Language Models CVPR 2025

SAFENUDGE: Safeguarding Large Language Models in Real-time with Tunable Safety-Performance Trade-offs EMNLP 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs ACL 2025

Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks ACL 2025

Hallucination Detection in LLMs Using Spectral Features of Attention Maps EMNLP 2025

AdaSteer: Your Aligned LLM is Inherently an Adaptive Jailbreak Defender EMNLP 2025

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications ACL 2025

Leaky Thoughts: Large Reasoning Models Are Not Private Thinkers EMNLP 2025

Reasoning-to-Defend: Safety-Aware Reasoning Can Defend Large Language Models from Jailbreaking EMNLP 2025

Are Vision-Language Models Safe in the Wild? A Meme-Based Benchmark Study EMNLP 2025

Automating Steering for Safe Multimodal Large Language Models EMNLP 2025

VSCBench: Bridging the Gap in Vision-Language Model Safety Calibration ACL 2025

How to Fine-Tune Safely on a Budget: Model Adaptation Using Minimal Resources EMNLP 2025

sudoLLM: On Multi-role Alignment of Language Models EMNLP 2025

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation ACL 2025

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage ACL 2025

Root Defense Strategies: Ensuring Safety of LLM at the Decoding Level ACL 2025

What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs ACL 2025

Rescorla-Wagner Steering of LLMs for Undesired Behaviors over Disproportionate Inappropriate Context EMNLP 2025

Monitoring of Perception Systems: Deterministic, Probabilistic, and Learning-Based Fault Detection and Identification (Abstract Reprint) AAAI 2024

Towards Trustworthy Deep Learning AAAI 2024

A Huber Loss Minimization Approach to Byzantine Robust Federated Learning AAAI 2024

GaLileo: General Linear Relaxation Framework for Tightening Robustness Certification of Transformers AAAI 2024

Long-Term Safe Reinforcement Learning with Binary Feedback AAAI 2024

Pure-Past Action Masking AAAI 2024