AI Safety

2972 directly classified papers

Papers

Failures to Surface Harmful Contents in Video Large Language Models AAAI 2026

Reference Recommendation Based Membership Inference Attack Against Hybrid-Based Recommender Systems AAAI 2026

Activation Manipulation Attack: Penetrating and Harmful Jailbreak Attack Against Large Vision-Language Models AAAI 2026

FILTER: A Framework for Defending Against Backdoor Attacks in Vertical Federated Learning AAAI 2026

Higher-Order Responsibility AAAI 2026

SceneJailEval: A Scenario-Adaptive Multi-Dimensional Framework for Jailbreak Evaluation AAAI 2026

Model-Agnostic Sentiment Distribution Stability Analysis for Robust LLM-Generated Texts Detection AAAI 2026

IS-Bench: Evaluating Interactive Safety of VLM-Driven Embodied Agents in Daily Household Tasks AAAI 2026

Learning Vision-Based Neural Network Controllers with Semi-Probabilistic Safety Guarantees AAAI 2026

Dynamic Deep Prompt Optimization for Defending Against Jailbreak Attacks on LLMs AAAI 2026

Efficient Verification and Falsification of ReLU Neural Barrier Certificates AAAI 2026

Probing Semantic Insensitivity for Inference-Time Backdoor Defense in Multimodal Large Language Model AAAI 2026

Joint-GCG: Unified Gradient-Based Poisoning Attacks on Retrieval-Augmented Generation Systems AAAI 2026

MCPTox: A Benchmark for Tool Poisoning on Real-World MCP Servers AAAI 2026

ConfGuard: A Simple and Effective Backdoor Detection for Large Language Models AAAI 2026

MPMA: Preference Manipulation Attack Against Model Context Protocol AAAI 2026

AntiDote: Bi-level Adversarial Training for Tamper-Resistant LLMs AAAI 2026

Efficient Hallucination Detection: Adaptive Bayesian Estimation of Semantic Entropy with Guided Semantic Exploration AAAI 2026

SafetyReminder: Reviving Delayed Safety Awareness of Vision-Language Models to Defend Against Jailbreak Attacks AAAI 2026

Mitigating Content Effects on Reasoning in Language Models Through Fine-Grained Activation Steering AAAI 2026

When Truth Is Overridden: Uncovering the Internal Origins of Sycophancy in Large Language Models AAAI 2026

Hallucination as a Computational Boundary: A Hierarchy of Inevitability and the Oracle Escape AAAI 2026

Multi-Value Alignment for LLMs via Value Decorrelation and Extrapolation AAAI 2026

Benchmarking and Enhancing Rule Knowledge-Driven Reasoning of Large Language Models AAAI 2026

Test-time Prompt Intervention AAAI 2026