Research Explorer
AI Safety
2972 directly classified papers
Papers per year
2002: 1
2006: 1
2007: 1
2012: 4
2013: 1
2015: 5
2016: 1
2017: 13
2018: 40
2019: 91
2020: 111
2021: 181
2022: 204
2023: 333
2024: 642
2025: 1031
2026: 312
Papers
PrivSV: Differentially Private Steering Vector for Large Language Models
AAAI 2026
ShieldRAG: Safeguarding Retrieval-Augmented Generation from Untrusted Knowledge Bases
AAAI 2026
AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin
AAAI 2026
SPA: Achieving Consensus in LLM Alignment via Self-Priority Optimization
AAAI 2026
Bootstrapping LLMs via Preference-Based Policy Optimization
AAAI 2026
Backdooring Rationalization
AAAI 2026
Reinforce Trustworthiness in Multimodal Emotional Support System
AAAI 2026
ENCORE: Entropy-guided Reward Composition for Multi-head Safety Reward Models
AAAI 2026
LoopLLM: Transferable Energy-Latency Attacks in LLMs via Repetitive Generation
AAAI 2026
Safe RAG by RAG: Untying the Bell That RAG Rang with the RAG Hand
AAAI 2026
Query-Routed Activation Editing with Truth-hierarchical Preference Optimization
AAAI 2026
Hidden in the Noise: Unveiling Backdoors in Audio LLMs Alignment Through Latent Acoustic Pattern Triggers
AAAI 2026
SafeNLIDB: A Privacy-Preserving Safety Alignment Framework for LLM-based Natural Language Database Interfaces
AAAI 2026
BadThink: Triggered Overthinking Attacks on Chain-of-Thought Reasoning in Large Language Models
AAAI 2026
Hallucinate Less by Thinking More: Aspect-Based Causal Abstention for Large Language Models
AAAI 2026
WALKSAFE: Risk-aware Graph Random Walk with Bi-GRPO for LLM Safety
AAAI 2026
SOM Directions Are Better than One: Multi-Directional Refusal Suppression in Language Models
AAAI 2026
MAJIC: Markovian Adaptive Jailbreaking via Iterative Composition of Diverse Innovative Strategies
AAAI 2026
Mental Model-based Generation of Lies for Insider Threat Modeling
AAAI 2026
W2S-AlignTree: Weak-to-Strong Inference-Time Alignment for Large Language Models via Monte Carlo Tree Search
AAAI 2026
Control Illusion: The Failure of Instruction Hierarchies in Large Language Models
AAAI 2026
FaithLM: Towards Faithful Explanations for Large Language Models
EACL 2026
DUP: Detection-guided Unlearning for Backdoor Purification in Language Models
AAAI 2026
Model Editing as a Double-Edged Sword: Steering Agent Behavior Toward Beneficence or Harm
AAAI 2026
Breaking the Stealth-Potency Trade-off in Clean-Image Backdoors with Generative Trigger Optimization
AAAI 2026