Safety

317 directly classified papers

Papers

Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning AAAI 2025

COMMIT: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks AAAI 2025

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts AAAI 2025

Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models AAAI 2025

Certification of Speaker Recognition Models to Additive Perturbations AAAI 2025

Scaling Trends for Data Poisoning in LLMs AAAI 2025

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage ACL 2025

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks AAAI 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

Leveraging Constraint Violation Signals for Action Constrained Reinforcement Learning AAAI 2025

Rethinking Byzantine Robustness in Federated Recommendation from Sparse Aggregation Perspective AAAI 2025

Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction AAAI 2025

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI 2025

SMLE: Safe Machine Learning via Embedded Overapproximation AAAI 2025

Refuse Whenever You Feel Unsafe: Improving Safety in LLMs via Decoupled Refusal Training ACL 2025

LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets AAAI 2025

Verification of Neural Networks Against Convolutional Perturbations via Parameterised Kernels AAAI 2025

Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems AAAI 2025

Stepwise Reasoning Disruption Attack of LLMs ACL 2025

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models AAAI 2025

SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety AAAI 2025

Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? CVPR 2025

Enhancing Robustness in Incremental Learning with Adversarial Training AAAI 2025

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? AAAI 2025

Contrastive Perplexity for Controlled Generation: An Application in Detoxifying Large Language Models ACL 2025