Safety

317 directly classified papers

Papers

Certification of Speaker Recognition Models to Additive Perturbations AAAI 2025

Investigating the Security Threat Arising from “Yes-No” Implicit Bias in Large Language Models AAAI 2025

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts AAAI 2025

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? AAAI 2025

NLSR: Neuron-Level Safety Realignment of Large Language Models Against Harmful Fine-Tuning AAAI 2025

Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction AAAI 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models AAAI 2025

Scaling Trends for Data Poisoning in LLMs AAAI 2025

Verification of Neural Networks Against Convolutional Perturbations via Parameterised Kernels AAAI 2025

LEGEND: Leveraging Representation Engineering to Annotate Safety Margin for Preference Datasets AAAI 2025

SMLE: Safe Machine Learning via Embedded Overapproximation AAAI 2025

Multi-Agent Security Tax: Trading Off Security and Collaboration Capabilities in Multi-Agent Systems AAAI 2025

SafetyPrompts: A Systematic Review of Open Datasets for Evaluating and Improving Large Language Model Safety AAAI 2025

MMJ-Bench: A Comprehensive Study on Jailbreak Attacks and Defenses for Vision Language Models AAAI 2025

Multimodal Pragmatic Jailbreak on Text-to-image Models ACL 2025

Evaluation of LLM Vulnerabilities to Being Misused for Personalized Disinformation Generation ACL 2025

Jailbreak Large Vision-Language Models Through Multi-Modal Linkage ACL 2025

MPO: Multilingual Safety Alignment via Reward Gap Optimization ACL 2025

Exploring the Impact of Instruction-Tuning on LLM’s Susceptibility to Misinformation ACL 2025

Do We Really Need Curated Malicious Data for Safety Alignment in Multi-modal Large Language Models? CVPR 2025

What Really Matters in Many-Shot Attacks? An Empirical Study of Long-Context Vulnerabilities in LLMs ACL 2025

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models ACL 2025

Combining Domain and Alignment Vectors Provides Better Knowledge-Safety Trade-offs in LLMs ACL 2025

Defensive Prompt Patch: A Robust and Generalizable Defense of Large Language Models against Jailbreak Attacks ACL 2025

PL-Guard: Benchmarking Language Model Safety for Polish ACL 2025