Safety

317 directly classified papers

Papers

Can Watermarking Large Language Models Prevent Copyrighted Text Generation and Hide Training Data? AAAI 2025

LLMs Caught in the Crossfire: Malware Requests and Jailbreak Challenges ACL 2025

LongSafety: Evaluating Long-Context Safety of Large Language Models ACL 2025

Unintended Harms of Value-Aligned LLMs: Psychological and Empirical Insights ACL 2025

SafeInfer: Context Adaptive Decoding Time Safety Alignment for Large Language Models AAAI 2025

Guardrails and Security for LLMs: Safe, Secure and Controllable Steering of LLM Applications ACL 2025

Scaling Trends for Data Poisoning in LLMs AAAI 2025

SafeRoute: Adaptive Model Selection for Efficient and Accurate Safety Guardrails in Large Language Models ACL 2025

Quantitative Predictive Monitoring and Control for Safe Human-Machine Interaction AAAI 2025

Rewrite to Jailbreak: Discover Learnable and Transferable Implicit Harmfulness Instruction ACL 2025

Tell Me What You Don’t Know: Enhancing Refusal Capabilities of Role-Playing Agents via Representation Space Analysis and Editing ACL 2025

Why Not Act on What You Know? Unleashing Safety Potential of LLMs via Self-Aware Guard Enhancement ACL 2025

Beware of Your Po! Measuring and Mitigating AI Safety Risks in Role-Play Fine-Tuning of LLMs ACL 2025

Chinese SafetyQA: A Safety Short-form Factuality Benchmark for Large Language Models ACL 2025

FigStep: Jailbreaking Large Vision-Language Models via Typographic Visual Prompts AAAI 2025

Scalable Surrogate Verification of Image-Based Neural Network Control Systems Using Composition and Unrolling AAAI 2025

Enhancing Robustness in Incremental Learning with Adversarial Training AAAI 2025

PROSAC: Provably Safe Certification for Machine Learning Models under Adversarial Attacks AAAI 2025

Rethinking Byzantine Robustness in Federated Recommendation from Sparse Aggregation Perspective AAAI 2025

Shield Synthesis for LTL Modulo Theories AAAI 2025

Leveraging Constraint Violation Signals for Action Constrained Reinforcement Learning AAAI 2025

Constraint-Adaptive Policy Switching for Offline Safe Reinforcement Learning AAAI 2025

Probabilistic Shielding for Safe Reinforcement Learning AAAI 2025

Offline Safe Reinforcement Learning Using Trajectory Classification AAAI 2025

Defense Against Prompt Injection Attack by Leveraging Attack Techniques ACL 2025