Safety

317 directly classified papers

Papers

Pure-Past Action Masking AAAI 2024

Safe-Embed: Unveiling the Safety-Critical Knowledge of Sentence Encoders ACL 2024

TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification ACL 2024

Reward Certification for Policy Smoothed Reinforcement Learning AAAI 2024

CodeAttack: Revealing Safety Generalization Challenges of Large Language Models via Code Completion ACL 2024

Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions ACL 2024

Accelerating Adversarially Robust Model Selection for Deep Neural Networks via Racing AAAI 2024

Athena: Safe Autonomous Agents with Verbal Contrastive Learning EMNLP 2024

Making Harmful Behaviors Unlearnable for Large Language Models ACL 2024

SpeechGuard: Exploring the Adversarial Robustness of Multi-modal Large Language Models ACL 2024

DeepBern-Nets: Taming the Complexity of Certifying Neural Networks Using Bernstein Polynomial Activations and Precise Bound Propagation AAAI 2024

All Languages Matter: On the Multilingual Safety of LLMs ACL 2024

Subtle Signatures, Strong Shields: Advancing Robust and Imperceptible Watermarking in Large Language Models ACL 2024

Balance Reward and Safety Optimization for Safe Reinforcement Learning: A Perspective of Gradient Manipulation AAAI 2024

SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models ACL 2024

A Chinese Dataset for Evaluating the Safeguards in Large Language Models ACL 2024

Constrained Meta-Reinforcement Learning for Adaptable Safety Guarantee with Differentiable Convex Programming AAAI 2024

UNIWIZ: A Unified Large Language Model Orchestrated Wizard for Safe Knowledge Grounded Conversations ACL 2024

Realistic Evaluation of Toxicity in Large Language Models ACL 2024

Watermarking Conditional Text Generation for AI Detection: Unveiling Challenges and a Semantic-Aware Watermark Remedy AAAI 2024

On the Hallucination in Simultaneous Machine Translation ACL 2024

Emulated Disalignment: Safety Alignment for Large Language Models May Backfire! ACL 2024

Stable Unlearnable Example: Enhancing the Robustness of Unlearnable Examples via Stable Error-Minimizing Noise AAAI 2024

PsySafe: A Comprehensive Framework for Psychological-based Attack, Defense, and Evaluation of Multi-agent System Safety ACL 2024

ArtPrompt: ASCII Art-based Jailbreak Attacks against Aligned LLMs ACL 2024