Safety

317 directly classified papers

Papers

GuardBench: A Large-Scale Benchmark for Guardrail Models EMNLP 2024

Please note that I’m just an AI: Analysis of Behavior Patterns of LLMs in (Non-)offensive Speech Identification EMNLP 2024

From LLMs to MLLMs: Exploring the Landscape of Multimodal Jailbreaking EMNLP 2024

CoSafe: Evaluating Large Language Model Safety in Multi-Turn Dialogue Coreference EMNLP 2024

Distract Large Language Models for Automatic Jailbreak Attack EMNLP 2024

MLLM-Protector: Ensuring MLLM’s Safety without Hurting Performance EMNLP 2024

BaitAttack: Alleviating Intention Shift in Jailbreak Attacks via Adaptive Bait Crafting EMNLP 2024

Holistic Automated Red Teaming for Large Language Models through Top-Down Test Case Generation and Multi-turn Interaction EMNLP 2024

MedSafetyBench: Evaluating and Improving the Medical Safety of Large Language Models NeurIPS 2024

Defensive Unlearning with Adversarial Training for Robust Concept Erasure in Diffusion Models NeurIPS 2024

Fishing for Magikarp: Automatically Detecting Under-trained Tokens in Large Language Models EMNLP 2024

Red Teaming Language Models for Processing Contradictory Dialogues EMNLP 2024

Towards Probing Speech-Specific Risks in Large Multimodal Models: A Taxonomy, Benchmark, and Insights EMNLP 2024

Towards Understanding Jailbreak Attacks in LLMs: A Representation Space Analysis EMNLP 2024

Is Safer Better? The Impact of Guardrails on the Argumentative Strength of LLMs in Hate Speech Countering EMNLP 2024

ASETF: A Novel Method for Jailbreak Attack on LLMs through Translate Suffix Embeddings EMNLP 2024

VeriCompress: A Tool to Streamline the Synthesis of Verified Robust Compressed Neural Networks from Scratch AAAI 2024

Dataset and Lessons Learned from the 2024 SaTML LLM Capture-the-Flag Competition NeurIPS 2024

Mission Impossible: A Statistical Perspective on Jailbreaking LLMs NeurIPS 2024

Simplifying Constraint Inference with Inverse Reinforcement Learning NeurIPS 2024

Safe Reinforcement Learning with Instantaneous Constraints: The Role of Aggressive Exploration AAAI 2024

The Art of Saying No: Contextual Noncompliance in Language Models NeurIPS 2024

Unelicitable Backdoors via Cryptographic Transformer Circuits NeurIPS 2024

JailbreakBench: An Open Robustness Benchmark for Jailbreaking Large Language Models NeurIPS 2024

ColJailBreak: Collaborative Generation and Editing for Jailbreaking Text-to-Image Deep Generation NeurIPS 2024