EACL 2026

Engagement Undermines Safety: How Stereotypes and Toxicity Shape Humor in Language Models

Abstract

Large language models are increasingly used for creative writing and engagement content, raising safety concerns about their outputs. Using humor generation as a testbed, this work evaluates how funniness optimization in modern LLM pipelines couples with harmful content by jointly measuring humor, stereotypicality, and toxicity. We further supplement this by analyzing incongruity signals through information-theoretic metrics. Across six models, we observe that even for fixed neutral setups, harmful outputs receive higher humor scores, indicating a bias amplification loop between generators and evaluators. Information-theoretic analyses show that harmful cues widen predictive uncertainty and, surprisingly, can even make harmful punchlines more expected for some models, suggesting intrinsic structural embedding in learned humor distributions. Experiments and human evaluation on an additional satire-generation task with human-perceived funniness judgments show that LLM funniness relies on increased stereotypicality and toxicity, including for closed models. Quantitatively, stereotypical/toxic jokes gain 10%–21% in mean humor score, stereotypical jokes appear 11%–28% more often among the jokes marked funny by an LLM-based metric, and up to 10% more often in generations perceived as funny by humans.
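The abstract does not spell out which information-theoretic metrics are used. As a rough illustration only, the sketch below assumes the two standard quantities such analyses often rely on: Shannon entropy of a next-token distribution (a proxy for predictive uncertainty) and surprisal of the observed token (a proxy for how unexpected a punchline is). The function names and the toy distributions are hypothetical and are not taken from the paper.

```python
import math


def entropy_bits(probs):
    """Shannon entropy in bits of a next-token distribution.

    Higher values mean the model is less certain about what comes next.
    """
    return -sum(p * math.log2(p) for p in probs if p > 0)


def surprisal_bits(probs, index):
    """Surprisal in bits of the token at `index`.

    Lower values mean the model found that token more expected.
    """
    return -math.log2(probs[index])


# Hypothetical distributions over four candidate punchline tokens.
# These numbers are made up for illustration, not measured results.
peaked = [0.5, 0.25, 0.125, 0.125]   # model is fairly confident
flat = [0.25, 0.25, 0.25, 0.25]      # model is maximally uncertain

peaked_entropy = entropy_bits(peaked)
flat_entropy = entropy_bits(flat)
peaked_surprisal = surprisal_bits(peaked, 0)
flat_surprisal = surprisal_bits(flat, 0)

print(f"entropy:   peaked={peaked_entropy:.2f}  flat={flat_entropy:.2f}")
print(f"surprisal: peaked={peaked_surprisal:.2f}  flat={flat_surprisal:.2f}")
```

In a real pipeline the distributions would come from a model's softmax output at the punchline position. Comparing entropy and surprisal between a neutral setup and one containing a harmful cue is one plausible way to probe the kind of effect the abstract describes.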
