Rainbow-Teaming for the Polish Language: A Reproducibility Study

Aleksandra Krasnodebska; Maciej Chrabaszcz; Wojciech Kusa

2025 NAACL NAACL 2025

Rainbow-Teaming for the Polish Language: A Reproducibility Study

Abstract

AbstractThe development of multilingual large language models (LLMs) presents challenges in evaluating their safety across all supported languages. Enhancing safety in one language (e.g., English) may inadvertently introduce vulnerabilities in others. To address this issue, we implement a methodology for the automatic creation of red-teaming datasets for safety evaluation in Polish language. Our approach generates both harmful and non-harmful prompts by sampling different risk categories and attack styles. We test several open-source models, including those trained on Polish data, and evaluate them using metrics such as Attack Success Rate (ASR) and False Reject Rate (FRR). The results reveal clear gaps in safety performance between models and show that better testing across languages is needed.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Natural Language Processing

🧭 Keyword Pioneer — red teaming dataset

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio