Unsupervised Concept Vector Extraction for Bias Control in LLMs

Hannah Cyberey; Yangfeng Ji; David Evans

EMNLP 2025

Abstract

Large language models (LLMs) are known to perpetuate stereotypes and exhibit biases. Various strategies have been proposed to mitigate these biases, but most work treats bias as a black-box problem without considering how concepts are represented within the model. We adapt techniques from representation engineering to study how the concept of “gender” is represented within LLMs. We introduce a new method that extracts concept representations via probability weighting without labeled data and efficiently selects a steering vector for measuring and manipulating the model’s representation. We also develop a projection-based method that enables precise steering of model predictions. We demonstrate its effectiveness in mitigating gender bias in LLMs and show that it generalizes to racial bias.
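The abstract names two ingredients, probability-weighted concept extraction and projection-based steering, without giving formulas. The sketch below is one plausible reading under stated assumptions, not the authors' implementation: the function names `extract_concept_vector` and `project_out`, the use of a probability gap as the per-prompt weight, and the toy two-dimensional hidden states are all hypothetical choices made for illustration.

```python
import math


def extract_concept_vector(hidden_states, p_concept_a, p_concept_b):
    """Probability-weighted difference of means (illustrative assumption).

    hidden_states: list of equal-length float lists, one per unlabeled prompt.
    p_concept_a / p_concept_b: the model's own output probability mass on
    tokens associated with each side of the concept (e.g. two gender groups).
    No human labels are used; the probabilities act as soft labels.
    """
    dim = len(hidden_states[0])
    sum_a, sum_b = [0.0] * dim, [0.0] * dim
    weight_a = weight_b = 0.0
    for h, pa, pb in zip(hidden_states, p_concept_a, p_concept_b):
        score = pa - pb  # > 0 leans toward side A, < 0 toward side B
        if score > 0:
            weight_a += score
            for i in range(dim):
                sum_a[i] += score * h[i]
        elif score < 0:
            weight_b += -score
            for i in range(dim):
                sum_b[i] += -score * h[i]
    if weight_a == 0.0 or weight_b == 0.0:
        raise ValueError("need prompts leaning toward both sides of the concept")
    # Difference between the two weighted means, normalized to unit length.
    diff = [sum_a[i] / weight_a - sum_b[i] / weight_b for i in range(dim)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm == 0.0:
        raise ValueError("weighted means coincide; no direction to extract")
    return [d / norm for d in diff]


def project_out(hidden_state, unit_vector):
    """Remove the component of hidden_state along unit_vector."""
    coeff = sum(h * v for h, v in zip(hidden_state, unit_vector))
    return [h - coeff * v for h, v in zip(hidden_state, unit_vector)]


if __name__ == "__main__":
    # Toy data: two prompts whose hidden states differ along the first axis.
    states = [[2.0, 1.0], [-2.0, 1.0]]
    vector = extract_concept_vector(states, [0.9, 0.1], [0.1, 0.9])
    print(vector)
    print(project_out([3.0, 5.0], vector))
```

In a real model the hidden states would come from a chosen layer and the projection would be applied during the forward pass. Which layer to use and how the steering vector is selected are details the abstract does not specify.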

Topics

Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Responsible AI
Artificial Intelligence > Core AI > Large Language Models
Artificial Intelligence > Core AI > Fairness
Machine Learning > Core Methods > Representation Learning
Machine Learning > Application Areas > Fairness
Deep Learning > Techniques > Representation Learning

Keywords

bias mitigation, representation engineering, concept extraction, steering vector, model debiasing, gender bias, gender bias mitigation, concept vector, large language model, concept vector extraction
