WhiSPA: Semantically and Psychologically Aligned Whisper with Self-Supervised Contrastive and Student-Teacher Learning

Rajath Rao; Adithya V Ganesan; Oscar Kjell; Jonah Luby; Akshay Raghavan; Scott Feltman; Whitney Ringwald; Ryan L. Boyd; Benjamin Luft; Camilo Ruggero; Neville Ryant; Roman Kotov; H. Andrew Schwartz

ACL 2025

Abstract

Current speech encoding pipelines often rely on an additional text-based LM to get robust representations of human communication, even though SotA speech-to-text models often have an LM within. This work proposes an approach to improve the LM within an audio model such that the subsequent text LM is unnecessary. We introduce **WhiSPA** (**Whi**sper with **S**emantic and **P**sychological **A**lignment), which leverages a novel audio training objective: contrastive loss with a language model embedding as a teacher. Using over 500k speech segments from mental health audio interviews, we evaluate the utility of aligning Whisper’s latent space with semantic representations from a text autoencoder (SBERT) and lexically derived embeddings of basic psychological dimensions: emotion and personality. Across self-supervised affective tasks and downstream psychological tasks, WhiSPA surpasses current speech encoders, achieving an average error reduction of 73.4% and 83.8%, respectively. WhiSPA demonstrates that it is not always necessary to run a subsequent text LM on speech-to-text output in order to get a rich psychological representation of human communication.
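The abstract names the training objective (a contrastive loss with a language model embedding as teacher) but does not give its formula. As a rough illustration only, the sketch below uses a generic InfoNCE-style loss, which is an assumption and may differ from the paper's exact objective: each student (audio) embedding is pulled toward the teacher (text) embedding of the same segment and pushed away from the teacher embeddings of other segments in the batch. The function names, the temperature value, and the toy vectors are all hypothetical.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length (assumes v is not the zero vector).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def contrastive_alignment_loss(student, teacher, temperature=0.1):
    """Generic InfoNCE-style loss (an assumed stand-in, not the paper's code).

    student[i] and teacher[i] are embeddings of the same speech segment;
    every teacher[j] with j != i acts as a negative for student[i].
    """
    s = [l2_normalize(v) for v in student]
    t = [l2_normalize(v) for v in teacher]
    total = 0.0
    for i, s_i in enumerate(s):
        # Temperature-scaled cosine similarity to every teacher embedding.
        logits = [sum(a * b for a, b in zip(s_i, t_j)) / temperature for t_j in t]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        # Cross-entropy with the matching teacher as the correct class.
        total += log_denominator - logits[i]
    return total / len(s)


# Toy 2-D embeddings standing in for audio (student) and text (teacher) vectors.
audio = [[1.0, 0.1], [0.0, 1.0]]
text = [[1.0, 0.0], [0.1, 1.0]]

aligned = contrastive_alignment_loss(audio, text)
shuffled = contrastive_alignment_loss(audio, text[::-1])
```

With correctly paired embeddings the loss is close to zero, while reversing the teacher order (so that each audio vector is paired with the wrong text vector) yields a much larger value. In a real setup the teacher embeddings would be held fixed and only the audio encoder updated.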

Topics

Machine Learning > Core Methods > Representation Learning
Machine Learning > Learning Types > Contrastive Learning
Interdisciplinary > Social > Affective Computing

Keywords

contrastive learning, self-supervised learning, speech recognition, affective computing, semantic embedding
