EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

Kai Chen; Yunhao Gou; Runhui Huang; Zhili Liu; Daxin Tan; Jing Xu; Chunwei Wang; Yi Zhu; Yihan Zeng; Kuo Yang; Dingdong Wang; Kun Xiang; Haoyuan Li; Haoli Bai; Jianhua Han; Xiaohui Li; Weike Jin; Nian Xie; Yu Zhang; James T. Kwok; Hengshuang Zhao; Xiaodan Liang; Dit-Yan Yeung; Xiao Chen; Zhenguo Li; Wei Zhang; Qun Liu; Lanqing Hong; Lu Hou; Hang Xu

CVPR 2025

Abstract

GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models have limited or no vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech abilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we find, surprisingly, that omni-modal alignment can further enhance vision-language and speech abilities compared with the bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible control of speech style, including emotion and pitch. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while also supporting omni-modal spoken dialogue with vivid emotions.

Topics

Artificial Intelligence > Core AI > Foundation Models; Artificial Intelligence > Core AI > Multimodal Learning; Deep Learning > Models > Foundation Models

Keywords

speech synthesis; speech processing; multimodal learning; emotion recognition; foundation model; vision-language model; spoken dialogue; vision language; omni-modal model; emotion generation
