Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt

Yongqi Wang; Ruofan Hu; Rongjie Huang; Zhiqing Hong; Ruiqi Li; Wenrui Liu; Fuming You; Tao Jin; Zhou Zhao

NAACL 2024

Abstract

Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they cannot explicitly control the style attributes of the synthesized singing. We propose Prompt-Singer, the first SVS method that enables control over singer gender, vocal range, and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while preserving melodic accuracy. Furthermore, to facilitate further research, we explore various experimental settings, including different types of text representations, text encoder fine-tuning, and the introduction of speech data to alleviate data scarcity. Experiments show that our model achieves favorable control ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
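
The abstract does not spell out how the range-melody decoupled pitch representation is built. As a rough illustration of the idea only, the Python sketch below assumes that vocal range can be summarized by the mean F0 of the voiced frames and that melody can be kept as the contour rescaled to a fixed reference mean. The function names, the 220 Hz reference, and the convention that 0.0 marks an unvoiced frame are all assumptions made for this example, not details taken from the paper.

```python
def decouple_pitch(f0_hz, ref_hz=220.0):
    """Split an F0 contour into (range, melody) under the assumptions above.

    f0_hz  : list of per-frame F0 values in Hz; 0.0 marks an unvoiced frame.
    ref_hz : hypothetical reference mean that the melody contour is rescaled to.
    Returns (mean_voiced_hz, melody), where melody has voiced mean ref_hz.
    """
    voiced = [f for f in f0_hz if f > 0.0]
    if not voiced:
        # Nothing voiced: there is no range to extract.
        return 0.0, list(f0_hz)
    mean_hz = sum(voiced) / len(voiced)
    scale = ref_hz / mean_hz
    # Multiplicative rescaling preserves frequency ratios (musical intervals).
    melody = [f * scale if f > 0.0 else 0.0 for f in f0_hz]
    return mean_hz, melody


def recouple_pitch(target_mean_hz, melody, ref_hz=220.0):
    """Shift a range-free melody contour to a requested vocal range."""
    scale = target_mean_hz / ref_hz
    return [f * scale if f > 0.0 else 0.0 for f in melody]


if __name__ == "__main__":
    contour = [0.0, 200.0, 400.0, 0.0]
    vocal_range, melody = decouple_pitch(contour)
    # Re-synthesize the same melody one octave lower than the original mean.
    lowered = recouple_pitch(vocal_range / 2.0, melody)
    print(vocal_range, melody, lowered)
```

Under this reading, a text prompt such as "a low male voice" would only need to influence the single range value, while the interval structure of the melody stays untouched.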

Topics

Machine Learning > Application Areas > Domain Adaptation; Deep Learning > Architectures > Transformers; Deep Learning > Models > Generative Models

Keywords

multilingual nlp, singing voice synthesis, controllable synthesis, attribute control, decoder-only transformer, natural language prompt, pitch representation, vocal range control
