Make-A-Voice: Revisiting Voice Large Language Models as Scalable Multilingual and Multitask Learners

Rongjie Huang; Chunlei Zhang; Yongqi Wang; Dongchao Yang; Jinchuan Tian; Zhenhui Ye; Luping Liu; Zehan Wang; Ziyue Jiang; Xuankai Chang; Jiatong Shi; Chao Weng; Zhou Zhao; Dong Yu

ACL 2024

Abstract

Large language models (LLMs) have successfully served as a general-purpose interface across multiple tasks and languages, whereas voice LLMs have mostly been designed for specific purposes (either single-task or monolingual), so the advantages of LLMs, especially for low-resource language processing and zero-shot task generalization, remain underexploited in the audio community. To bridge the gap, we introduce Make-A-Voice, a multi-modal voice LLM, and conduct a comprehensive study of its capability to handle multiple tasks and languages. When trained on ~200K hours of 6-language data for 4 voice generation applications, Make-A-Voice exhibits notable advantages: 1) as a scalable learner, improving performance with end-to-end local and global multiscale transformers; 2) as a multitask learner, adjusting prompts to share common knowledge across modalities (speech/singing) and showing in-context learning abilities by generalizing to unseen tasks it was not explicitly trained on; and 3) as a multilingual learner, alleviating data scarcity in low-resource languages by including rich-resource language training data. Experimental results demonstrate that Make-A-Voice exhibits superior audio quality and style similarity compared with competitive baseline models in monolingual and cross-lingual voice generation. Audio samples are available at https://M-Voice.github.io

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Natural Language Processing > Resources & Methods > Large Language Models
Natural Language Processing > Resources & Methods > Multilingual NLP
Speech & Audio > Synthesis > Text-to-Speech
Artificial Intelligence > Core AI > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning

Keywords

zero-shot learning; speech synthesis; multimodal learning; multitask learning; multilingual learning; voice generation; voice large language model
