UniAudio 1.5: Large Language Model-Driven Audio Codec is A Few-Shot Audio Task Learner

Dongchao Yang; Haohan Guo; Yuanyuan Wang; Rongjie Huang; Xiang Li; Xu Tan; Xixin Wu; Helen Meng

NeurIPS 2024

Abstract

Large language models (LLMs) have demonstrated strong capabilities in textual understanding and generation, but cannot be directly applied to cross-modal tasks without fine-tuning. This paper proposes a cross-modal in-context learning approach that enables frozen LLMs to perform multiple audio tasks in a few-shot style without any parameter update. Specifically, we propose a novel LLM-driven audio codec model, LLM-Codec, which transfers the audio modality into the textual space by representing audio tokens with words or sub-words from the LLM vocabulary, while maintaining high audio reconstruction quality. The key idea is to reduce the modality heterogeneity between text and audio by compressing the audio modality into the well-trained textual space of LLMs. Thus, the audio representation can be viewed as a new "foreign language", and LLMs can learn this new "foreign language" from several demonstrations. In experiments, we investigate the performance of the proposed approach across multiple audio understanding and generation tasks, e.g., speech emotion classification, audio classification, text-to-speech generation, and speech enhancement. Experimental results show that LLMs equipped with the LLM-Codec, named UniAudio 1.5, prompted by only a few examples, can perform effectively in simple scenarios, validating our cross-modal in-context learning approach. To facilitate research on few-shot audio task learning and multi-modal LLMs, we have open-sourced the LLM-Codec model.
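
The idea of representing audio tokens as vocabulary words and then prompting a frozen LLM with a few demonstrations can be illustrated with a toy sketch. This is not the LLM-Codec itself, which is a trained codec model: the four-word vocabulary, the 2-D embeddings, the nearest-neighbour lookup, and the function names `quantize_to_words` and `build_few_shot_prompt` are all assumptions made purely for illustration.

```python
from math import dist

# Hypothetical stand-in for a frozen LLM vocabulary: word -> toy 2-D embedding.
VOCAB = {
    "cat": (0.0, 0.0),
    "river": (1.0, 0.0),
    "stone": (0.0, 1.0),
    "lamp": (1.0, 1.0),
}


def quantize_to_words(frames):
    """Map each audio frame vector to its nearest vocabulary word.

    Assumption: a plain nearest-neighbour lookup replaces the learned codec.
    """
    return [min(VOCAB, key=lambda w: dist(VOCAB[w], f)) for f in frames]


def build_few_shot_prompt(demos, query_frames, instruction):
    """Format (frames, label) demonstrations plus one query as a text prompt."""
    lines = [instruction]
    for frames, label in demos:
        lines.append("Input: " + " ".join(quantize_to_words(frames)))
        lines.append("Output: " + label)
    # The query is left open so that a frozen LLM would complete the label.
    lines.append("Input: " + " ".join(quantize_to_words(query_frames)))
    lines.append("Output:")
    return "\n".join(lines)


# Made-up frame vectors and labels, used only to show the prompt layout.
demos = [
    ([(0.1, 0.0), (0.9, 0.1)], "happy"),
    ([(0.1, 0.9), (0.8, 1.1)], "sad"),
]
prompt = build_few_shot_prompt(
    demos, [(0.2, 0.1), (1.0, 0.2)], "Classify the emotion of each input."
)
print(prompt)
```

The resulting string is ordinary text, so in this sketch it could be passed to any frozen text-only LLM without parameter updates; how well that works in practice depends on the actual codec, which this toy does not model.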

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Artificial Intelligence > Learning Paradigms > Few-Shot Learning
Speech & Audio > Synthesis > Text-to-Speech
Machine Learning > Learning Types > Few-Shot Learning
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Models > Large Language Models
Deep Learning > Learning Types > Multi-Modal Learning
Deep Learning > Learning Types > In-Context Learning

Keywords

few-shot learning; in-context learning; cross-modal learning; speech enhancement; audio codec; speech emotion classification; large language model; text-to-speech generation
