LaRA: Large Rank Adaptation for Speech and Text Cross-Modal Learning in Large Language Models

Zuhair Hasan Shaik; Pradyoth Hegde; Prashant Bannulmath; Deepak K T

EMNLP 2024

Abstract

Integrating speech and text capabilities into large language models (LLMs) is challenging. We present Large Rank Adaptation (LaRA) for effective cross-modal integration of speech and text within the LLM framework. Unlike conventional LoRA, our method requires significantly larger ranks, comparable to those of the pretrained weights, to accommodate the complexity of speech-text cross-modal learning. The approach uses HuBERT to convert speech into discrete tokens and fine-tunes the pretrained LLM to handle cross-modal inputs and outputs. A HiFi-GAN vocoder synthesizes speech waveforms from the generated speech units. Our initial studies use the LibriSpeech corpus to teach the model the relationships between speech and text, and DailyTalk, a corpus of dialog conversations, to adapt the model for interaction. The work demonstrates adaptation for spoken and text conversations, and the framework can be extended to other cross-modal applications.
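
To make the contrast with conventional LoRA concrete, the following is a minimal sketch of a standard LoRA-style adapter, y = W x + (alpha / r) B (A x), together with its trainable-parameter count as the rank r grows. The abstract does not state LaRA's actual ranks, scaling, or target layers, so the layer width `d`, the rank values, and all function names below are illustrative assumptions rather than the authors' implementation.

```python
import random


def adapter_param_count(d_in, d_out, rank):
    """Trainable parameters of a LoRA-style adapter.

    A has shape (rank x d_in) and B has shape (d_out x rank).
    """
    return rank * (d_in + d_out)


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def adapted_forward(W, A, B, x, alpha):
    """Compute y = W x + (alpha / rank) * B (A x).

    W is the frozen pretrained weight; only A and B would be trained.
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def make_adapter(d_in, d_out, rank, seed=0):
    """Standard LoRA initialisation: small random A, zero B."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero B makes the adapter a no-op at start
    return A, B


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, not taken from the paper
    full = d * d
    for rank in (8, 64, 1024, 2048):  # illustrative ranks only
        count = adapter_param_count(d, d, rank)
        print(f"rank={rank:5d}  adapter params={count:10d}  share of full matrix={count / full:.4f}")
```

For a square d x d weight, the adapter holds r * 2d parameters, so it matches the full matrix's parameter count at r = d / 2. This is one way to read "ranks comparable to the pretrained weights": at such ranks the adapter no longer offers the parameter savings of small-rank LoRA, in exchange for more capacity to absorb the new modality.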

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Application Areas > Model Merging; Speech & Audio > Recognition > Speech Recognition; Deep Learning > Learning Types > Transfer Learning

Keywords

speech synthesis; speech recognition; cross-modal learning; low-rank adaptation; parameter-efficient fine-tuning; speech tokenization; vocoder synthesis; large language model
