Exploring compressibility of transformer based text-to-music (TTM) models

Vasileios Moschopoulos; Thanasis Kotsiopoulos; Pablo Peso Parada; Konstantinos Nikiforidis; Alexandros Stergiadis; Gerasimos Papakostas; Md Asif Jalal; Jisi Zhang; Anastasios Drosou; Karthikeyan Saravanan

INTERSPEECH 2024

Abstract

State-of-the-art Text-To-Music (TTM) generative AI models are large and require desktop- or server-class compute, making them infeasible to deploy on mobile phones. This paper analyzes the trade-offs between model compression and generation performance in TTM models. We study compression through knowledge distillation, together with specific modifications that make it applicable to each component of the TTM model (the encoder, the generative model, and the decoder). Leveraging these methods, we create TinyTTM (89.2M params), which achieves a FAD of 3.66 and a KL of 1.32 on the MusicBench dataset. These scores are better than those of MusicGen-Small (557.6M params), but not lower than those of MusicGen-Small fine-tuned on MusicBench.
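As a rough illustration of the knowledge distillation idea mentioned in the abstract, the sketch below computes a generic temperature-scaled soft-target loss between a teacher and a student, plus the parameter ratio implied by the two model sizes quoted above. This is an assumption-laden sketch, not the paper's method: the abstract does not specify the loss, and the function names, the temperature value, and the example logits are all hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    Generic soft-target distillation; assumed for illustration only,
    not taken from the paper.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return kl * temperature ** 2

# Hypothetical logits over a tiny three-token vocabulary.
teacher = [2.0, 0.5, -1.0]
student = [1.5, 0.7, -0.5]
loss = distillation_loss(student, teacher)

# Parameter counts quoted in the abstract, in millions.
ratio = 557.6 / 89.2  # MusicGen-Small params / TinyTTM params
print(f"distillation loss: {loss:.4f}, size ratio: {ratio:.2f}x")
```

The loss is zero when the student reproduces the teacher's distribution exactly and grows as the two diverge; the size ratio shows that TinyTTM is roughly a sixth the size of MusicGen-Small.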

Topics

Artificial Intelligence > Core AI > Model Compression
Machine Learning > Application Areas > Knowledge Distillation
Deep Learning > Architectures > Transformers

Keywords

model compression, knowledge distillation, parameter efficiency, text-to-music generation

Related papers

Reshape Dimensions Network for Speaker Recognition 2024

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification 2024

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch 2024

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions 2024

K-means and hierarchical clustering of f0 contours 2024