Enhancing Speech and Music Discrimination Through the Integration of Static and Dynamic Features

Liangwei Chen; Xiren Zhou; Qiang Tu; Huanhuan Chen

INTERSPEECH 2024

Abstract

Audio is inherently temporal data: the features extracted from each segment evolve over time, giving rise to dynamic traits. Relative to the acoustic characteristics inherent in raw audio features, these dynamics serve primarily as a complementary aid for audio classification. This paper employs a reservoir computing model to fit audio feature sequences efficiently, capturing the feature-sequence dynamics in the readout models without the need for offline iterative training. Stacked autoencoders then integrate the extracted static features (i.e., raw audio features) with the captured dynamics, resulting in more stable and effective classification performance. The entire framework is called the Static-Dynamic Integration Network (SDIN). Experiments demonstrate the effectiveness of SDIN in speech-music classification tasks.
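
The idea of capturing sequence dynamics in a readout model can be illustrated with a small echo state network. The following is a minimal sketch under stated assumptions, not the SDIN implementation: the reservoir size, spectral radius, ridge penalty, the one-step-ahead prediction target, the use of the per-sequence mean as the "static" feature, and all function names are illustrative choices that the abstract does not specify. The readout is solved in closed form by ridge regression, which is one common way to avoid iterative training. The stacked-autoencoder fusion is omitted; plain concatenation stands in for it here.

```python
import numpy as np

def make_reservoir(n_inputs, n_units=50, spectral_radius=0.9, seed=0):
    """Build fixed random input and recurrent weights (never trained)."""
    rng = np.random.default_rng(seed)
    w_in = rng.uniform(-0.5, 0.5, size=(n_units, n_inputs))
    w = rng.uniform(-0.5, 0.5, size=(n_units, n_units))
    # Rescale so the largest eigenvalue magnitude equals spectral_radius.
    w *= spectral_radius / np.max(np.abs(np.linalg.eigvals(w)))
    return w_in, w

def readout_dynamics(seq, w_in, w, ridge=1e-2):
    """Fit a ridge readout that predicts frame t+1 from the reservoir
    state at frame t, and return the flattened readout weights as a
    fixed-length description of the sequence dynamics (an assumption)."""
    n_units = w.shape[0]
    state = np.zeros(n_units)
    states = []
    for frame in seq[:-1]:
        state = np.tanh(w_in @ frame + w @ state)
        states.append(state)
    X = np.asarray(states)   # (T-1, n_units) reservoir states
    Y = seq[1:]              # (T-1, n_inputs) next-frame targets
    # Closed-form ridge regression: no iterative optimisation involved.
    A = X.T @ X + ridge * np.eye(n_units)
    w_out = np.linalg.solve(A, X.T @ Y)
    return w_out.ravel()

# Synthetic stand-in for a sequence of 100 frames of 13-dim audio features.
seq = np.random.default_rng(1).normal(size=(100, 13))
w_in, w = make_reservoir(n_inputs=13)

static = seq.mean(axis=0)                   # hypothetical static feature
dynamic = readout_dynamics(seq, w_in, w)    # dynamics held in the readout
combined = np.concatenate([static, dynamic])
print(static.shape, dynamic.shape, combined.shape)
```

Because the reservoir weights are fixed and shared across sequences, readout weights fitted to different clips live in the same space and can be compared or passed to a downstream classifier.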

Topics

Machine Learning > Core Methods > Representation Learning

Machine Learning > Optimization & Theory > Stochastic Processes

Deep Learning > Architectures > Autoencoders

Keywords

reservoir computing; temporal feature; stacked autoencoder; static-dynamic integration; speech-music discrimination

Related papers

Reshape Dimensions Network for Speaker Recognition (2024)

RevRIR: Joint Reverberant Speech and Room Impulse Response Embedding using Contrastive Learning with Application to Room Shape Classification (2024)

Mixed Children/Adult/Childrenized Fine-Tuning for Children’s ASR: How to Reduce Age Mismatch and Speaking Style Mismatch (2024)

Exploring Speech Foundation Models for Speaker Diarization in Child-Adult Dyadic Interactions (2024)

K-means and hierarchical clustering of f0 contours (2024)