2024 INTERSPEECH INTERSPEECH 2024

Enhancing Speech and Music Discrimination Through the Integration of Static and Dynamic Features

Abstract

Audio is inherently temporal data, where features extracted from each segment evolve over time, yielding dynamic traits. These dynamics, relative to the acoustic characteristics inherent in raw audio features, primarily serve as complementary aids for audio classification. This paper employs the reservoir computing model to fit the audio feature sequences efficiently, capturing feature-sequence dynamics into the readout models, and without the need for offline iterative training. Additionally, stacked autoencoders further integrate the extracted static features (i.e., raw audio features) with the captured dynamics, resulting in more stable and effective classification performance. The entire framework is called Static-Dynamic Integration Network (SDIN). The conducted experiments demonstrate the effectiveness of SDIN in speech-music classification tasks.

🌉 Interdisciplinary Bridge — Deep Learning and Machine Learning
🧭 Keyword Pioneer — static-dynamic integration
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Machine Learning, Mathematics & Optimization, Natural Language Processing, Security & Privacy, Speech & Audio