INTERSPEECH 2023

Joint Time and Frequency Transformer for Chinese Opera Classification

Abstract

Transformers have recently gained attention and are widely used in audio tasks. Most approaches compute attention either directly over the entire time-frequency space or only along the temporal dimension. This paper presents a joint time and frequency model for Chinese opera classification. A shallow convolutional block extracts localized low-level semantic features and reduces the feature map size. Criss-cross attention and factorised self-attention are then employed to extract representations over the time and frequency space. Experimental results demonstrate that the proposed model achieves state-of-the-art performance on a large Chinese opera dataset with fewer model parameters.
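
As a rough illustration of what factorised self-attention over time and frequency can look like, the sketch below attends along the time axis for each frequency bin and then along the frequency axis for each time frame. It is a minimal NumPy sketch, not the paper's implementation: the function names (`axis_attention`, `factorised_attention`), the `(T, F, D)` feature layout, the time-then-frequency ordering, and the use of Q = K = V with no learned projections are all assumptions made for brevity. The convolutional block and criss-cross attention are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def axis_attention(x):
    """Scaled dot-product self-attention along the second-to-last axis.

    x has shape (..., N, D). For simplicity Q = K = V = x
    (an assumption: a real model would use learned projections).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., N, N)
    return softmax(scores, axis=-1) @ x               # (..., N, D)

def factorised_attention(feat):
    """feat: (T, F, D) feature map. Attend over time, then over frequency."""
    # Time attention: one sequence of length T per frequency bin.
    t_out = axis_attention(feat.transpose(1, 0, 2)).transpose(1, 0, 2)
    # Frequency attention: one sequence of length F per time frame.
    return axis_attention(t_out)

# Hypothetical sizes: 8 time frames, 16 frequency bins, 4 channels.
rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 4))
out = factorised_attention(feat)
print(out.shape)
```

Under these assumptions, each factorised pass builds score matrices of size T×T or F×F rather than one (T·F)×(T·F) matrix, so the number of attention scores drops from (T·F)² to T·F·(T+F).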


Authors