J-ToneNet: A Transformer-based Encoding Network for Improving Tone Classification in Continuous Speech via F0 Sequences

Yi-Fen Liu; Xiang-Li Lu

INTERSPEECH 2023

Abstract

Current tone classification studies mainly train classifiers on intrinsic features of isolated segments, most often syllables. Most of this work does not rely on fundamental frequency (f0) alone but draws on additional information such as spectrograms, MFCCs, or energy to improve model accuracy. However, the greater challenge in tone classification lies in modeling the complex f0 variations that arise from tonal coarticulation and from the interactions among tones in continuous speech. To tackle this issue, we first restrict the input to the sequence of f0 samples in the speech utterance. In addition, we propose a transformer-based network with an extendable BERT input architecture and a joint learning technique to consolidate the contour representations of consecutive tones. By leveraging or fusing additional information affected by speech rhythm in the utterance, the experiments show that the proposed J-ToneNet is very robust for read speech.
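To make the idea of an f0-only, utterance-level input more concrete, here is a minimal sketch of how such a sequence might be prepared. It is not the authors' pipeline: the per-utterance log-f0 z-scoring, the treatment of unvoiced frames, the `[SEP]` separator, and the function names `normalize_f0` and `build_input` are all assumptions chosen for illustration of a BERT-like layout.

```python
import math


def normalize_f0(f0_hz):
    """Convert an utterance's f0 track (Hz) to z-scored log-f0.

    Assumption: unvoiced frames are marked by f0 <= 0. They are returned
    as None so that a downstream model could mask them.
    """
    voiced = [math.log(f) for f in f0_hz if f > 0]
    if not voiced:
        return [None] * len(f0_hz)
    mean = sum(voiced) / len(voiced)
    var = sum((v - mean) ** 2 for v in voiced) / len(voiced)
    std = math.sqrt(var) or 1.0  # fall back to 1.0 for a flat contour
    return [(math.log(f) - mean) / std if f > 0 else None for f in f0_hz]


def build_input(syllable_f0, sep="[SEP]"):
    """Flatten per-syllable f0 segments into one utterance-level sequence.

    Returns the token list and a parallel list of segment ids, one id per
    syllable (a hypothetical BERT-like layout, not the paper's exact one).
    """
    tokens, segment_ids = [], []
    for idx, segment in enumerate(syllable_f0):
        for value in segment:
            tokens.append(value)
            segment_ids.append(idx)
        tokens.append(sep)  # separator closes each syllable segment
        segment_ids.append(idx)
    return tokens, segment_ids
```

Under these assumptions, consecutive tones share one input sequence, so an encoder with self-attention could in principle attend across syllable boundaries where coarticulation effects appear.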
