Towards Speech Classification from Acoustic and Vocal Tract data in Real-time MRI

Yaoyao Yue; Michael Proctor; Luping Zhou; Rijul Gupta; Tharinda Piyadasa; Amelia Gully; Kirrie Ballard; Craig Jin

INTERSPEECH 2024

Abstract

Real-time magnetic resonance imaging (rtMRI) data of the upper airway provides a rich source of information about vocal tract shaping that can inform phonemic analysis and classification. We describe a multimodal phonemic classifier that combines articulatory data with speech audio features to improve performance. A deep network model processes rtMRI video data using ResNet18 and speech audio using a custom CNN, then combines the two data streams in a Transformer layer, whose multi-head self-attention mechanism exploits the correlation between the streams to improve vowel-consonant-vowel classification. The classification accuracies of both the unimodal and multimodal models show substantial improvement over previous work (> 38%). Adding audio features improves classification accuracy in the multimodal model by 7% compared with the unimodal model using articulatory data alone. We analyze the model and discuss the phonetic implications.
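The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of multi-head self-attention over one video token and one audio token. This is not the authors' implementation. The embedding size, number of heads, number of classes, the single-token-per-modality layout, the residual-plus-mean pooling, and the random weights are all assumptions made for illustration. The embeddings stand in for the outputs of the ResNet18 and audio CNN encoders, which are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, num_heads):
    """tokens: (seq_len, d_model). Returns (output, attention weights)."""
    seq_len, d_model = tokens.shape
    d_head = d_model // num_heads

    def split(x):
        # (seq_len, d_model) -> (num_heads, seq_len, d_head)
        return x.reshape(seq_len, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(tokens @ w_q), split(tokens @ w_k), split(tokens @ w_v)
    # Scaled dot-product scores per head: (num_heads, seq_len, seq_len)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    attn = softmax(scores, axis=-1)
    context = attn @ v  # (num_heads, seq_len, d_head)
    merged = context.transpose(1, 0, 2).reshape(seq_len, d_model)
    return merged @ w_o, attn

def init_params(d_model, num_classes, seed=0):
    # Hypothetical random weights; a trained model would learn these.
    rng = np.random.default_rng(seed)
    scale = 1.0 / np.sqrt(d_model)
    params = {name: rng.normal(0.0, scale, (d_model, d_model))
              for name in ("w_q", "w_k", "w_v", "w_o")}
    params["w_cls"] = rng.normal(0.0, scale, (d_model, num_classes))
    return params

def fuse_and_classify(video_emb, audio_emb, params, num_heads=4):
    # Treat each modality embedding as one token in a length-2 sequence,
    # so each token can attend to itself and to the other modality.
    tokens = np.stack([video_emb, audio_emb])  # (2, d_model)
    attended, attn = multi_head_self_attention(
        tokens, params["w_q"], params["w_k"], params["w_v"],
        params["w_o"], num_heads)
    pooled = (tokens + attended).mean(axis=0)  # residual, then mean pool
    return softmax(pooled @ params["w_cls"]), attn

# Assumed sizes, chosen only for this sketch.
D_MODEL, NUM_CLASSES = 64, 10
params = init_params(D_MODEL, NUM_CLASSES)
rng = np.random.default_rng(1)
video_emb = rng.normal(size=D_MODEL)  # stand-in for a ResNet18 embedding
audio_emb = rng.normal(size=D_MODEL)  # stand-in for an audio CNN embedding
probs, attn = fuse_and_classify(video_emb, audio_emb, params)
```

Here `attn[h, i, j]` is the weight that head `h` assigns to token `j` when updating token `i`, so the off-diagonal entries show how strongly each modality attends to the other.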

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks

Keywords

multi-head self-attention; real-time magnetic resonance imaging; articulatory data; phonemic classification; multimodal classifier
