Preprocessing for acoustic-to-articulatory inversion using real-time MRI movies of Japanese speech

Anna Oura; Hideaki Kikuchi; Tetsunori Kobayashi

INTERSPEECH 2024

Abstract

Acoustic-to-articulatory inversion (AAI) estimates articulatory movements from acoustic speech signals. Traditional AAI relies on indirect estimation using articulatory models. However, recent work has proposed using machine learning models to directly output real-time MRI (rtMRI) movies. This study applied the existing model to rtMRI movies of Japanese speech to test whether the devised preprocessing methods enable highly accurate estimation. The preprocessing consists of normalization of face alignment and filtering to remove extraneous regions. For objective evaluation, we measured the complex wavelet structural similarity (CW-SSIM). The results indicate that combining the normalization and filtering processes produces smooth rtMRI movies that closely resemble the originals (average CW-SSIM: LSTM, 0.795; BLSTM, 0.793), demonstrating the effectiveness of the preprocessing.
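
As an illustration of the evaluation metric named in the abstract, the following is a minimal sketch of the standard CW-SSIM index for one pair of complex wavelet coefficient sets. It is not the authors' implementation. The function name `cw_ssim_index` and the stabilizing constant `k` are assumptions made here for illustration, and the coefficients are assumed to come from a complex wavelet decomposition (for example a complex steerable pyramid) computed elsewhere. A full CW-SSIM score would typically average this index over local windows and subbands.

```python
import numpy as np


def cw_ssim_index(cx, cy, k=0.01):
    """CW-SSIM index for two sets of complex wavelet coefficients.

    cx, cy: complex coefficients taken from the same spatial window and
            subband of the two images being compared (assumed precomputed).
    k:      small non-negative constant that stabilizes the ratio when the
            local energy is low (hypothetical default, chosen for this sketch).
    """
    cx = np.asarray(cx, dtype=complex).ravel()
    cy = np.asarray(cy, dtype=complex).ravel()
    # Magnitude of the complex cross-correlation between the two windows.
    cross = np.abs(np.sum(cx * np.conj(cy)))
    # Total energy of both windows.
    energy = np.sum(np.abs(cx) ** 2) + np.sum(np.abs(cy) ** 2)
    return float((2.0 * cross + k) / (energy + k))


# Synthetic coefficients standing in for one 7x7 window of one subband.
rng = np.random.default_rng(0)
coeffs = rng.normal(size=(7, 7)) + 1j * rng.normal(size=(7, 7))

same = cw_ssim_index(coeffs, coeffs)
# A constant phase rotation of all coefficients leaves the index unchanged.
rotated = cw_ssim_index(coeffs, coeffs * np.exp(1j * 0.3))
```

Because the numerator uses only the magnitude of the cross-correlation, a consistent phase shift across all coefficients does not lower the index. This property makes CW-SSIM tolerant of small translations between the estimated and original frames.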

Topics

Machine Learning > Core Methods > Regression
Machine Learning > Core Methods > Representation Learning

Keywords

deep learning model; articulatory movement; acoustic-to-articulatory inversion; real-time MRI; wavelet structural similarity; preprocessing; normalization
