Phoneme-to-Audio Alignment with Recurrent Neural Networks for Speaking and Singing Voice

Yann Teytaut; Axel Roebel

INTERSPEECH 2021

Abstract

Phoneme-to-audio alignment is the task of synchronizing voice recordings with their phonetic transcripts. In this work, we introduce a new system for forced phonetic alignment based on Recurrent Neural Networks (RNNs). Using the Connectionist Temporal Classification (CTC) loss as the training objective, together with an additional reconstruction cost, we learn to infer per-frame phoneme probabilities from which the alignment is derived. The core of the neural architecture is a context-aware attention mechanism between mel-spectrograms and side information. We investigate two contexts, given either by phoneme sequences (model PhAtt) or by the spectrograms themselves (model SpAtt). Evaluations show that these models produce precise alignments for both speaking and singing voice. The best results are obtained with PhAtt, which outperforms the baseline reference with an average imprecision of 16.3 ms on speech and 29.8 ms on singing. SpAtt is also an interesting alternative, capable of aligning longer audio files without requiring phoneme sequences on small audio segments.
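To make the step from per-frame phoneme probabilities to an alignment concrete, here is a minimal sketch of one common way to do it: Viterbi decoding over the standard CTC topology (blank, phoneme, blank, ...). This is an illustration under stated assumptions, not the authors' implementation; the abstract does not specify the decoding procedure. The function names `ctc_forced_align` and `onset_frames`, the toy posterior matrix, and the choice of index 0 for the blank symbol are all hypothetical.

```python
import numpy as np

def ctc_forced_align(log_probs, targets, blank=0):
    """Most likely frame-level label path for `targets` under a CTC topology.

    log_probs: (T, C) array of per-frame log-probabilities (assumed input).
    targets:   list of phoneme ids, without blanks.
    """
    T = log_probs.shape[0]
    # Extended sequence: blank, p1, blank, p2, ..., blank
    ext = [blank]
    for p in targets:
        ext += [p, blank]
    S = len(ext)
    score = np.full((T, S), -np.inf)
    back = np.zeros((T, S), dtype=int)
    # A path may start in the leading blank or in the first phoneme.
    score[0, 0] = log_probs[0, ext[0]]
    if S > 1:
        score[0, 1] = log_probs[0, ext[1]]
    for t in range(1, T):
        for s in range(S):
            cands = [(score[t - 1, s], s)]              # stay in same state
            if s >= 1:
                cands.append((score[t - 1, s - 1], s - 1))  # advance by one
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                cands.append((score[t - 1, s - 2], s - 2))  # skip a blank
            best, arg = max(cands)
            score[t, s] = best + log_probs[t, ext[s]]
            back[t, s] = arg
    # A path may end in the trailing blank or in the last phoneme.
    s = S - 1
    if S > 1 and score[T - 1, S - 2] > score[T - 1, S - 1]:
        s = S - 2
    states = [s]
    for t in range(T - 1, 0, -1):
        s = back[t, s]
        states.append(s)
    states.reverse()
    return [ext[s] for s in states]

def onset_frames(path, blank=0):
    """Frame index at which each non-blank segment of the path begins."""
    onsets, prev = [], blank
    for t, label in enumerate(path):
        if label != blank and label != prev:
            onsets.append((label, t))
        prev = label
    return onsets

# Toy posteriors: 6 frames, 3 classes (0 = blank, 1 and 2 = phonemes).
favoured = [1, 1, 0, 2, 2, 0]
probs = np.full((6, 3), 0.05)
probs[np.arange(6), favoured] = 0.9
log_probs = np.log(probs)

path = ctc_forced_align(log_probs, [1, 2])
print(path)                # [1, 1, 0, 2, 2, 0]
print(onset_frames(path))  # [(1, 0), (2, 3)]
```

Multiplying each onset frame by the hop duration of the spectrogram gives onset times in seconds; the hop size depends on the front end and is not stated in the abstract.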

Topics

Machine Learning > Core Methods > Representation Learning
Deep Learning > Architectures > Transformers
Deep Learning > Architectures > Neural Networks

Keywords

attention mechanism; connectionist temporal classification; recurrent neural network; phoneme alignment; phoneme-to-audio alignment
