INTERSPEECH 2022

A study on constraining Connectionist Temporal Classification for temporal audio alignment

Abstract

Connectionist Temporal Classification (CTC) has become a standard for deep learning-based temporal alignment, as it allows relevant probability distributions to be learned. However, CTC is by nature a transcription objective: it can be minimized without guaranteeing any alignment properties. This work studies several constraints that help CTC generate alignments. Using a fully convolutional architecture coupled with multi-head attention, we investigate the task of phonetic alignment for clean speech and singing signals. We focus on the impact of additional losses, namely spectral envelope reconstruction, temporal structure invariance and guided monotony. Results show that, once the losses are scaled to have identical temporal dependence, combining all of these constraints produces the best performance.
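To illustrate why CTC is a transcription objective rather than an alignment objective, the following minimal sketch computes the CTC probability of a label sequence with the standard forward recursion and checks it against a brute-force sum over every frame-level path. It is not the architecture or any of the additional losses studied in this work; the blank index, the toy posteriors `probs`, and all function names are assumptions made for illustration only.

```python
import itertools
import math

BLANK = 0  # index of the CTC blank symbol (an assumption of this sketch)


def collapse(path):
    """Map a frame-level path to a label sequence: merge repeats, drop blanks."""
    out = []
    prev = None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return tuple(out)


def ctc_prob_forward(probs, target):
    """P(target | probs) via the CTC forward recursion.

    probs:  list of T per-frame distributions over symbols.
    target: sequence of non-blank symbol indices.
    """
    # Extended target: blanks around and between the labels.
    ext = [BLANK]
    for s in target:
        ext += [s, BLANK]
    n = len(ext)
    alpha = [0.0] * n
    alpha[0] = probs[0][ext[0]]
    if n > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * n
        for i in range(n):
            total = alpha[i]
            if i >= 1:
                total += alpha[i - 1]
            # Skipping a blank is allowed only between two different labels.
            if i >= 2 and ext[i] != BLANK and ext[i] != ext[i - 2]:
                total += alpha[i - 2]
            new[i] = total * probs[t][ext[i]]
        alpha = new
    return alpha[-1] + (alpha[-2] if n > 1 else 0.0)


def ctc_prob_bruteforce(probs, target):
    """Same quantity, summed explicitly over every possible alignment."""
    n_symbols = len(probs[0])
    total = 0.0
    for path in itertools.product(range(n_symbols), repeat=len(probs)):
        if collapse(path) == tuple(target):
            total += math.prod(probs[t][s] for t, s in enumerate(path))
    return total


# Toy posteriors: 4 frames, symbols {0: blank, 1, 2} (made-up numbers).
probs = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.3, 0.2, 0.5],
    [0.5, 0.1, 0.4],
]
target = [1, 2]
ctc_loss = -math.log(ctc_prob_forward(probs, target))
```

Because paths such as `(1, 1, 0, 2)` and `(0, 1, 2, 2)` collapse to the same label sequence, the loss only rewards the total probability mass over all of them; it does not indicate which path matches the true phone boundaries, which is the gap that additional constraints are meant to close.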
