Sequence-to-sequence Neural Network Model with 2D Attention for Learning Japanese Pitch Accents

Antoine Bruguier; Heiga Zen; Arkady Arkhangorodsky

INTERSPEECH 2018

Abstract

Many Japanese text-to-speech (TTS) systems use word-level pitch accents as one of their prosodic features. To predict pitch accents from text, these systems often combine a pronunciation dictionary that includes lexical pitch accents with a statistical model of word accent sandhi. However, relying on human transcribers to build the dictionary and the training data for the model is tedious and expensive. This paper proposes a neural pitch accent recognition model. The model combines information from audio and its transcription (a word sequence in hiragana characters) via two-dimensional attention and outputs word-level pitch accents. Experimental results show a reduction in the word pitch accent prediction error rate compared with using text only. The model thus reduces the load on human annotators when building a pronunciation dictionary. Because the approach is general, it can also be used for pronunciation learning in other languages.
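To make the idea of attending jointly over audio and text more concrete, the following is a minimal sketch of one possible form of two-dimensional attention. It is an illustration under stated assumptions, not the architecture from the paper: the trilinear scoring function, the function names (`softmax_2d`, `attend_2d`) and the toy vectors are all hypothetical. The sketch scores every (audio frame, character) pair, normalizes over the whole grid, and returns a context vector built from both inputs.

```python
import math


def softmax_2d(scores):
    """Normalize a T x J grid of scores so that all cells together sum to 1."""
    peak = max(max(row) for row in scores)
    exps = [[math.exp(s - peak) for s in row] for row in scores]
    total = sum(sum(row) for row in exps)
    return [[e / total for e in row] for row in exps]


def attend_2d(query, audio_states, text_states):
    """Attend jointly over audio frames (rows) and text characters (columns).

    Assumption: cell (t, j) is scored with a trilinear product of the query,
    audio frame t and character j. This scoring is a placeholder chosen for
    illustration only.
    """
    scores = [
        [sum(q * a * c for q, a, c in zip(query, frame, char))
         for char in text_states]
        for frame in audio_states
    ]
    weights = softmax_2d(scores)

    dim = len(query)
    context = [0.0] * (2 * dim)
    for t, frame in enumerate(audio_states):
        for j, char in enumerate(text_states):
            w = weights[t][j]
            # The context is the weighted sum of [audio frame ; character].
            for k in range(dim):
                context[k] += w * frame[k]
                context[dim + k] += w * char[k]
    return weights, context


# Toy inputs: 3 audio frames and 2 hiragana characters, each a 2-dim vector.
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text = [[1.0, 0.0], [0.0, 1.0]]
weights, context = attend_2d([0.0, 0.0], audio, text)
```

With an all-zero query every cell scores the same, so the weights are uniform over the 3 x 2 grid and the context reduces to the mean audio frame concatenated with the mean character vector. In a trained model the query would come from the decoder state, and the weights would concentrate on the audio region and characters relevant to the word whose accent is being predicted.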

Topics

Speech & Audio > Synthesis > Text-to-Speech
Speech & Audio > Analysis > Prosody Analysis

Keywords

pitch accent, prosody prediction, sequence-to-sequence model, neural network, Japanese pitch accent, two-dimensional attention
