FastPitchFormant: Source-Filter Based Decomposed Modeling for Speech Synthesis

Taejun Bak; Jae-Sung Bae; Hanbin Bae; Young-Ik Kim; Hoon-Young Cho

INTERSPEECH 2021

Abstract

Methods for modeling and controlling prosody with acoustic features have been proposed for neural text-to-speech (TTS) models. Prosodic speech can be generated by conditioning on acoustic features. However, speech synthesized with a large pitch-shift scale suffers from degraded audio quality and deformed speaker characteristics. To address this problem, we propose a feed-forward Transformer-based TTS model designed according to source-filter theory. This model, called FastPitchFormant, has a unique structure that handles text and acoustic features in parallel. Modeling each feature separately mitigates the model's tendency to learn the relationship between the two features. Owing to its structural characteristics, FastPitchFormant is robust and accurate for pitch control and generates prosodic speech that preserves speaker characteristics. The experimental results show that the proposed model outperforms the baseline FastPitch.
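As background for the source-filter theory the abstract refers to, the following is a minimal signal-level sketch of the classical idea: a periodic source sets the pitch, and a separate resonant filter sets the formant. It is not the FastPitchFormant architecture, which the abstract does not specify in detail. All function names, the 16 kHz sample rate, and the single 800 Hz formant with 100 Hz bandwidth are assumptions chosen for illustration.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz; assumed value for this illustration


def impulse_train(f0_hz, n_samples, sample_rate=SAMPLE_RATE):
    """Source: an impulse train whose spacing sets the fundamental frequency."""
    period = int(round(sample_rate / f0_hz))
    source = np.zeros(n_samples)
    source[::period] = 1.0
    return source


def resonator_impulse_response(formant_hz, bandwidth_hz, n_taps=512,
                               sample_rate=SAMPLE_RATE):
    """Filter: a decaying sinusoid, i.e. a single formant-like resonance."""
    t = np.arange(n_taps) / sample_rate
    return np.exp(-np.pi * bandwidth_hz * t) * np.sin(2 * np.pi * formant_hz * t)


def synthesize(f0_hz, formant_hz, n_samples=16000):
    """Convolve the source with the filter; pitch and formant stay independent."""
    source = impulse_train(f0_hz, n_samples)
    filt = resonator_impulse_response(formant_hz, bandwidth_hz=100.0)
    return np.convolve(source, filt)[:n_samples]


def strongest_frequency_hz(signal, sample_rate=SAMPLE_RATE):
    """Frequency of the largest magnitude-spectrum peak."""
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    return float(freqs[np.argmax(spectrum)])


# Doubling the pitch changes the source only; the filter is untouched.
low_pitch = synthesize(f0_hz=100.0, formant_hz=800.0)
high_pitch = synthesize(f0_hz=200.0, formant_hz=800.0)
print(strongest_frequency_hz(low_pitch), strongest_frequency_hz(high_pitch))
```

In this toy setting the spectral peak stays near the formant frequency at both pitches, because the filter is never modified when the pitch changes. This is the kind of separation between pitch and spectral envelope that a source-filter based design aims for.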

Topics

Deep Learning > Architectures > Transformers
Speech & Audio > Synthesis > Text-to-Speech

Keywords

speech synthesis, text-to-speech synthesis, speech generation, prosody modeling, pitch control, source-filter theory, source-filter model
