Controlling formant frequencies with neural text-to-speech for the manipulation of perceived speaker age

Ziya Khan; Lovisa Wihlborg; Cassia Valentini-Botinhao; Oliver Watts

INTERSPEECH 2023

Abstract

In this paper, we present a framework for formant-controllable neural text-to-speech. We train a model that predicts formant frequencies, which then condition mel-spectrogram generation. We use this framework to manipulate perceived speaker age indirectly, by modifying the predicted formants in a way that changes perceived vocal tract length. Our ultimate goal is to enable control of perceived ageing in children's text-to-speech voices, since ageing in natural child speech is strongly linked to the growth of the child's vocal tract. Our experiments indicate, however, that the method also provides strong age control for adult speech.
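To illustrate the link between formants and vocal tract length, the following is a minimal sketch and not the method used in the paper. It assumes the textbook uniform-tube approximation, in which a tube closed at one end resonates at F_n = (2n - 1)c / (4L), so that lengthening the tract by some ratio lowers every formant by the same ratio. The function names, the speed-of-sound constant, and the uniform scaling rule are all assumptions made for this example; the abstract does not specify how the predicted formants are modified.

```python
import numpy as np

# Assumed constant: approximate speed of sound in warm, humid air (cm/s).
SPEED_OF_SOUND_CM_S = 35000.0


def uniform_tube_formants(vtl_cm, n_formants=3):
    """Resonances of a uniform tube closed at one end.

    Implements F_n = (2n - 1) * c / (4 * L), a textbook approximation
    of a neutral vocal tract of length L (hypothetical helper).
    """
    n = np.arange(1, n_formants + 1)
    return (2 * n - 1) * SPEED_OF_SOUND_CM_S / (4.0 * vtl_cm)


def scale_formants(formants_hz, vtl_ratio):
    """Rescale formant tracks for a tract `vtl_ratio` times as long.

    Under the uniform-tube assumption all formants scale by 1 / vtl_ratio.
    `formants_hz` may be a (frames, n_formants) array; frames stored as 0
    (for example, unvoiced frames) remain 0.
    """
    if vtl_ratio <= 0:
        raise ValueError("vtl_ratio must be positive")
    return np.asarray(formants_hz, dtype=float) / vtl_ratio


# A 17.5 cm tract gives formants near 500, 1500 and 2500 Hz.
adult = uniform_tube_formants(17.5)

# A tract 1.25 times as long lowers them to about 400, 1200 and 2000 Hz.
longer = scale_formants(adult, 1.25)
```

In a formant-conditioned system such as the one described above, a rescaled track like `longer` would replace the predicted formants before mel-spectrogram generation. A ratio below 1 would raise the formants, which corresponds to a shorter vocal tract.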
