Siri On-Device Deep Learning-Guided Unit Selection Text-to-Speech System

Tim Capes; Paul Coles; Alistair Conkie; Ladan Golipour; Abie Hadjitarkhani; Qiong Hu; Nancy Huddleston; Melvyn Hunt; Jiangchuan Li; Matthias Neeracher; Kishore Prahallad; Tuomo Raitio; Ramya Rasipuram; Greg Townsend; Becci Williamson; David Winarsky; Zhizheng Wu; Hepeng Zhang

INTERSPEECH 2017

Abstract

This paper describes Apple’s hybrid unit selection speech synthesis system, which provides the voices for Siri and is required to deliver naturalness, personality, and expressivity. It has been deployed on hundreds of millions of desktop and mobile devices (e.g. iPhone, iPad, and Mac) via iOS and macOS in multiple languages. The system follows the classical unit selection framework and uses deep learning techniques to boost performance. In particular, deep and recurrent mixture density networks predict the target and concatenation reference distributions for the respective costs during unit selection. In this paper, we present an overview of the run-time TTS engine and the voice building process. We also describe techniques that enable on-device operation, such as preselection optimization, caching for low latency, and unit pruning for a small footprint, as well as techniques that improve the naturalness and expressivity of the voice, such as the use of long units.
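
To make the role of the predicted distributions concrete, the following is a minimal sketch, not the system's actual implementation. It assumes that each target position comes with a single 1-D Gaussian (mean, variance) standing in for a mixture density network output, that the concatenation reference is one Gaussian over the feature difference between adjacent units, and that the best sequence is found with a Viterbi-style dynamic-programming search, as is common in classical unit selection. The names `gaussian_nll`, `select_units`, `w_target`, and `w_join` are hypothetical.

```python
import math


def gaussian_nll(x, mean, var):
    """Negative log-likelihood of x under a 1-D Gaussian (used as a cost)."""
    return 0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def select_units(candidates, target_dists, join_dist, w_target=1.0, w_join=1.0):
    """Pick one unit per position by minimizing target + concatenation cost.

    candidates[t]   : list of (unit_id, feature) pairs for position t
    target_dists[t] : assumed (mean, var) predicted for position t
    join_dist       : assumed (mean, var) of the feature difference
                      between two adjacent units
    """
    # best[t][i] = (cost of the cheapest path ending in candidate i, backpointer)
    best = [[(w_target * gaussian_nll(f, *target_dists[0]), None)
             for _, f in candidates[0]]]

    for t in range(1, len(candidates)):
        row = []
        for _, feat in candidates[t]:
            target_cost = w_target * gaussian_nll(feat, *target_dists[t])
            # Cheapest predecessor, including the concatenation cost.
            prev_cost, prev_idx = min(
                (best[t - 1][j][0]
                 + w_join * gaussian_nll(feat - prev_feat, *join_dist), j)
                for j, (_, prev_feat) in enumerate(candidates[t - 1])
            )
            row.append((target_cost + prev_cost, prev_idx))
        best.append(row)

    # Backtrack from the cheapest final candidate.
    idx = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    path = []
    for t in range(len(candidates) - 1, -1, -1):
        path.append(candidates[t][idx][0])
        idx = best[t][idx][1]
    return path[::-1]


if __name__ == "__main__":
    # Toy data: two positions, two candidate units each (values are made up).
    cands = [[("a1", 100.0), ("a2", 120.0)], [("b1", 101.0), ("b2", 150.0)]]
    targets = [(100.0, 25.0), (100.0, 25.0)]
    print(select_units(cands, targets, join_dist=(0.0, 25.0)))
```

In this toy setup, a candidate is penalized both for lying far from the predicted target distribution and for creating a large jump at the join, which is the trade-off the two costs are meant to capture. A real system would use multi-dimensional features and mixtures rather than single Gaussians.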

Topics

Deep Learning > Architectures > Neural Networks; Deep Learning > Models > Generative Models; Speech & Audio > Synthesis > Text-to-Speech

Keywords

deep learning, text-to-speech synthesis, on-device inference, mixture density network, unit selection
