Large-Scale Automatic Audiobook Creation

Brendan Walsh; Mark Hamilton; Greg Newby; Xi Wang; Serena Ruan; Sheng Zhao; Lei He; Shaofei Zhang; Eric Dettinger; William T. Freeman; Markus Weimer

INTERSPEECH 2023

Abstract

An audiobook can dramatically improve a work of literature's accessibility and increase reader engagement. However, audiobooks can take hundreds of hours of human effort to create, edit, and publish. In this work, we present a system that can automatically generate high-quality audiobooks from online e-books. In particular, we leverage recent advances in neural text-to-speech to create and release thousands of human-quality, open-license audiobooks from the Project Gutenberg e-book collection. Our method can identify the proper subset of e-book content to read for a wide collection of diversely structured books and can operate on hundreds of books in parallel. Our system allows users to customize an audiobook's speaking speed, style, and emotional intonation, and can even match a desired voice using a small amount of sample audio. This work contributes over five thousand open-license audiobooks and an interactive demo that allows users to quickly create their own customized audiobooks. To listen to the audiobook collection, visit https://aka.ms/audiobook.

Topics

Deep Learning > Models > Generative Models
Speech & Audio > Synthesis > Text-to-Speech

Keywords

speech synthesis, neural text-to-speech, audiobook generation, voice customization, emotional intonation

Related papers

Audio-Visual Praise Estimation for Conversational Video based on Synchronization-Guided Multimodal Transformer 2023

Improving the response timing estimation for spoken dialogue systems by reducing the effect of speech recognition delay 2023

Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation 2023

What are differences? Comparing DNN and Human by Their Performance and Characteristics in Speaker Age Estimation 2023
