Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme

Hieu-Thi Luong; Junichi Yamagishi

2023 INTERSPEECH INTERSPEECH 2023

Controlling Multi-Class Human Vocalization Generation via a Simple Segment-based Labeling Scheme

Abstract

As prompt-based generative models have received much attention, many studies have proposed a similar model for sound generation. While prompt-based generative models have an intuitive interface for non-professional users to experiment with, they lack the ability to control the generated sounds via a more direct means. In this work, we investigated the use of a simple segment-based labeling scheme for human vocalization generation, which is a specific subset of sound generation. By conditioning the generative models on the label sequence which marks the vocalization class of the segment, the generated sound can be controlled in a more detailed manner while maintaining a simple and intuitive input interface. Our experiments showed that simply switching the label scheme from global to segment-based does not degrade the quality of the generated samples in any way and provides a new method of controlling the generation process.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Deep Learning and Speech & Audio

🧭 Keyword Pioneer — vocalization generation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio