TraceableSpeech: Towards Proactively Traceable Text-to-Speech with Watermarking

Junzuo Zhou; Jiangyan Yi; Tao Wang; Jianhua Tao; Ye Bai; Chu Yuan Zhang; Yong Ren; Zhengqi Wen

INTERSPEECH 2024

Abstract

Various threats posed by progress in text-to-speech (TTS) have prompted the need to reliably trace synthesized speech. However, contemporary approaches to this task add watermarks to the audio in a separate step after generation, a process that hurts both speech quality and watermark imperceptibility. In addition, these approaches are limited in robustness and flexibility. To address these problems, we propose TraceableSpeech, a novel TTS model that directly generates watermarked speech, improving watermark imperceptibility and speech quality. Furthermore, we design frame-wise imprinting and extraction of watermarks, achieving higher robustness against resplicing attacks and temporal flexibility in operation. Experimental results show that TraceableSpeech outperforms strong baselines, in which VALL-E or HiFicodec is individually combined with WavMark, in watermark imperceptibility, speech quality, and resilience against resplicing attacks. It can also be applied to speech of various durations.
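To make the resplicing argument concrete, the following is a minimal toy sketch and not the method of the paper: it uses a hand-written additive spread-spectrum watermark on raw samples in place of a neural TTS model, and every name and constant in it (`FRAME_LEN`, `ALPHA`, `carrier`, `imprint_frame`, `extract_frame`, `extract_by_vote`) is a hypothetical choice made for illustration. The watermark strength is deliberately exaggerated relative to the host signal so that recovery is guaranteed. The point it illustrates is only that when every frame carries the full watermark, dropping and reordering frames does not prevent extraction.

```python
import random

FRAME_LEN = 64   # samples per frame (hypothetical value)
ALPHA = 0.05     # per-bit watermark amplitude (hypothetical, exaggerated)


def carrier(bit_index):
    """Square-wave carrier for one bit; carriers for different bits are orthogonal."""
    return [1.0 if (n >> bit_index) & 1 == 0 else -1.0 for n in range(FRAME_LEN)]


def imprint_frame(frame, bits):
    """Add every watermark bit to a single frame as a signed carrier."""
    out = list(frame)
    for b, bit in enumerate(bits):
        sign = 1.0 if bit else -1.0
        c = carrier(b)
        for n in range(FRAME_LEN):
            out[n] += ALPHA * sign * c[n]
    return out


def extract_frame(frame, num_bits):
    """Recover bits from one frame via the sign of its correlation with each carrier."""
    bits = []
    for b in range(num_bits):
        corr = sum(x * y for x, y in zip(frame, carrier(b)))
        bits.append(1 if corr > 0 else 0)
    return bits


def extract_by_vote(frames, num_bits):
    """Majority vote over per-frame extractions; works for any number of frames."""
    votes = [0] * num_bits
    for f in frames:
        for b, bit in enumerate(extract_frame(f, num_bits)):
            votes[b] += 1 if bit else -1
    return [1 if v > 0 else 0 for v in votes]


rng = random.Random(0)
watermark = [1, 0, 1, 1]
# Stand-in for synthesized speech: 20 frames of low-amplitude noise.
host = [[rng.uniform(-0.04, 0.04) for _ in range(FRAME_LEN)] for _ in range(20)]
marked = [imprint_frame(f, watermark) for f in host]

# Resplicing attack: keep two segments and swap their order.
respliced = marked[12:17] + marked[2:6]
recovered = extract_by_vote(respliced, len(watermark))
print(recovered == watermark)
```

Because extraction operates on each frame independently, the same `extract_by_vote` call works on the full 20-frame signal and on the 9-frame respliced one, which mirrors the temporal flexibility described above.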

Topics

Artificial Intelligence > Core AI > Responsible AI
Speech & Audio > Synthesis > Text-to-Speech

Keywords

latent representation; audio watermarking; text-to-speech synthesis; watermark extraction; resplicing attack
