Aligning Speech Segments Beyond Pure Semantics

Kevin Heffernan; Artyom Kozhevnikov; Loic Barrault; Alexandre Mourachko; Holger Schwenk

ACL 2024

Abstract

Multilingual parallel data for speech-to-speech translation is scarce and expensive to create from scratch. This is all the more true for expressive speech translation, which aims to preserve not only the semantics but also the overall prosody (e.g. style, emotion, rate of speech). Existing corpora contain speech utterances with the same meaning, yet their prosody typically differs, because human annotators are not tasked with reproducing these aspects and crowd-sourced efforts do not prioritize this kind of alignment. In this paper, we propose a novel alignment algorithm that automatically forms pairs of speech segments aligned not only in meaning but also in expressivity. To validate our approach, we train an expressive multilingual speech-to-speech translation system on the automatically aligned data. Our experiments show that, compared with semantic-only approaches, expressively aligned data yields large improvements in source expressivity preservation (e.g. a 43% uplift in speech rate preservation on average) while maintaining content translation quality. In some scenarios, results also indicate that this alignment algorithm can outperform standard, semantic-focused approaches even on content translation quality.

Topics

Artificial Intelligence > Core AI > Multimodal Learning; Machine Learning > Core Methods > Metric Learning; Machine Learning > Application Areas > Domain Adaptation; Deep Learning > Models > Generative Models; Natural Language Processing > Applications > Machine Translation; Speech & Audio > Synthesis > Speech Enhancement; Natural Language Processing > Generation > Machine Translation; Speech & Audio > Synthesis > Speech Synthesis; Speech & Audio > Processing > Speech Enhancement

Keywords

multilingual translation; semantic alignment; speech-to-speech translation; expressive speech; speech alignment; prosody preservation
