2019
ACL
Exploring Phoneme-Level Speech Representations for End-to-End Speech Translation
Abstract
Previous work on end-to-end translation from speech has primarily used frame-level features as speech representations, which creates longer, sparser sequences than text. We show that a naive method to create compressed phoneme-like speech representations is far more effective and efficient for translation than traditional frame-level speech features. Specifically, we generate phoneme labels for speech frames and average consecutive frames with the same label to create shorter, higher-level source sequences for translation. We see improvements of up to 5 BLEU on both our high- and low-resource language pairs, with a reduction in training time of 60%. Our improvements hold across multiple data sizes and two language pairs.
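The averaging step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `average_by_phoneme`, the toy feature values, and the phoneme labels are all hypothetical, and the sketch assumes one phoneme label per frame has already been produced by some separate labelling step.

```python
from itertools import groupby
from typing import List, Sequence


def average_by_phoneme(
    frames: Sequence[Sequence[float]], labels: Sequence[str]
) -> List[List[float]]:
    """Average each run of consecutive frames that share a phoneme label.

    Hypothetical helper for illustration only; assumes len(frames) == len(labels).
    """
    if len(frames) != len(labels):
        raise ValueError("need exactly one label per frame")
    compressed = []
    index = 0
    # groupby yields one group per run of identical adjacent labels,
    # so a label that recurs later starts a new segment.
    for _, run in groupby(labels):
        run_length = len(list(run))
        segment = frames[index:index + run_length]
        # Element-wise mean over the frames in this run.
        compressed.append([sum(dim) / run_length for dim in zip(*segment)])
        index += run_length
    return compressed


# Toy input (made up): 5 frames of 2-dimensional features.
frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 5.0], [0.0, 2.0], [2.0, 4.0]]
labels = ["AH", "AH", "T", "AH", "AH"]
compressed = average_by_phoneme(frames, labels)
print(len(frames), "frames ->", len(compressed), "segments")
```

Only adjacent frames are merged, so the two separate "AH" runs in the toy input stay separate and the five input frames become three source vectors in their original order.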