2017 INTERSPEECH INTERSPEECH 2017

Improving Speech Recognizers by Refining Broadcast Data with Inaccurate Subtitle Timestamps

Abstract

This paper proposes an automatic method to refine broadcast data collected every week for efficient acoustic model training. For training acoustic models, we use only audio signals, subtitle texts, and subtitle timestamps accompanied by recorded broadcast programs. However, the subtitle timestamps are often inaccurate due to inherent characteristics of closed captioning. In the proposed method, we remove subtitle texts with low subtitle quality index, concatenate adjacent subtitle texts into a merged subtitle text, and correct the timestamp of the merged subtitle text by adding a margin. Then, a speech recognizer is used to obtain a hypothesis text from the speech segment corresponding to the merged subtitle text. Finally, the refined speech segments to be used for acoustic model training, are generated by selecting the subparts of the merged subtitle text that matches the hypothesis text. It is shown that the acoustic models trained by using refined broadcast data give significantly higher speech recognition accuracy than those trained by using raw broadcast data. Consequently, the proposed method can efficiently refine a large amount of broadcast data with inaccurate timestamps taking about half of the time, compared with the previous approaches.

🌉 Interdisciplinary Bridge — Machine Learning and Speech & Audio
🧭 Keyword Pioneer — subtitle timestamp
🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio