Speaker Adaptive Training for Speech Recognition Based on Attention-Over-Attention Mechanism

Genshun Wan; Jia Pan; Qingran Wang; Jianqing Gao; Zhongfu Ye

INTERSPEECH 2020

Abstract

In our previous work, we introduced a speaker adaptive training method for speech recognition based on a frame-level attention mechanism, which proved to be effective. In this paper, we present an improved method that introduces an attention-over-attention mechanism. This attention module further measures the contribution of each frame in an utterance to the speaker embeddings, and then generates an utterance-level speaker embedding with which to perform speaker adaptive training. Compared with frame-level embeddings, the generated utterance-level speaker embeddings are more representative and stable. Experiments on both the Switchboard and AISHELL-2 tasks show that our method achieves a relative word error rate reduction of approximately 8.0% compared with the speaker-independent model, and of over 6.0% compared with the traditional utterance-level d-vector-based speaker adaptive training method.
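
The idea of weighting each frame's contribution and pooling the result into a single utterance-level embedding can be illustrated with a minimal sketch. This is not the authors' attention-over-attention module, whose details the abstract does not give; it is plain scaled dot-product attention pooling in pure Python, and the names `attention_pool`, `frames`, and `query` (a stand-in for a learned vector) are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(frame_embeddings, query):
    """Pool frame-level embeddings into one utterance-level embedding.

    Each frame is scored against `query` (hypothetical stand-in for a
    learned vector); the softmax of the scores gives the per-frame
    contribution, and the output is the weighted average of the frames.
    """
    dim = len(query)
    scale = math.sqrt(dim)
    scores = [
        sum(f * q for f, q in zip(frame, query)) / scale
        for frame in frame_embeddings
    ]
    weights = softmax(scores)
    pooled = [
        sum(w * frame[d] for w, frame in zip(weights, frame_embeddings))
        for d in range(dim)
    ]
    return pooled, weights


# Toy input: three 2-dimensional frame-level speaker embeddings (made-up values).
frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
utt_embedding, weights = attention_pool(frames, query)
```

In this toy input, frames that align more closely with `query` receive larger weights, so the pooled vector is dominated by them. A single pooled vector per utterance is also what makes an utterance-level embedding less sensitive to individual noisy frames than frame-level embeddings.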

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Speech & Audio > Recognition > Speech Recognition

Keywords

speaker embedding, end-to-end speech recognition, speaker adaptive training, utterance-level embedding, attention-over-attention mechanism
