A Joint End-to-End and DNN-HMM Hybrid Automatic Speech Recognition System with Transferring Sharable Knowledge
Abstract
This paper presents a joint end-to-end and deep neural network-hidden Markov model (DNN-HMM) hybrid automatic speech recognition (ASR) system in which the two systems share network components. Recent studies have shown that end-to-end ASR systems achieve performance competitive with that of DNN-HMM hybrid ASR systems. The two approaches have different advantages: the end-to-end ASR system estimates its output with a single, globally optimized model, whereas the DNN-HMM hybrid ASR system offers stable frame-by-frame processing. In our previous study, we proposed a method that uses an end-to-end ASR system to rescore hypotheses generated by a DNN-HMM hybrid ASR system. However, this conventional method cannot efficiently leverage both sets of advantages because the network components of the two systems are modeled independently. To tackle this problem, we propose a joint end-to-end and DNN-HMM hybrid ASR system that shares the network so that knowledge is transferred between the two systems. In the proposed method, the end-to-end ASR system is enhanced with the output of an internal layer of the DNN acoustic model in the DNN-HMM hybrid ASR system. This allows the joint ASR system to efficiently leverage the sharable information. Experimental results show that the proposed method outperforms the conventional method.
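As a rough illustration of the rescoring setup mentioned above, the following minimal sketch re-ranks an N-best list from a hybrid system using scores from an end-to-end model. The abstract does not state how the two scores are combined; the linear interpolation of log-scores, the function name `rescore_nbest`, the `weight` parameter, and all numbers are assumptions made for illustration only. The sketch does not cover the sharing of network components.

```python
# Sketch of N-best rescoring: a hybrid system proposes hypotheses and an
# end-to-end model rescores them. All names and numbers are hypothetical.

def rescore_nbest(nbest, e2e_score, weight=0.5):
    """Return hypotheses sorted by interpolated log-score, best first.

    nbest:     list of (text, hybrid_log_score) pairs from the hybrid system
    e2e_score: callable mapping a hypothesis text to an end-to-end log-score
    weight:    assumed interpolation weight on the end-to-end score (0..1)
    """
    rescored = []
    for text, hybrid_log_score in nbest:
        # Assumed combination rule: linear interpolation of the two log-scores.
        combined = (1.0 - weight) * hybrid_log_score + weight * e2e_score(text)
        rescored.append((text, combined))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)


# Toy example with made-up log-scores.
toy_nbest = [("recognize speech", -12.0), ("wreck a nice beach", -11.5)]
toy_e2e_scores = {"recognize speech": -3.0, "wreck a nice beach": -9.0}

ranked = rescore_nbest(toy_nbest, toy_e2e_scores.get, weight=0.5)
best_text = ranked[0][0]
```

In this toy case the hybrid score alone slightly prefers the second hypothesis, while the interpolated score prefers the first. In practice the weight would presumably be tuned on held-out data.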