One-Pass Single-Channel Noisy Speech Recognition Using a Combination of Noisy and Enhanced Features
Abstract
This paper introduces a noise-robust automatic speech recognition (ASR) method that remains effective under one-pass, single-channel processing. Under these constraints, single-channel speech enhancement seems a reasonable approach to noise-robust ASR, because more complicated techniques that require multi-pass processing cannot be used. In many cases, however, single-channel speech enhancement seriously degrades ASR accuracy because of the speech distortion it introduces. In addition, joint training, an advanced acoustic modeling framework, is relatively ineffective in the single-channel case. To overcome these problems, we propose a noise-robust acoustic modeling framework based on a feature-level combination of noisy speech and enhanced speech. To obtain further improvements, we also adopt a sub-network-level combination of noisy and enhanced speech, and a gating mechanism that dynamically selects the appropriate speech features. Through comparative evaluations, we confirm that the proposed method improves ASR accuracy in noisy environments under these strong constraints.
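As a rough illustration of the two ideas named above, the following minimal sketch shows one way a feature-level combination and a gating mechanism could be expressed. It is not the implementation evaluated in this paper. The function names (concat_features, gated_features), the per-dimension sigmoid gate, and the assumption that gate logits are supplied externally (in practice they would come from a trained network) are all hypothetical choices made for illustration.

```python
import math


def concat_features(noisy, enhanced):
    """Feature-level combination (illustrative): concatenate the noisy and
    enhanced feature vectors of each frame into one longer vector."""
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same number of frames")
    return [n + e for n, e in zip(noisy, enhanced)]


def sigmoid(x):
    """Logistic function; maps a gate logit to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_features(noisy, enhanced, gate_logits):
    """Gated combination (assumed form): for every frame and dimension,
    output g * noisy + (1 - g) * enhanced, where g = sigmoid(logit).
    A large positive logit favors the noisy feature, a large negative
    logit favors the enhanced feature."""
    combined = []
    for n_frame, e_frame, z_frame in zip(noisy, enhanced, gate_logits):
        frame = []
        for n, e, z in zip(n_frame, e_frame, z_frame):
            g = sigmoid(z)
            frame.append(g * n + (1.0 - g) * e)
        combined.append(frame)
    return combined


# Toy input: 2 frames with 2 feature dimensions each (made-up values).
noisy = [[1.0, 2.0], [3.0, 4.0]]
enhanced = [[0.5, 1.5], [2.5, 3.5]]

stacked = concat_features(noisy, enhanced)
balanced = gated_features(noisy, enhanced, [[0.0, 0.0], [0.0, 0.0]])
```

With all gate logits at zero, each output dimension is the plain average of the noisy and enhanced features. Concatenation leaves the choice between the two streams to the acoustic model, whereas the gate makes that choice explicit for every frame and dimension.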