Code-Switching Sentence Generation by Bert and Generative Adversarial Networks
Abstract
Code-switching has become a common linguistic phenomenon. Compared with monolingual automatic speech recognition (ASR) tasks, insufficient data is a major challenge for code-switching speech recognition. In this paper, we propose an approach that combines the Bidirectional Encoder Representations from Transformers (Bert) model with the Generative Adversarial Net (GAN) model to generate code-switching text data. It improves upon previous work by (1) applying Bert as a masked language model to predict the mixed-in foreign words and (2) building on the GAN framework, with Bert serving as both the generator and the discriminator, to further ensure that the generated sentences are sufficiently similar to natural examples. We evaluate the effectiveness of the generated data by its contribution to ASR. Experiments show that our approach reduces the English word error rate by 1.5% on the Mandarin-English code-switching spontaneous speech corpus OC16-CE80.
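To make step (1) concrete, the following is a minimal sketch of masked-word substitution: selected positions in a monolingual sentence are replaced by a mask token, and a predictor fills each mask with a foreign word. It is an illustration under stated assumptions, not the implementation used in this work. The function names, the choice of positions, and the toy lexicon are all hypothetical; in particular, `make_toy_predictor` is a dictionary lookup standing in for a Bert masked language model, and the GAN training of step (2) is not shown.

```python
from typing import Callable, Dict, List, Sequence

MASK = "[MASK]"  # mask token in the style used by Bert vocabularies

# A predictor maps (masked sentence, mask index) to a replacement word.
Predictor = Callable[[Sequence[str], int], str]


def mask_positions(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Return a copy of `tokens` with the chosen positions replaced by MASK."""
    chosen = set(positions)
    return [MASK if i in chosen else tok for i, tok in enumerate(tokens)]


def fill_masks(masked: Sequence[str], predict: Predictor) -> List[str]:
    """Replace every MASK token with the word proposed by `predict`."""
    filled = list(masked)
    for i, tok in enumerate(masked):
        if tok == MASK:
            filled[i] = predict(masked, i)
    return filled


# Hypothetical toy lexicon; a real system would score candidate words
# with a masked language model instead of looking them up.
TOY_LEXICON: Dict[str, str] = {"会议": "meeting", "电脑": "computer"}


def make_toy_predictor(original: Sequence[str]) -> Predictor:
    """Build a stand-in predictor that translates the word that was masked."""
    def predict(masked: Sequence[str], index: int) -> str:
        source_word = original[index]
        # Fall back to the original word when the lexicon has no entry.
        return TOY_LEXICON.get(source_word, source_word)
    return predict


def generate_code_switched(tokens: Sequence[str], positions: Sequence[int]) -> List[str]:
    """Mask the chosen positions, then fill them with predicted foreign words."""
    masked = mask_positions(tokens, positions)
    return fill_masks(masked, make_toy_predictor(tokens))


if __name__ == "__main__":
    sentence = ["我", "明天", "有", "一个", "会议"]
    print(" ".join(generate_code_switched(sentence, [4])))
```

Keeping the predictor as a plain callable means the toy lookup could be swapped for a real masked language model without changing the masking and filling logic.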