The Impact of Dialect Variation on Robust Automatic Speech Recognition for Catalan
Abstract
To accurately transcribe a speech signal, automatic speech recognition (ASR) systems must be robust to a wide range of task-independent variation, such as speaker factors, recording quality, or even “adversarial noise” designed to disrupt performance. We manipulated the dialect composition of fine-tuning data for ASR to study whether balancing the relative proportion of dialects had an impact on models’ robustness to two such sources of variation: dialect variation and adversarial perturbations. We fine-tuned XLSR-53 for Catalan ASR using four different dialect compositions, each containing the Central Catalan dialect. These were defined as 100%, 80%, 50%, and 20% Central Catalan, with the remaining portion split evenly between four other Catalan dialects. While increasing the relative proportion of dialect variants improved models’ dialect robustness, it did not have a meaningful impact on adversarial robustness. These findings suggest that while improvements to ASR can be made by diversifying the training data, such changes do not sufficiently counteract adversarial attacks, leaving the technology open to security threats.
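
The arithmetic behind the four dialect compositions can be made concrete with a minimal sketch. It assumes only what is stated above: a fixed Central Catalan share, with the remainder split evenly between four other dialects. The labels `other_1` to `other_4` and the helper `dialect_mix` are hypothetical placeholders, not names taken from the study.

```python
from fractions import Fraction

CENTRAL = "central"
# Placeholder labels: the four other dialects are not named here.
OTHERS = ["other_1", "other_2", "other_3", "other_4"]


def dialect_mix(central_share):
    """Return each dialect's share of the fine-tuning data.

    The Central Catalan share is fixed; whatever remains is divided
    evenly among the other dialects.
    """
    central = Fraction(central_share)
    rest = (1 - central) / len(OTHERS)
    mix = {CENTRAL: central}
    mix.update({name: rest for name in OTHERS})
    return mix


# The four conditions: 100%, 80%, 50%, and 20% Central Catalan.
MIXES = {pct: dialect_mix(Fraction(pct, 100)) for pct in (100, 80, 50, 20)}

for pct, mix in MIXES.items():
    shares = ", ".join(f"{name}={float(share):.3f}" for name, share in mix.items())
    print(f"{pct}% Central: {shares}")
```

Under these assumptions, each of the other dialects contributes 5% of the data in the 80% condition and 12.5% in the 50% condition, and in the 20% condition all five dialects are equally represented.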