Uncovering Syllable Constituents in the Self-Attention-Based Speech Representations of Whisper

Erfan A. Shams; Iona Gessinger; Julie Carson-Berndsen

EMNLP 2024

Abstract

As intuitive units of speech, syllables have been widely studied in linguistics. A syllable can be defined as a three-constituent unit with a vocalic centre surrounded by two (in some languages optional) consonant clusters. Syllables are also used to design automatic speech recognition (ASR) models. The significance of knowledge-driven syllable-based tokenisation in ASR over data-driven byte-pair encoding has often been debated. However, the emergence of transformer-based ASR models employing self-attention (SA) overshadowed this debate. These models learn the nuances of speech from large corpora without prior knowledge of the domain; yet, they are not interpretable by design. Consequently, it is not clear if the recent performance improvements are related to the extraction of human-interpretable knowledge. We probe such models for syllable constituents and use an SA head pruning method to assess the relevance of the SA weights. We also investigate the role of vowel identification in syllable constituent probing. Our findings show that the general features of syllable constituents are extracted in the earlier layers of the model and the syllable-related features mostly depend on the temporal knowledge incorporated in specific SA heads rather than on vowel identification.
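The three-constituent definition of the syllable given in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the vowel inventory (ARPAbet-style symbols without stress digits) and the function names `split_syllable` and `label_constituents` are assumptions made for illustration. Per-phone labels of this kind are the sort of target a syllable-constituent probe might be trained to predict.

```python
# Hypothetical vowel inventory: ARPAbet-style symbols with stress digits removed.
VOWELS = {
    "AA", "AE", "AH", "AO", "AW", "AY", "EH", "ER",
    "EY", "IH", "IY", "OW", "OY", "UH", "UW",
}


def split_syllable(phones):
    """Split the phones of ONE syllable into (onset, nucleus, coda).

    The nucleus is the vocalic centre; the onset and coda are the
    (possibly empty) consonant clusters before and after it.
    """
    vowel_idx = [i for i, p in enumerate(phones) if p in VOWELS]
    if not vowel_idx:
        raise ValueError("syllable has no vocalic centre")
    first, last = vowel_idx[0], vowel_idx[-1]
    # Vowels separated by consonants suggest the input spans several syllables.
    if vowel_idx != list(range(first, last + 1)):
        raise ValueError("vowels are not contiguous; expected a single syllable")
    return phones[:first], phones[first:last + 1], phones[last + 1:]


def label_constituents(phones):
    """Return one constituent label per phone, in order."""
    onset, nucleus, coda = split_syllable(phones)
    return (["onset"] * len(onset)
            + ["nucleus"] * len(nucleus)
            + ["coda"] * len(coda))
```

For example, calling `split_syllable(["S", "T", "R", "EH", "NG", "K", "TH", "S"])` (the word "strengths") separates the onset cluster, the single vowel, and the coda cluster, while a vowel-only syllable such as `["AY"]` yields empty onset and coda lists, matching the "optional in some languages" clause of the definition.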

Topics

Artificial Intelligence > Core AI > Interpretability
Deep Learning > Architectures > Transformers
Deep Learning > Techniques > Pretraining
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition
Speech & Audio > Analysis > Speech Analysis
Deep Learning > Techniques > Attention
Artificial Intelligence > Core AI > Speech Processing

Keywords

self-attention mechanism; automatic speech recognition; model interpretability; speech representation; attention head pruning; probing method; syllable constituent; self-attention weight; vowel identification
