An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech

Jianrong Wang; Nan Gu; Mei Yu; Xuewei Li; Qiang Fang; Li Liu

2021 INTERSPEECH INTERSPEECH 2021

An Attention Self-Supervised Contrastive Learning Based Three-Stage Model for Hand Shape Feature Representation in Cued Speech

Abstract

Cued Speech (CS) is a communication system for deaf people or hearing impaired people, in which a speaker uses it to aid a lipreader in phonetic level by clarifying potentially ambiguous mouth movements with hand shape and positions. Feature extraction of multi-modal CS is a key step in CS recognition. Recent supervised deep learning based methods suffer from noisy CS data annotations especially for hand shape modality. In this work, we first propose a self-supervised contrastive learning method to learn the feature representation of image without using labels. Secondly, a small amount of manually annotated CS data are used to fine-tune the first module. Thirdly, we present a module, which combines Bi-LSTM and self-attention networks to further learn sequential features with temporal and contextual information. Besides, to enlarge the volume and the diversity of the current limited CS datasets, we build a new British English dataset containing 5 native CS speakers. Evaluation results on both French and British English datasets show that our model achieves over 90% accuracy in hand shape recognition. Significant improvements of 8.75% (for French) and 10.09% (for British English) are achieved in CS phoneme recognition correctness compared with the state-of-the-art.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning and Speech & Audio

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Jianrong Wang , Nan Gu , Mei Yu , Xuewei Li , Qiang Fang , Li Liu

Topics

Machine Learning > Learning Types > Contrastive Learning Machine Learning > Learning Types > Self-Supervised Learning Deep Learning > Architectures > Transformers Computer Vision > Analysis > Human Analysis Speech & Audio > Recognition > Speech Recognition Deep Learning > Techniques > Contrastive Learning Deep Learning > Learning Types > Self-Supervised Learning Deep Learning > Learning Types > Contrastive Learning

Keywords

contrastive learning self-supervised learning attention mechanism feature representation bidirectional lstm cued speech self-supervised contrastive learning hand shape recognition

Download PDF

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021