A Purely End-to-End System for Multi-speaker Speech Recognition

Hiroshi Seki; Takaaki Hori; Shinji Watanabe; Jonathan Le Roux; John R. Hershey

ACL 2018

Abstract

Recently, there has been growing interest in multi-speaker speech recognition, where the utterances of multiple speakers are recognized from their mixture. Promising techniques have been proposed for this task, but earlier works have required additional training data such as isolated source signals or senone alignments for effective learning. In this paper, we propose a new sequence-to-sequence framework to directly decode multiple label sequences from a single speech sequence by unifying source separation and speech recognition functions in an end-to-end manner. We further propose a new objective function to improve the contrast between the hidden vectors to avoid generating similar hypotheses. Experimental results show that the model is directly able to learn a mapping from a speech mixture to multiple label sequences, achieving an 83.1% relative improvement compared to a model trained without the proposed objective. Interestingly, the results are comparable to those produced by previous end-to-end works featuring explicit separation and recognition modules.
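
The abstract does not give the form of the objective, so the following is only an illustrative sketch of how such a training criterion could be assembled, not the authors' implementation. It assumes two things that are not stated above: that output streams are matched to reference transcripts by taking the lowest-loss permutation, and that "contrast between the hidden vectors" is encouraged with a negative symmetric KL divergence between softmax-normalized hidden vectors. The function names and the weight `eta` are hypothetical.

```python
import itertools
import math


def softmax(v):
    """Normalize a vector of real scores into a probability distribution."""
    m = max(v)
    exps = [math.exp(x - m) for x in v]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def contrast_penalty(h1, h2):
    """Negative symmetric KL summed over frames (assumed form of the contrast term).

    h1 and h2 are equal-length sequences of hidden vectors, one per output stream.
    The value is 0 when the streams are identical and becomes more negative as
    they diverge, so minimizing it pushes the two streams apart.
    """
    total = 0.0
    for a, b in zip(h1, h2):
        p, q = softmax(a), softmax(b)
        total += kl(p, q) + kl(q, p)
    return -total


def permutation_free_loss(pair_loss):
    """Pick the stream-to-reference assignment with the lowest total loss.

    pair_loss[s][r] is the recognition loss of output stream s scored against
    reference transcript r (assumed to be computed elsewhere).
    """
    n = len(pair_loss)
    best = min(
        itertools.permutations(range(n)),
        key=lambda perm: sum(pair_loss[s][perm[s]] for s in range(n)),
    )
    return sum(pair_loss[s][best[s]] for s in range(n)), best


def total_objective(pair_loss, h1, h2, eta=0.1):
    """Combine the recognition loss with the weighted contrast term (eta is hypothetical)."""
    loss, _ = permutation_free_loss(pair_loss)
    return loss + eta * contrast_penalty(h1, h2)
```

With a loss table such as `[[5.0, 1.0], [2.0, 6.0]]`, the assignment step pairs stream 0 with reference 1 and stream 1 with reference 0, and the contrast term then lowers the objective further when the two streams' hidden vectors differ.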

Topics

Artificial Intelligence > Core AI > Multi-Agent Systems
Artificial Intelligence > Core AI > Multimodal Learning
Speech & Audio > Recognition > Automatic Speech Recognition
Speech & Audio > Recognition > Speech Recognition
Deep Learning > Learning Types > Deep Learning
Deep Learning > Learning Types > Representation Learning
Deep Learning > Models > Sequence-to-Sequence

Keywords

source separation; speech recognition; end-to-end learning; end-to-end system; multi-speaker recognition; speech mixture; multi-speaker speech recognition; hidden vector contrast
