LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling

Miquel India; José A.R. Fonollosa; Javier Hernando

INTERSPEECH 2017

Abstract

This paper presents a new speaker change detection system based on Long Short-Term Memory (LSTM) neural networks that uses both acoustic data and linguistic content. Language modelling is combined with two different Joint Factor Analysis (JFA) acoustic approaches: i-vectors and speaker factors. Both are compared with a baseline algorithm that uses cosine distance to detect speaker turn changes. LSTM neural networks using both linguistic and acoustic features produce a robust speaker segmentation, and the experimental results show that our proposal clearly outperforms the baseline system.
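As an illustration of the kind of baseline the abstract mentions, the following is a minimal sketch of cosine-distance speaker change detection. It is not the authors' implementation: the function names, the fixed threshold of 0.5, and the toy two-dimensional vectors standing in for per-window speaker embeddings (such as i-vectors) are all assumptions made for the example.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def detect_speaker_changes(embeddings, threshold=0.5):
    """Return each index i where window i differs from window i - 1.

    `embeddings` is a list of per-window vectors (hypothetical stand-ins
    for i-vectors or speaker factors). `threshold` is an assumed value,
    not one taken from the paper.
    """
    changes = []
    for i in range(1, len(embeddings)):
        if cosine_distance(embeddings[i - 1], embeddings[i]) > threshold:
            changes.append(i)
    return changes


# Toy data: two windows pointing one way, then two pointing another.
windows = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(detect_speaker_changes(windows))  # prints [2]
```

In practice the threshold would be tuned on held-out data, and the vectors would come from an acoustic front end rather than being written by hand.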

Authors

Miquel India, José A.R. Fonollosa, Javier Hernando

Topics

Machine Learning > Core Methods > Classification
Machine Learning > Learning Types > Weakly Supervised Learning
Deep Learning > Architectures > Neural Networks
Natural Language Processing > Resources & Methods > Language Modeling
Speech & Audio > Analysis > Speech Analysis

Keywords

acoustic feature, language modelling, long short-term memory network, LSTM neural network, speaker change detection, joint factor analysis, speaker segmentation

Related papers

Description of the Munich-Passau Snore Sound Corpus (MPSSC) 2017

A Study on Replay Attack and Anti-Spoofing for Automatic Speaker Verification 2017

Binaural Reverberant Speech Separation Based on Deep Neural Networks 2017

Building Audio-Visual Phonetically Annotated Arabic Corpus for Expressive Text to Speech 2017

A Comparison of Danish Listeners’ Processing Cost in Judging the Truth Value of Norwegian, Swedish, and English Sentences 2017