Active Speaker Detection as a Multi-Objective Optimization with Uncertainty-Based Multimodal Fusion

Baptiste Pouthier; Laurent Pilati; Leela K. Gudupudi; Charles Bouveyron; Frederic Precioso

INTERSPEECH 2021

Abstract

Numerous studies have established that combining video and audio data significantly benefits active speaker detection. However, either modality can mislead audiovisual fusion by introducing unreliable or deceptive information. This paper frames active speaker detection as a multi-objective learning problem that leverages the best of each modality through a novel self-attention, uncertainty-based multimodal fusion scheme. Results show that the proposed multi-objective learning architecture outperforms traditional approaches, improving both mAP and AUC scores. We further demonstrate that, for active speaker detection, our fusion strategy surpasses other modality fusion methods reported in various disciplines. Finally, we show that the proposed method significantly improves on the state of the art on the AVA-ActiveSpeaker dataset.

Topics

Machine Learning > Optimization & Theory > Optimization

Speech & Audio > Analysis > Speaker Verification

Machine Learning > Learning Types > Multi-Task Learning

Computer Vision > Analysis > Video Understanding

Keywords

self-attention mechanism; multimodal learning; video understanding; multi-objective optimization; uncertainty estimation; multimodal fusion; active speaker detection; uncertainty-based fusion

Related papers

Energy-Friendly Keyword Spotting System Using Add-Based Convolution 2021

Dialogue Situation Recognition for Everyday Conversation Using Multimodal Information 2021

Using Games to Augment Corpora for Language Recognition and Confusability 2021

A Psychology-Driven Computational Analysis of Political Interviews 2021

The 2020 Personalized Voice Trigger Challenge: Open Datasets, Evaluation Metrics, Baseline System and Results 2021