MAAS: Multi-Modal Assignation for Active Speaker Detection

Juan Leon Alcazar; Fabian Caba; Ali K. Thabet; Bernard Ghanem

2021 ICCV ICCV 2021

MAAS: Multi-Modal Assignation for Active Speaker Detection

Abstract

Active speaker detection requires a solid integration of multi-modal cues. While individual modalities can approximate a solution, accurate predictions can only be achieved by explicitly fusing the audio and visual features and modeling their temporal progression. Despite its inherent muti-modal nature, current methods still focus on modeling and fusing short-term audiovisual features for individual speakers, often at frame level. In this paper we present a novel approach to active speaker detection that directly addresses the multi-modal nature of the problem, and provides a straightforward strategy where independent visual features from potential speakers in the scene are assigned to a previously detected speech event. Our experiments show that, an small graph data structure built from local information, allows to approximate an instantaneous audio-visual assignment problem. Moreover, the temporal extension of this initial graph achieves a new state-of-the-art performance on the AVA-ActiveSpeaker dataset with a mAP of 88.8%.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — assignation problem

🐣 Hot Topic Early Bird — multi-modal fusion

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Speech & Audio

Authors

Juan Leon Alcazar , Fabian Caba , Ali K. Thabet , Bernard Ghanem

Topics

Computer Vision > Analysis > Action Recognition Machine Learning > Learning Types > Multi-Modal Learning Deep Learning > Learning Types > Multi-Modal Learning Artificial Intelligence > Core AI > Multi-Modal Learning Computer Vision > Analysis > Multi-Modal Learning

Keywords

temporal modeling graph structure multi-modal fusion active speaker detection assignation problem

Download PDF

Related papers

Spatial-Temporal Transformer for Dynamic Scene Graph Generation 2021

ARAPReg: An As-Rigid-As Possible Regularization Loss for Learning Deformable Shape Generators 2021

A Broad Study on the Transferability of Visual Representations With Contrastive Learning 2021

Query Adaptive Few-Shot Object Detection With Heterogeneous Graph Convolutional Networks 2021

Self-Supervised Neural Networks for Spectral Snapshot Compressive Imaging 2021