Active Speakers in Context

Juan Leon Alcazar; Fabian Caba; Long Mai; Federico Perazzi; Joon-Young Lee; Pablo Arbelaez; Bernard Ghanem

CVPR 2020

Abstract

Current methods for active speaker detection focus on modeling audiovisual information from a single speaker. This strategy can be adequate for single-speaker scenarios, but it prevents accurate detection when the task is to identify which of many candidate speakers is talking. This paper introduces the Active Speaker Context, a novel representation that models relationships between multiple speakers over long time horizons. Our new model learns pairwise and temporal relations from a structured ensemble of audiovisual observations. Our experiments show that a structured feature ensemble already benefits active speaker detection performance. We also find that the proposed Active Speaker Context improves the state of the art on the AVA-ActiveSpeaker dataset, achieving an mAP of 87.1%. Moreover, ablation studies verify that this result is a direct consequence of our long-term multi-speaker analysis.
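The idea of a structured ensemble of audiovisual observations can be illustrated with a small sketch. This is not the paper's model: the function names (`build_context`, `pairwise_scores`), the slot ordering, the padding rule, and the dot-product comparison are all assumptions made for illustration, and the learned pairwise and temporal components described in the abstract are replaced here by a fixed similarity score. The sketch also assumes that every speaker track has one feature vector per frame and that all tracks have the same length.

```python
from typing import Dict, List

Feature = List[float]


def build_context(
    tracks: Dict[str, List[Feature]],
    reference: str,
    center: int,
    half_window: int,
    max_speakers: int,
) -> List[List[Feature]]:
    """Stack per-frame features into a [time][speaker] grid.

    Slot 0 always holds the reference speaker. Other speakers fill the
    remaining slots in sorted order. If there are fewer than max_speakers
    candidates, the reference track is repeated as padding (an assumption
    of this sketch, not a detail taken from the abstract).
    """
    num_frames = len(tracks[reference])
    others = sorted(name for name in tracks if name != reference)
    order = [reference] + others[: max_speakers - 1]
    while len(order) < max_speakers:
        order.append(reference)

    context = []
    for t in range(center - half_window, center + half_window + 1):
        clamped = min(max(t, 0), num_frames - 1)  # repeat edge frames
        context.append([tracks[name][clamped] for name in order])
    return context


def pairwise_scores(context: List[List[Feature]]) -> List[List[float]]:
    """Compare the reference slot with every slot at each time step.

    A plain dot product stands in for the learned pairwise relation.
    """
    return [
        [sum(a * b for a, b in zip(row[0], feat)) for feat in row]
        for row in context
    ]


# Hypothetical 2-D features for two speakers over three frames.
tracks = {
    "a": [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    "b": [[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]],
}
context = build_context(tracks, "a", center=0, half_window=1, max_speakers=3)
scores = pairwise_scores(context)
```

The grid keeps the reference speaker in a fixed slot so that any later step can tell whose activity is being predicted, while the time axis gives that step access to a longer horizon than a single frame.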

Topics

Machine Learning > Learning Types > Self-Supervised Learning
Computer Vision > Analysis > Activity Recognition
Computer Vision > Processing > Video Understanding
Speech & Audio > Recognition > Speaker Recognition
Computer Vision > Core AI > Multimodal Learning
Artificial Intelligence > Core AI > Multi-Modal Learning

Keywords

pairwise relation, feature ensemble, temporal relation, active speaker detection, audiovisual information, long-term analysis, audiovisual processing, multi-speaker analysis
