STAViS: Spatio-Temporal AudioVisual Saliency Network

Antigoni Tsiami; Petros Koutras; Petros Maragos

CVPR 2020

Abstract

We introduce STAViS, a spatio-temporal audiovisual saliency network that combines spatio-temporal visual and auditory information to efficiently address the problem of saliency estimation in videos. Our approach employs a single network that combines visual saliency and auditory features, and learns to localize sound sources and to fuse the two saliencies into a final saliency map. The network has been designed, trained end-to-end, and evaluated on six different databases that contain audiovisual eye-tracking data for a large variety of videos. We compare our method against eight different state-of-the-art visual saliency models. Evaluation results across databases indicate that our STAViS model outperforms our visual-only variant as well as the other state-of-the-art models in the majority of cases. Moreover, its consistently good performance on all databases indicates that it is appropriate for estimating saliency "in-the-wild". The code is available at https://github.com/atsiami/STAViS.
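The abstract describes fusing a visual saliency map with a sound-source localization map into one final saliency map. As a rough illustration of that idea only, the sketch below combines two toy maps with a fixed weight. It is not the STAViS architecture, which learns the fusion end-to-end; the function names `normalize` and `fuse`, the weight `alpha`, and the toy values are all hypothetical.

```python
def normalize(sal_map):
    """Scale a 2-D map (list of rows) so that its entries sum to 1."""
    total = sum(sum(row) for row in sal_map)
    if total == 0:
        # Assumption: fall back to a uniform map when there is no signal.
        n = sum(len(row) for row in sal_map)
        return [[1.0 / n for _ in row] for row in sal_map]
    return [[v / total for v in row] for row in sal_map]


def fuse(visual, audio, alpha=0.5):
    """Convex combination of two same-sized maps.

    alpha is a hypothetical fixed weight on the audio map; a learned
    fusion, as in the paper, would replace this hand-set value.
    """
    v, a = normalize(visual), normalize(audio)
    return [
        [(1 - alpha) * vv + alpha * aa for vv, aa in zip(v_row, a_row)]
        for v_row, a_row in zip(v, a)
    ]


visual = [[0.0, 2.0], [1.0, 1.0]]  # toy visual saliency map
audio = [[0.0, 0.0], [0.0, 4.0]]   # toy sound-source localization map
fused = fuse(visual, audio, alpha=0.5)
```

Because both inputs are normalized before mixing, the fused map also sums to 1, so it can be read as a distribution over pixel locations. In this toy case the location where the sound source sits gains weight relative to the visual-only map.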

Topics

Computer Vision > Analysis > Scene Understanding

Computer Vision > Processing > Video Understanding

Computer Vision > Core AI > Multimodal Learning

Computer Vision > Analysis > Video Understanding

Deep Learning > Learning Types > Multi-Modal Learning

Computer Vision > Analysis > Computer Vision

Keywords

multimodal learning, video understanding, visual attention, saliency detection, spatio-temporal analysis, end-to-end training, saliency estimation, video saliency, audiovisual saliency network, eye tracking dataset, audiovisual saliency

Related papers

Deep Polarization Cues for Transparent Object Segmentation 2020

HRank: Filter Pruning Using High-Rank Feature Map 2020

Panoptic-Based Image Synthesis 2020

Select, Supplement and Focus for RGB-D Saliency Detection 2020

ClusterVO: Clustering Moving Instances and Estimating Visual Odometry for Self and Surroundings 2020