Sharingan: A Transformer Architecture for Multi-Person Gaze Following

Samy Tafasca; Anshul Gupta; Jean-Marc Odobez

2024 CVPR CVPR 2024

Sharingan: A Transformer Architecture for Multi-Person Gaze Following

Abstract

Gaze is a powerful form of non-verbal communication that humans develop from an early age. As such modeling this behavior is an important task that can benefit a broad set of application domains ranging from robotics to sociology. In particular the gaze following task in computer vision is defined as the prediction of the 2D pixel coordinates where a person in the image is looking. Previous attempts in this area have primarily centered on CNN-based architectures but they have been constrained by the need to process one person at a time which proves to be highly inefficient. In this paper we introduce a novel and effective multi-person transformer-based architecture for gaze prediction. While there exist prior works using transformers for multi-person gaze prediction they use a fixed set of learnable embeddings to decode both the person and its gaze target which requires a matching step afterward to link the predictions with the annotations. Thus it is difficult to quantitatively evaluate these methods reliably with the available benchmarks or integrate them into a larger human behavior understanding system. Instead we are the first to propose a multi-person transformer-based architecture that maintains the original task formulation and ensures control over the people fed as input. Our main contribution lies in encoding the person-specific information into a single controlled token to be processed alongside image tokens and using its output for prediction based on a novel multiscale decoding mechanism. Our new architecture achieves state-of-the-art results on the GazeFollow VideoAttentionTarget and ChildPlay datasets and outperforms comparable multi-person architectures with a notable margin. Our code checkpoints and data extractions will be made publicly available soon.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning

🧭 Keyword Pioneer — multiscale decoding

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Samy Tafasca , Anshul Gupta , Jean-Marc Odobez

Topics

Deep Learning > Architectures > Transformers Computer Vision > Analysis > Human Analysis Artificial Intelligence > Core AI > Computer Vision

Keywords

human analysis attention mechanism gaze prediction gaze estimation gaze following multiscale decoding multi-person transformer multi person

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024