SpFormer: Spatio-Temporal Modeling for Scanpaths with Transformer

Wenqi Zhong; Linzhi Yu; Chen Xia; Junwei Han; Dingwen Zhang

AAAI 2024

Abstract

The saccadic scanpath, a data representation of human visual behavior, has received broad interest in multiple domains. A scanpath is a complex eye-tracking data modality that comprises the sequence of fixation positions and fixation durations, coupled with image information. However, previous methods usually suffer from spatial misalignment of fixation features and from the loss of critical temporal information (including temporal correlation and fixation duration). In this study, we propose a Transformer-based scanpath model, SpFormer, to alleviate these problems. First, we propose a fixation-centric paradigm to extract spatially aligned fixation features and tokenize the scanpaths. Then, following the visual working memory mechanism, we design a local meta attention to reduce the semantic redundancy of fixations and guide the model to focus on the meta scanpath. Finally, we progressively integrate the duration information and fuse it with the fixation features to address the location ambiguity that grows as the number of Transformer blocks increases. We conduct extensive experiments on four databases under three tasks. SpFormer establishes new state-of-the-art results in distinct settings, verifying its flexibility and versatility in practical applications. The code can be obtained from https://github.com/wenqizhong/SpFormer.
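To make the data modality concrete, the following is a minimal sketch of a scanpath as an ordered list of fixations (position plus duration) and of one simple way to build fixation-centered tokens from an image. It is an illustration only and not the authors' implementation: the names `Fixation`, `crop_centered`, and `tokenize_scanpath`, the raw-pixel patches, and the zero padding at image borders are all assumptions made for this example, and the abstract does not specify how SpFormer extracts its features.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class Fixation:
    """One fixation in a scanpath (hypothetical container for this sketch)."""
    x: int           # column of the fixation, in pixels
    y: int           # row of the fixation, in pixels
    duration: float  # fixation duration, in seconds


def crop_centered(image: Sequence[Sequence[float]], cx: int, cy: int, size: int) -> List[List[float]]:
    """Return a size x size patch around (cx, cy), zero-padded outside the image."""
    half = size // 2
    height, width = len(image), len(image[0])
    patch = []
    for row in range(cy - half, cy - half + size):
        out_row = []
        for col in range(cx - half, cx - half + size):
            inside = 0 <= row < height and 0 <= col < width
            out_row.append(image[row][col] if inside else 0.0)
        patch.append(out_row)
    return patch


def tokenize_scanpath(image: Sequence[Sequence[float]], scanpath: Sequence[Fixation], size: int = 4) -> List[Dict]:
    """Build one token per fixation: temporal order, local patch, and duration."""
    tokens = []
    for order, fix in enumerate(scanpath):
        tokens.append({
            "order": order,  # position of the fixation in the sequence
            "patch": crop_centered(image, fix.x, fix.y, size),
            "duration": fix.duration,
        })
    return tokens


if __name__ == "__main__":
    # A toy 5 x 6 grayscale "image" whose pixel value encodes its own location.
    image = [[float(r * 10 + c) for c in range(6)] for r in range(5)]
    scanpath = [Fixation(0, 0, 0.20), Fixation(3, 2, 0.35)]
    for token in tokenize_scanpath(image, scanpath):
        print(token["order"], token["duration"], token["patch"][2][2])
```

Because each patch is cropped around the fixation itself, the token content stays spatially aligned with the gaze location, and the duration travels with the token instead of being discarded. A real model would most likely operate on learned feature maps rather than raw pixels.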

Topics

Deep Learning > Architectures > Transformers
Computer Vision > Analysis > Human Analysis
Computer Vision > Processing > Video Understanding
Computer Vision > Analysis > Video Understanding
Artificial Intelligence > Core AI > Computer Vision
Computer Vision > Analysis > Computer Vision

Keywords

transformer architecture; spatio-temporal modeling; eye tracking; visual attention; scanpath prediction; visual working memory
