Open-Vocabulary Fine-Grained Hand Action Detection

Ting Zhe; Mengya Han; Xiaoshuai Hao; Yong Luo; Zheng He; Xiantao Cai; Jing Zhang

IJCAI 2025

Abstract

In this work, we address the new challenge of open-vocabulary fine-grained hand action detection, which aims to recognize hand actions from both known and novel categories using textual descriptions. Traditional hand action detection methods are limited to closed-set detection, making it difficult for them to generalize to new, unseen hand action categories. While current open-vocabulary detection (OVD) methods are effective at detecting novel objects, they face challenges with fine-grained action recognition, particularly when data is limited and heterogeneous. This often leads to poor generalization and performance bias between base and novel categories. To address these issues, we propose a novel approach, Open-FGHA (Open-vocabulary Fine-Grained Hand Action), which learns to distinguish fine-grained features across multiple modalities from limited heterogeneous data. It then identifies optimal matching relationships among these features, enabling accurate open-vocabulary fine-grained hand action detection. Specifically, we introduce three key components: Hierarchical Heterogeneous Low-Rank Adaptation, Bidirectional Selection and Fusion Mechanism, and Cross-Modality Query Generator. These components work in unison to enhance the alignment and fusion of multimodal fine-grained features. Extensive experiments demonstrate that Open-FGHA outperforms existing OVD methods, showing its strong potential for open-vocabulary hand action detection. The source code is available at OV-FGHAD.
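As a rough illustration of the open-vocabulary setting described above, the sketch below scores one detected hand region against text embeddings of category descriptions and picks the closest one, so a novel category only needs a new text embedding rather than retraining. This is a generic similarity-matching sketch, not the Open-FGHA method: the function names, the category labels, and the three-dimensional toy embeddings are all hypothetical, and a real system would obtain both kinds of features from trained vision and language encoders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify_region(region_feat, text_embeds):
    """Return the best-matching label and the per-label similarity scores."""
    scores = {label: cosine(region_feat, emb) for label, emb in text_embeds.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy text embeddings (hypothetical values). "pinch" and "grasp" stand in for
# base categories; "twist" stands in for a novel category added at test time.
text_embeds = {
    "pinch": [1.0, 0.0, 0.0],
    "grasp": [0.0, 1.0, 0.0],
    "twist": [0.0, 0.0, 1.0],
}

# Toy visual feature for one detected hand region (hypothetical values).
region_feat = [0.1, 0.2, 0.9]

label, scores = classify_region(region_feat, text_embeds)
print(label, scores)
```

Because the label set is just a dictionary of text embeddings, extending the vocabulary amounts to adding an entry, which is the property that separates open-vocabulary detection from closed-set detection.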

Topics

Artificial Intelligence > Core AI > Multimodal Learning
Machine Learning > Learning Types > Zero-Shot Learning
Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Human Analysis

Keywords

open-vocabulary detection; multimodal learning; hand action recognition; fine-grained detection; low-rank adaptation; heterogeneous data; fine-grained hand action detection
