Beyond Faces: A Multimodal Person Clustering for Unconstrained Environments

Sahngmin Yoo; Sangwon Lee; Seongin Jo

WACV 2026

Abstract

The increasing demand for on-device AI, driven by privacy concerns and the need for real-time processing, poses new challenges for fundamental computer vision tasks. This paper addresses one such task, person clustering in photo galleries, which has traditionally relied on server-side computation or simplistic on-device models. We introduce the Multimodal Person Clustering Architecture (MPCA), a research framework designed to explore the feasibility of a high-performance, multimodal clustering pipeline operating entirely under mobile constraints. Our framework makes three principal contributions: (1) Multimodal Appearance-Assisted Identity Recovery (MAIR), a late-fusion strategy that leverages temporal consistency to recover identities when facial data is unreliable; (2) Language-Guided Appearance Extractor (LGAE), which adapts a vision-language paradigm to construct robust appearance representations efficiently; and (3) Sequential Graph-Density Clustering (SGDC), a novel algorithm that combines graph-based and density-based methods to handle the high variance of appearance data. Extensive experiments show that our on-device framework achieves 87.97% average recall, outperforming leading cloud-based commercial systems such as Google Photos (77.74%) and on-device systems such as Apple Photos (67.84%) and Samsung Gallery (83.39%). This work provides a blueprint for future research in privacy-preserving, efficient, and robust person clustering, and points to a viable path for deploying next-generation computer vision applications directly on mobile devices.
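To make the idea of combining graph-based and density-based clustering concrete, the following is a minimal sketch and not the authors' SGDC, whose details the abstract does not give. It assumes a two-stage scheme: a graph stage that links embeddings whose cosine similarity exceeds a strict threshold and takes connected components, followed by a density stage that attaches leftover singletons to a cluster only if enough of that cluster's members lie within a looser threshold. All function names, thresholds, and sample vectors are hypothetical.

```python
from collections import Counter
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def graph_stage(vectors, strict_threshold):
    """Label points by connected component of the strict-similarity graph."""
    n = len(vectors)
    parent = list(range(n))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(vectors[i], vectors[j]) >= strict_threshold:
                parent[find(i)] = find(j)
    return [find(i) for i in range(n)]


def density_stage(vectors, labels, loose_threshold, min_neighbors):
    """Attach singletons to the cluster holding most of their loose neighbors."""
    sizes = Counter(labels)
    result = list(labels)
    for i, label in enumerate(labels):
        if sizes[label] > 1:
            continue  # already part of a multi-point cluster
        votes = Counter(
            labels[j]
            for j in range(len(vectors))
            if j != i
            and sizes[labels[j]] > 1
            and cosine(vectors[i], vectors[j]) >= loose_threshold
        )
        if votes:
            best, count = votes.most_common(1)[0]
            if count >= min_neighbors:  # dense enough support to merge
                result[i] = best
    return result


# Hypothetical 2-D "embeddings": two tight groups plus one straggler
# that resembles the first group but misses the strict threshold.
vectors = [
    [1.0, 0.0], [0.99, 0.05], [0.98, 0.1],  # person A
    [0.0, 1.0], [0.05, 0.99],               # person B
    [0.9, 0.35],                            # straggler near person A
]
coarse = graph_stage(vectors, strict_threshold=0.99)
labels = density_stage(vectors, coarse, loose_threshold=0.9, min_neighbors=2)
```

In this toy run the graph stage leaves the straggler as its own component, and the density stage then attaches it to person A because three members of that cluster fall within the looser threshold. The pairwise comparison is quadratic in the number of points, which is acceptable for a sketch but would need an approximate neighbor search at gallery scale.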
