LLAVIDAL: A Large LAnguage VIsion Model for Daily Activities of Living

Dominick Reilly; Rajatsubhra Chakraborty; Arkaprava Sinha; Manish Kumar Govind; Pu Wang; François Brémond; Le Xue; Srijan Das

CVPR 2025

Abstract

Current Large Language Vision Models (LLVMs) trained on web videos perform well in general video understanding but struggle with fine-grained details, complex human-object interactions (HOI), and view-invariant representation learning essential for Activities of Daily Living (ADL). This limitation stems from a lack of specialized ADL video instruction-tuning datasets and insufficient modality integration to capture discriminative action representations. To address this, we propose a semi-automated framework for curating ADL datasets, creating ADL-X, a multi-view, multi-modal RGBS instruction-tuning dataset. Additionally, we introduce LLAVIDAL, an LLVM integrating videos, 3D skeletons, and HOIs to model ADL's complex spatiotemporal relationships. For training LLAVIDAL, a simple joint alignment of all modalities yields suboptimal results; thus, we propose a Multimodal Progressive (MMPro) training strategy, incorporating modalities in stages following a curriculum. We also establish ADL MCQ and video description benchmarks to assess LLVM performance in ADL tasks. Trained on ADL-X, LLAVIDAL achieves state-of-the-art performance across ADL benchmarks. Code and data will be made publicly available at https://adl-x.github.io/.

Topics

Deep Learning > Architectures > Transformers
Deep Learning > Models > Large Language Models
Deep Learning > Models > Vision-Language Models
Computer Vision > Analysis > Action Recognition
Computer Vision > Analysis > Activity Recognition
Computer Vision > Analysis > Video Understanding
Computer Vision > Processing > Video Understanding
Computer Vision > Core AI > Multimodal Learning

Keywords

action recognition, temporal modeling, multimodal learning, video understanding, human-object interaction, instruction tuning, activity recognition, action representation, large language model, large language vision model, skeleton recognition
