Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data

Puneet Kumar; Shreshtha Misra; Zhuhong Shao; Bin Zhu; Balasubramanian Raman; Xiaobai Li

WACV 2025

Abstract

Motivated by depression's significant impact on global health, this work proposes MultiDepNet, a novel multimodal, interpretable depression detection system that integrates visual, physiological, audio, and textual data. It combines dedicated feature extraction methods (MTCNN for video, TS-CAN for physiological, ResNet-18 for audio, and RoBERTa for text modalities) with a strategic fusion of modality-specific networks, including CNN-RNN, Transformer, MLP, and ResNet-18, to achieve significant advancements in depression detection. Its performance, evaluated across four benchmark datasets (AVEC 2013, AVEC 2014, DAIC, and E-DAIC), demonstrates an average MAE of 5.64, RMSE of 7.15, accuracy of 74.19, precision of 0.7373, recall of 0.7378, and F1 of 0.7376. It also implements a MultiViz-based interpretability mechanism that computes each modality's contribution to the model's performance. The results reveal the visual modality to be the most significant, contributing 37.88% towards depression detection.
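To make the reported quantities concrete, the following is a minimal sketch of how the metrics named in the abstract (MAE, RMSE, precision, recall, F1) are conventionally computed, and of how per-modality importance scores can be normalized into percentage shares. It is not the authors' evaluation code and does not implement MultiViz; the function names, the toy inputs, and the idea of feeding generic non-negative importance scores into `contribution_shares` are all assumptions made for illustration.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error between true and predicted depression scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted depression scores."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def precision_recall_f1(labels, preds):
    """Binary precision, recall, and F1, with 1 as the positive class."""
    tp = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 1)
    fp = sum(1 for l, p in zip(labels, preds) if l == 0 and p == 1)
    fn = sum(1 for l, p in zip(labels, preds) if l == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def contribution_shares(importance):
    """Normalize non-negative per-modality importance scores to percentages.

    `importance` is a hypothetical dict such as {"visual": 3.0, "audio": 1.0};
    how the raw scores are obtained (MultiViz in the paper) is out of scope.
    """
    total = sum(importance.values())
    if total <= 0:
        raise ValueError("importance scores must sum to a positive value")
    return {name: 100.0 * score / total for name, score in importance.items()}


if __name__ == "__main__":
    # Toy values only; these are not results from the paper.
    print(mae([10, 20, 30], [12, 18, 33]))
    print(rmse([10, 20, 30], [12, 18, 33]))
    print(precision_recall_f1([1, 1, 0, 0], [1, 0, 1, 0]))
    print(contribution_shares({"visual": 3.0, "audio": 1.0}))
```

Note that averages reported across several datasets, as in the abstract, need not satisfy the per-dataset identity F1 = 2PR / (P + R) exactly, since the mean of per-dataset F1 scores is not the F1 of the mean precision and recall.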

Topics

Artificial Intelligence > Core AI > Interpretability
Artificial Intelligence > Core AI > Multimodal Learning
Computer Vision > Analysis > Biometrics
Computer Vision > Domain-Specific > Medical Imaging
Healthcare & Medicine > Clinical > Mental Health

Keywords

text analysis; multimodal learning; audio processing; physiological signal; depression detection; facial expression analysis; physiological signal processing; text sentiment analysis
