Multimodal Interpretable Depression Analysis Using Visual, Physiological, Audio, and Textual Data
Abstract
Motivated by depression's significant impact on global health, this work proposes MultiDepNet, a novel multimodal interpretable depression detection system integrating visual, physiological, audio, and textual data. Through dedicated feature extraction methods (MTCNN for video, TS-CAN for physiological, ResNet-18 for audio, and RoBERTa for text modalities) and a strategic fusion of modality-specific networks, including CNN-RNN, Transformer, MLP, and ResNet-18, it achieves significant advancements in depression detection. Its performance, evaluated across four benchmark datasets (AVEC 2013, AVEC 2014, DAIC, and E-DAIC), demonstrates an average MAE of 5.64, RMSE of 7.15, accuracy of 74.19, precision of 0.7373, recall of 0.7378, and F1 of 0.7376. It also implements a MultiViz-based interpretability mechanism that computes each modality's contribution to the model's performance. The results reveal the visual modality to be the most significant, contributing 37.88% towards depression detection.
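For readers less familiar with the reported metrics, the following is a minimal sketch of their standard definitions: MAE and RMSE for regression against a depression severity score, and precision, recall, and F1 for binary classification. The function names and the toy values are hypothetical and chosen only for illustration; this is not the evaluation code of MultiDepNet, and the abstract does not specify how metrics were aggregated across datasets.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted severity scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted severity scores."""
    squared = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return math.sqrt(squared / len(y_true))

def precision_recall_f1(y_true, y_pred):
    """Precision, recall, and F1 for binary labels (1 = depressed, 0 = not)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical toy data, not taken from any of the benchmark datasets.
scores_true = [10, 20, 30]
scores_pred = [12, 18, 33]
labels_true = [1, 0, 1, 1, 0]
labels_pred = [1, 1, 1, 0, 0]

print(mae(scores_true, scores_pred))
print(rmse(scores_true, scores_pred))
print(precision_recall_f1(labels_true, labels_pred))
```

Lower MAE and RMSE indicate closer agreement with the reference severity scores, while precision, recall, and F1 range from 0 to 1, with higher values indicating better classification.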