JAIT 2025 Vol.16(12): 1836-1854
doi: 10.12720/jait.16.12.1836-1854

An Adaptive Multi-Scale Feature Fusion Framework for Detecting Depression and Assessing Its Severity from Speech Signals

Raminder Kaur Nagra and Vikram Kulkarni *
Department of Information Technology, Mukesh Patel School of Technology Management & Engineering, Shri Vile Parle Kelvani Mandal’s Narsee Monjee Institute of Management Studies, Mumbai, India
Email: raminderkaur.nagra@nmims.edu (R.K.N.); vikram.kulkarni@nmims.edu (V.K.)
*Corresponding author

Manuscript received June 25, 2025; revised July 14, 2025; accepted August 28, 2025; published December 18, 2025.

Abstract—Depression is a chronic mental health disorder characterized by psychological, physical, and social impairments, often resulting in diminished interest in daily activities and, in severe cases, suicidal ideation. Recent physiological studies have revealed measurable differences in vocal attributes between depressed and non-depressed individuals, motivating the use of speech signal analysis for automatic depression detection. However, conventional approaches relying on Mel-frequency or Fourier-based spectrograms often fail to capture discriminative features for effective emotional classification. To address this limitation, this study proposes an optimized deep learning framework, the Adaptive Multi-scale Fused Feature-based Deep Network (AMF2-DNet), for both depression detection and severity assessment. Speech signals from the Distress Analysis Interview Corpus with Wizard-of-Oz (DAIC-WOZ) benchmark dataset undergo noise reduction and normalization, followed by feature extraction, including Mel-Frequency Cepstral Coefficients (MFCCs), spectral features, and deep features learned through a Sparse Autoencoder (SAE). These multi-scale features are integrated within a residual Bidirectional Recurrent Neural Network (Bi-RNN) architecture enhanced with a novel loss function for improved convergence. Further, the Refined Attacking Strategy of Giant Armadillo Optimization (RASGAO) dynamically optimizes network parameters and scale weights. Experimental results demonstrate that AMF2-DNet achieves 94.88% accuracy, an F1-Score of 84.50%, and a Matthews Correlation Coefficient (MCC) of 0.82, surpassing state-of-the-art baselines by up to 7.44% in accuracy. The proposed framework also exhibits strong robustness across noise levels and speaker variability. Future work will focus on real-time deployment using lightweight architectures, cross-lingual validation, and multimodal fusion with facial and physiological cues for enhanced clinical applicability.
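To illustrate the kind of hand-crafted spectral features the abstract mentions alongside MFCCs, the sketch below computes a per-frame spectral centroid with NumPy. This is a generic, minimal example, not the paper's implementation; the frame length, hop size, and sampling rate are arbitrary assumptions chosen for illustration.

```python
import numpy as np

def frame_signal(x, frame_len=512, hop=256):
    """Split a 1-D signal into overlapping frames, zero-padding the tail."""
    n = 1 + max(0, (len(x) - frame_len + hop - 1) // hop)
    pad = (n - 1) * hop + frame_len - len(x)
    x = np.pad(x, (0, pad))
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n)[:, None]
    return x[idx]

def spectral_centroid(x, sr, frame_len=512, hop=256):
    """Per-frame spectral centroid: magnitude-weighted mean frequency (Hz)."""
    frames = frame_signal(x, frame_len, hop) * np.hanning(frame_len)
    mag = np.abs(np.fft.rfft(frames, axis=1))
    freqs = np.fft.rfftfreq(frame_len, d=1.0 / sr)
    return (mag * freqs).sum(axis=1) / np.maximum(mag.sum(axis=1), 1e-12)

# Sanity check: a pure 440 Hz tone should give centroids near 440 Hz.
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440.0 * t)
cent = spectral_centroid(tone, sr)
```

In a pipeline like the one described, such frame-level descriptors would typically be aggregated per utterance (e.g. mean and variance) before fusion with MFCC and learned SAE features.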
 
Keywords—feature extraction, speech recognition, speech analysis, recurrent neural network, affective computing

Cite: Raminder Kaur Nagra and Vikram Kulkarni, "An Adaptive Multi-Scale Feature Fusion Framework for Detecting Depression and Assessing Its Severity from Speech Signals," Journal of Advances in Information Technology, Vol. 16, No. 12, pp. 1836-1854, 2025. doi: 10.12720/jait.16.12.1836-1854

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
