JAIT 2023 Vol.14(3): 550-558
doi: 10.12720/jait.14.3.550-558

Optimized Deep Neural Networks Audio Tagging Framework for Virtual Business Assistant

Fatma Sh. El-metwally 1, Ali I. Eldesouky 1, Nahla B. Abdel-Hamid 1, and Sally M. Elghamrawy 2,3,*
1. Department of Computer Engineering and Control Systems, Faculty of Engineering, Mansoura University, Mansoura, Egypt
2. Department of Computer Engineering, MISR Higher Institute for Engineering and Technology, Mansoura, Egypt
3. Scientific Research Group in Egypt (SRGE), Egypt
*Correspondence: sally@mans.edu.eg, sally_elghamrawy@ieee.org (S.M.E.)

Manuscript received July 1, 2022; revised August 12, 2022; accepted November 11, 2022; published June 16, 2023.

Abstract—A virtual assistant has a considerable impact on business and organizational development. It can be used to manage customer relations, handle incoming queries, and automatically reply to e-mails and phone calls. Audio signal processing has become increasingly popular since the development of virtual assistants, and advances in deep learning and audio signal processing have dramatically enhanced audio tagging. Audio Tagging (AT) is the task of eliciting descriptive labels from audio clips. This study proposes an Optimized Deep Neural Networks Audio Tagging Framework for Virtual Business Assistant to categorize and analyze audio tags. Several audio tagging features are extracted from each input signal, and the extracted features are fed into a neural network that performs multi-label classification to predict the tags. Optimization techniques are used to improve the quality of the model fit. To test the efficiency of the framework, four comparison experiments were conducted against existing approaches; the results show that the proposed framework outperforms them in efficiency. When the neural network was trained, Mel-Frequency Cepstral Coefficient (MFCC) features with the Adamax optimizer achieved the best results, with 93% accuracy and a loss of 0.17. When evaluated on seven labels, the model achieved an average precision of 0.952, recall of 0.952, F-score of 0.951, accuracy of 0.983, and equal error rate of 0.015 on the evaluation set, compared to the provided Detection and Classification of Acoustic Scenes and Events (DCASE) baseline, which achieved an accuracy of 72.5% and an equal error rate of 0.209.
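As a rough illustration of the pipeline the abstract describes (a minimal sketch under stated assumptions, not the authors' released code), the Python fragment below extracts MFCC features with librosa and trains a small multi-label classifier in Keras with the Adamax optimizer. The layer sizes, the 40-coefficient MFCC setting, the training hyperparameters, and the seven-label output are illustrative assumptions.

    # Sketch of the described pipeline: MFCC features -> DNN -> multi-label tags.
    # Assumes librosa and TensorFlow/Keras; all sizes are illustrative.
    import numpy as np
    import librosa
    import tensorflow as tf

    def extract_mfcc(path, n_mfcc=40):
        """Load a clip and summarize its MFCCs by averaging over time frames."""
        y, sr = librosa.load(path, sr=None)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, frames)
        return mfcc.mean(axis=1)  # fixed-length feature vector per clip

    NUM_TAGS = 7  # the paper evaluates seven labels

    # Multi-label classifier: sigmoid outputs with binary cross-entropy,
    # so each tag is predicted independently of the others.
    model = tf.keras.Sequential([
        tf.keras.layers.Input(shape=(40,)),
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.3),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(NUM_TAGS, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=1e-3),
                  loss="binary_crossentropy",
                  metrics=["binary_accuracy"])

    # X: (n_clips, 40) MFCC vectors; Y: (n_clips, 7) binary tag matrix.
    # Random placeholders stand in for a real tagged-audio dataset here.
    X = np.random.rand(32, 40).astype("float32")
    Y = (np.random.rand(32, NUM_TAGS) > 0.5).astype("float32")
    model.fit(X, Y, epochs=5, batch_size=8, verbose=0)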
 
Keywords—audio tagging, Deep Neural Networks (DNNs), optimizations, Detection and Classification of Acoustic Scenes and Events (DCASE)

Cite: Fatma Sh. El-metwally, Ali I. Eldesouky, Nahla B. Abdel-Hamid, and Sally M. Elghamrawy, "Optimized Deep Neural Networks Audio Tagging Framework for Virtual Business Assistant," Journal of Advances in Information Technology, Vol. 14, No. 3, pp. 550-558, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution-NonCommercial-NoDerivatives License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.