JAIT 2022 Vol.13(4): 393-397
doi: 10.12720/jait.13.4.393-397

Deep Learning System Based on the Separation of Audio Sources to Obtain the Transcription of a Conversation

Nahum Flores, Daniel Angeles, and Sebastian Tuesta
Faculty of Systems Engineering and Informatics, Universidad Nacional Mayor de San Marcos, Lima, Peru

Abstract—Podcasting has recently drawn attention as the fastest-growing media format, especially during the pandemic. This growth has highlighted the need to make podcasts accessible to diverse audiences, particularly people with auditory disabilities. Because current transcription methods have been unsatisfactory, we present an alternative method that transcribes audio files into text by first separating the audio sources. The methodology includes the construction of a public audio dataset with a total duration of more than 15 hours. The model was trained under three scenarios that varied the duration of the training data in order to determine the best performance, which was 10.77 in terms of the scale-invariant signal-to-noise ratio. To support podcast accessibility, we have made available the source code of each component that we developed.
Index Terms—public dataset, deep learning, audio source separation, speech to text
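
The abstract reports separation quality as a scale-invariant signal-to-noise ratio (SI-SNR). As a minimal sketch of how that metric is commonly computed, the snippet below follows the standard definition: project the zero-mean estimate onto the zero-mean reference, treat the residual as noise, and express the energy ratio in decibels. The function name `si_snr`, the `eps` stabilizer, and the pure-Python list representation are illustrative assumptions, not the authors' implementation.

```python
import math


def si_snr(estimate, reference, eps=1e-8):
    """Scale-invariant SNR in dB between an estimated and a reference signal.

    Hypothetical helper following the commonly used definition; `eps` is an
    assumed stabilizer that avoids division by zero and log of zero.
    """
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    n = len(reference)

    # Remove the mean so that a constant offset does not affect the score.
    est_mean = sum(estimate) / n
    ref_mean = sum(reference) / n
    est = [x - est_mean for x in estimate]
    ref = [x - ref_mean for x in reference]

    # Project the estimate onto the reference to obtain the target component.
    dot = sum(e * r for e, r in zip(est, ref))
    ref_energy = sum(r * r for r in ref) + eps
    scale = dot / ref_energy
    target = [scale * r for r in ref]

    # Whatever is left after removing the target component counts as noise.
    noise = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    noise_energy = sum(v * v for v in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


if __name__ == "__main__":
    # Synthetic signals for illustration only; they are not from the paper's dataset.
    ref = [math.sin(0.1 * i) for i in range(100)]
    est = [r + 0.1 * math.cos(0.37 * i) for i, r in enumerate(ref)]
    print(si_snr(est, ref))
    print(si_snr([2.0 * x for x in est], ref))  # rescaling the estimate leaves the score essentially unchanged
```

Because the estimate is projected onto the reference before the ratio is taken, multiplying the estimate by a constant gain does not change the score, which is why the metric suits separation models whose output level is arbitrary.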

Cite: Nahum Flores, Daniel Angeles, and Sebastian Tuesta, "Deep Learning System Based on the Separation of Audio Sources to Obtain the Transcription of a Conversation," Journal of Advances in Information Technology, Vol. 13, No. 4, pp. 393-397, August 2022.

Copyright © 2022 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.