JAIT 2025 Vol.16(10): 1470-1478
doi: 10.12720/jait.16.10.1470-1478

Indo-ASR+: Enhancing Indonesian Automatic Speech Recognition by Fine-Tuning Wav2Vec2 with FAdam

Irfan Darmawan 1, Alam Rahmatulloh 2,*, Rohmat Gunawan 2, R. Wahjoe Witjaksono 1, and Ghatan Fauzi Nugraha 3
1. Department of Information System, Telkom University, Bandung, Indonesia
2. Department of Informatics, Siliwangi University, Tasikmalaya, Indonesia
3. Siliwangi Artificial Intelligence Research Group, Siliwangi University, Tasikmalaya, Indonesia
Email: irfandarmawan@telkomuniversity.ac.id (I.D.); alam@unsil.ac.id (A.R.); rohmatgunawan@unsil.ac.id (R.G.); wahyuwicaksono@telkomuniversity.ac.id (R.W.W.); ghatan.fauzi.nurgraha@unj.ac.id (G.F.N.)
*Corresponding author

Manuscript received May 27, 2025; revised June 19, 2025; accepted July 28, 2025; published October 24, 2025.

Abstract—Automatic Speech Recognition (ASR) has become a key technology in human-machine interaction, especially for low-resource languages such as Bahasa Indonesia. Although deep learning-based models such as Wav2Vec2 perform well in speech recognition, further optimization is still needed to improve accuracy and training efficiency, especially in data-constrained and noisy environments. This research focuses on optimizing the Wav2Vec2 model for Indonesian ASR by applying the Fisher Adam (FAdam) optimizer. FAdam combines Natural Gradient Descent (NGD) with the Fisher Information Matrix (FIM) to improve learning stability, accelerate convergence, and reduce sensitivity to noise in the data. The model was trained on the Indonesian Common Voice dataset and achieved a Word Error Rate (WER) of 5.59% and a Character Error Rate (CER) of 1.76% on the validation set. Experimental results show that this approach not only improves accuracy over previous methods but also enhances training efficiency and the stability of model convergence compared to state-of-the-art models for Indonesian ASR such as XLSR-53 and XLS-R 300m. In addition, FAdam is shown to increase inference speed, making it better suited to ASR deployment in real-world scenarios. This research contributes to the development of more adaptive and efficient ASR technology for Indonesian, while opening up further optimization opportunities in self-supervised learning-based models.
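
The two metrics reported in the abstract, WER and CER, are conventionally defined as the edit distance between a reference transcript and the recognized text, divided by the reference length (counted in words for WER and in characters for CER). The following is a minimal Python sketch of that conventional definition, not the authors' evaluation code. The function names and the example sentences are hypothetical, and the sketch assumes plain whitespace tokenization with no text normalization.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences
    (counts substitutions, insertions, and deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word Error Rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character Error Rate: character-level edit distance / reference length.
    Spaces are counted as characters in this sketch (an assumption)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Hypothetical example: one wrong word out of four, one wrong character out of 19.
reference = "saya pergi ke pasar"
hypothesis = "saya pergi ke pasir"
print(f"WER = {wer(reference, hypothesis):.2%}")
print(f"CER = {cer(reference, hypothesis):.2%}")
```

Because a single wrong character makes the whole word count as an error, WER is usually higher than CER on the same output, which is consistent with the reported 5.59% WER and 1.76% CER.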
 
Keywords—Automatic Speech Recognition (ASR), Bahasa Indonesia, Character Error Rate (CER), Fisher Adam (FAdam), Wav2Vec2, Word Error Rate (WER)

Cite: Irfan Darmawan, Alam Rahmatulloh, Rohmat Gunawan, R. Wahjoe Witjaksono, and Ghatan Fauzi Nugraha, "Indo-ASR+: Enhancing Indonesian Automatic Speech Recognition by Fine-Tuning Wav2Vec2 with FAdam," Journal of Advances in Information Technology, Vol. 16, No. 10, pp. 1470-1478, 2025. doi: 10.12720/jait.16.10.1470-1478

Copyright © 2025 by the authors. This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited (CC BY 4.0).
