JAIT 2023 Vol.14(6): 1382-1389
doi: 10.12720/jait.14.6.1382-1389

Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper

Zhanibek Kozhirbayev
National Laboratory Astana, Nazarbayev University, Kazakhstan
Email: zhanibek.kozhirbayev@nu.edu.kz

Manuscript received May 30, 2023; revised June 19, 2023; accepted July 6, 2023; published December 14, 2023.

Abstract—In recent years, neural models trained on extensive multilingual text or speech data have shown great potential for improving the status of underresourced languages. This paper experiments with three state-of-the-art speech recognition models on the Kazakh language: Facebook’s Wav2Vec2.0 and Wav2Vec2-XLS-R, and OpenAI’s Whisper. The objective is to investigate how effectively these models transcribe Kazakh speech and to compare their performance with existing supervised Automatic Speech Recognition (ASR) systems. The study also explores the possibility of using data from other languages for pre-training and tests whether fine-tuning on target-language data can improve model performance. This work can thus provide insights into the effectiveness of pretrained multilingual models in underresourced language settings. The Wav2Vec2.0 model achieved a Character Error Rate (CER) of 2.8 and a Word Error Rate (WER) of 8.7 on the test set, which closely matches the best result achieved by the end-to-end Transformer model. The large Whisper model achieved a CER of approximately 4 on the test set. The results of this study can contribute to the development of robust and efficient ASR systems for the Kazakh language, benefiting various applications, including speech-to-text translation, voice assistants, and speech-based communication tools.
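
For readers unfamiliar with the two metrics reported above, the following is a minimal sketch of how CER and WER are commonly computed from the edit distance between a reference transcript and a model hypothesis. It is not the paper's scoring code: the function names are hypothetical, the values are expressed as percentages by assumption, spaces are counted as characters, and no text normalization (casing, punctuation) is applied.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i reference items into an empty hypothesis costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate in percent (assumption: whitespace tokenization)."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate in percent (assumption: spaces count as characters)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Hypothetical reference/hypothesis pair, for illustration only.
    reference = "сәлем әлем"
    hypothesis = "салем әлем"
    print(f"WER: {wer(reference, hypothesis):.1f}")
    print(f"CER: {cer(reference, hypothesis):.1f}")
```

Because a single wrong character makes the whole word count as an error, WER is typically higher than CER for the same output, which is consistent with the gap between the two figures reported for Wav2Vec2.0.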
 
Keywords—automatic speech recognition, Wav2Vec 2.0, Wav2Vec2-XLS-R, whisper, pretrained transformer models, speech representation models

Cite: Zhanibek Kozhirbayev, "Kazakh Speech Recognition: Wav2vec2.0 vs. Whisper," Journal of Advances in Information Technology, Vol. 14, No. 6, pp. 1382-1389, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.