Vietnamese News Articles Classification Using Neural Networks

To Nguyen Phuoc Vinh and Ha Hoang Kha
Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam
Vietnam National University Ho Chi Minh City, Ho Chi Minh City, Vietnam

Abstract—This paper introduces a new benchmark dataset of Vietnamese online news articles for a multi-label classification task. The dataset is collected from well-known Vietnamese news websites and organized into 30 topics, comparable to the way editors label their articles, which makes it well suited to training practical Vietnamese text classification applications. Furthermore, we modify the original Vietnamese text classification pipeline: rather than extracting a very high-dimensional term frequency-inverse document frequency (TF-IDF) feature vector and then applying feature selection algorithms, we reduce the dimension of the feature vectors based on term frequency across the whole corpus, and this reduction is folded into the TF-IDF weighting step. Although this lowers the computational complexity of the method, the input feature vectors are weaker because the feature selection steps are removed. Powerful neural network models are therefore used for classification, which keeps the performance as good as that of the original method, and even slightly better.
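
The feature-extraction idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `build_vocabulary` and `tfidf_vectors`, the toy pre-segmented documents, the smoothed IDF formula, and the L2 normalization are all assumptions made for illustration. The sketch keeps only the terms with the highest total count across the corpus and computes TF-IDF weights over that reduced vocabulary, so no separate feature selection pass is needed. The neural network classifier is omitted.

```python
import math
from collections import Counter


def build_vocabulary(tokenized_docs, max_features):
    """Keep the max_features terms with the highest total count in the corpus."""
    corpus_counts = Counter()
    for tokens in tokenized_docs:
        corpus_counts.update(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(corpus_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {term: idx for idx, (term, _) in enumerate(ranked[:max_features])}


def tfidf_vectors(tokenized_docs, vocabulary):
    """Compute L2-normalized TF-IDF vectors restricted to the given vocabulary."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter()
    for tokens in tokenized_docs:
        doc_freq.update(t for t in set(tokens) if t in vocabulary)
    # Smoothed IDF (an assumed variant; the paper may use a different formula).
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocabulary}

    vectors = []
    for tokens in tokenized_docs:
        counts = Counter(t for t in tokens if t in vocabulary)
        vec = [0.0] * len(vocabulary)
        for term, count in counts.items():
            vec[vocabulary[term]] = count * idf[term]
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            vec = [v / norm for v in vec]
        vectors.append(vec)
    return vectors


# Hypothetical toy corpus, already word-segmented (underscores join syllables).
docs = [
    ["bóng_đá", "việt_nam", "thắng"],
    ["kinh_tế", "việt_nam", "tăng_trưởng"],
    ["bóng_đá", "giải", "việt_nam"],
]
vocabulary = build_vocabulary(docs, max_features=2)
vectors = tfidf_vectors(docs, vocabulary)
```

In this sketch `max_features` plays the role of the dimension cut: the resulting vectors have exactly that many components and could be fed directly to a classifier.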
 
Index Terms—term frequency-inverse document frequency, neural network model, text classification, Vietnamese online news article dataset

Cite: To Nguyen Phuoc Vinh and Ha Hoang Kha, "Vietnamese News Articles Classification Using Neural Networks," Journal of Advances in Information Technology, Vol. 12, No. 4, pp. 363-369, November 2021. doi: 10.12720/jait.12.4.363-369

Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.