Corpus-Based Vocabulary List for Thai Language

Hathairat Ketmaneechairat 1,* and Maleerat Maliyaem 2
1. College of Industrial Technology, King Mongkut’s University of Technology North Bangkok, Thailand
2. Information Technology and Digital Innovation, King Mongkut’s University of Technology North Bangkok, Thailand; Email: maleerat.m@itd.kmutnb.ac.th (M.M.)
*Correspondence: hathairat.k@cit.kmutnb.ac.th (H.K.)

Manuscript received September 5, 2022; revised October 9, 2022; accepted January 3, 2023; published April 13, 2023.

Abstract—In natural language processing, a corpus is important both for training models and for the algorithms used to build machine learning models. This paper describes the design and process of creating a corpus-based vocabulary for the Thai language that can serve as a main corpus for natural language processing research. The corpus is created in accordance with the rules of the language, using the actual Word Usage Frequency (WUF) analyzed from a text corpus covering several types of content. The results present usage frequencies for several characteristics, namely word usage frequency, character usage frequency, and character bigram frequency. These statistics are used in this research and provide important information for further NLP research. Based on the findings, it was concluded that the average word length increases as the number of words in the corpus increases. This means that word length and frequency of words move in the same direction, that is, they are positively correlated.
Keywords—corpus-based vocabulary, Thai language, frequency of words, statistical data
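
As an illustration of the three statistics named in the abstract, the following minimal Python sketch counts word usage frequency, character usage frequency, and character bigram frequency. It is not the authors' implementation. It assumes the text has already been segmented into words (Thai is written without spaces between words, so a separate segmentation step is required), it treats each Unicode code point as one character, and it counts bigrams only within words. The function names `frequency_tables` and `average_word_length` and the sample data are hypothetical.

```python
from collections import Counter


def frequency_tables(tokenized_docs):
    """Return word, character, and character-bigram counts.

    tokenized_docs: iterable of documents, each a list of word strings
    that have already been segmented (an assumption of this sketch).
    """
    word_freq = Counter()
    char_freq = Counter()
    bigram_freq = Counter()
    for tokens in tokenized_docs:
        for word in tokens:
            word_freq[word] += 1
            # Each Unicode code point counts as one character.
            char_freq.update(word)
            # Adjacent character pairs inside the word only.
            bigram_freq.update(word[i:i + 2] for i in range(len(word) - 1))
    return word_freq, char_freq, bigram_freq


def average_word_length(word_freq):
    """Mean length in characters over all word occurrences."""
    total = sum(word_freq.values())
    if total == 0:
        return 0.0
    return sum(len(word) * count for word, count in word_freq.items()) / total


# Hypothetical, pre-segmented sample data.
docs = [["ภาษา", "ไทย"], ["ภาษา", "ไทย", "คน"]]
words, chars, bigrams = frequency_tables(docs)
avg_len = average_word_length(words)
```

Calling `words.most_common(n)` on the result would give the `n` most frequent words, which is one simple way to derive a frequency-ranked vocabulary list from such counts.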

