Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field

Kannikar Paripremkul and Ohm Sornil
Graduate School of Applied Statistics, National Institute of Development Administration (NIDA), Thailand

Abstract—Word segmentation is important to natural language processing tasks. Thai, like many other Asian languages, does not use word delimiters. Thai word segmentation therefore requires more than dividing a sequence of characters into meaningful words: each word must also be divided correctly with respect to the context of the sentence. With the popularity of social media, unknown, informal, and slang words are widely used, in addition to words adopted from other languages. Word segmentation methods, which are generally trained on formal corpora or dictionaries, do not perform well on such text. This research proposes a novel technique for Thai word segmentation in which the smallest units constituting words are first extracted and then combined into syllables using a Conditional Random Field. Words are then segmented by merging the syllables according to a set of rules learned from language characteristics. The technique is evaluated on both formal and informal datasets against a method based on a convolutional neural network, which currently gives the best performance for Thai word segmentation. The results show that the proposed method outperforms the baseline system, achieving F-scores of 0.9965 and 0.9857 for formal and informal text, respectively.
Index Terms—word segmentation, syllable segmentation, minimum text unit, conditional random field
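
For readers unfamiliar with how an F-score is obtained for a segmenter, the following is a minimal sketch of a common word-level scoring scheme. It is an assumption for illustration, not the evaluation code used in the paper: the function names `word_spans` and `segmentation_f_score` are hypothetical, and a predicted word is counted as correct only when both of its character boundaries match the reference segmentation.

```python
def word_spans(words):
    """Convert a list of words into a set of (start, end) character spans."""
    spans = set()
    position = 0
    for word in words:
        spans.add((position, position + len(word)))
        position += len(word)
    return spans


def segmentation_f_score(gold_words, predicted_words):
    """Word-level F-score: a predicted word counts as correct only if the
    same (start, end) span also appears in the reference segmentation."""
    gold = word_spans(gold_words)
    predicted = word_spans(predicted_words)
    correct = len(gold & predicted)
    if correct == 0:
        return 0.0
    precision = correct / len(predicted)
    recall = correct / len(gold)
    return 2 * precision * recall / (precision + recall)


# Illustrative input only (not taken from the paper's datasets):
# the reference has three words, the prediction wrongly merges the last two.
gold = ["ฉัน", "กิน", "ข้าว"]
predicted = ["ฉัน", "กินข้าว"]
print(segmentation_f_score(gold, predicted))
```

In this illustrative case only the first word is matched, so precision is 1/2 and recall is 1/3. Comparing character spans rather than word strings keeps a repeated word from being credited at the wrong position.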

Cite: Kannikar Paripremkul and Ohm Sornil, "Segmenting Words in Thai Language Using Minimum Text Units and Conditional Random Field," Journal of Advances in Information Technology, Vol. 12, No. 2, pp. 135-141, May 2021. doi: 10.12720/jait.12.2.135-141

Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.