Home > Published Issues > 2021 > Volume 12, No. 4, November 2021 >

Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms

Samia F. Abd-hood 1,2 and Nazlia Omar 1
1. Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
2. Hadhramout University, Hadhramout, Yemen

Abstract—Microblogging and social networking sites such as Twitter, Facebook, and Instagram, are becoming increasingly popular, registering more than 500 million posts each day. Twitter uses hashtags that are dynamic, user-generated text, preceded by a pound (#) symbol, to retrieve similar posts or topics, to mark events or to tag channels. Following segmentation, hashtags can be used for many Natural Language Processing (NLP) applications. These include sentiment analysis, text classification, named entity recognition, and sarcasm detection. This study delves into a comparison of three algorithms, namely the Viterbi, Triangular matrix and Word breaker algorithms, to determine the best among the three, for the segmentation of hashtags. These algorithms utilize different resources, to calculate the probability of the segmented parts, in order to rank the possible generated segmentations. For example, while the Viterbi and Triangular Matrix algorithms use two statistical corpora of unigram and bigram, the Word Breaker algorithm uses the n-gram language model. According to conducted experiment, the Viterbi algorithm is better for hashtag segmentation than the Triangular Matrix algorithm. This can be attributed to the manner in which the Viterbi algorithm conducts the backtracking. On the other hand, the Word Breaker algorithm, which can ascertain the meaningful tokens in the form of words, before proceeding with the segmentation of the remaining characters, is considered superior to both the Viterbi and Triangular Matrix algorithms, particularly when it comes to the detection of unknown words. Used together with the Good-Turing smoothing algorithm, the Word Breaker algorithm achieved 86.64% f1-score on a large language model.
 
Index Terms—hashtag segmentation, twitter, viterbi algorithm, word breaker algorithm, triangular matrix algorithm

Cite: Samia F. Abd-hood and Nazlia Omar, "Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms," Journal of Advances in Information Technology, Vol. 12, No. 4, pp. 311-318, November 2021. doi: 10.12720/jait.12.4.311-318

Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.