Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms
Samia F. Abd-hood 1,2 and Nazlia Omar 1
1. Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
2. Hadhramout University, Hadhramout, Yemen
Abstract
Microblogging and social networking sites such as Twitter, Facebook, and Instagram are becoming increasingly popular, registering more than 500 million posts each day. Twitter uses hashtags, which are dynamic, user-generated text preceded by a pound (#) symbol, to retrieve similar posts or topics, to mark events, or to tag channels. Once segmented, hashtags can be used in many Natural Language Processing (NLP) applications, including sentiment analysis, text classification, named entity recognition, and sarcasm detection. This study compares three algorithms, namely the Viterbi, Triangular Matrix, and Word Breaker algorithms, to determine which is best for hashtag segmentation. These algorithms use different resources to calculate the probability of the segmented parts in order to rank the candidate segmentations. For example, while the Viterbi and Triangular Matrix algorithms use two statistical corpora, of unigrams and bigrams, the Word Breaker algorithm uses an n-gram language model. According to the experiments conducted, the Viterbi algorithm is better for hashtag segmentation than the Triangular Matrix algorithm, which can be attributed to the manner in which the Viterbi algorithm conducts backtracking. The Word Breaker algorithm, which can identify meaningful tokens in the form of words before proceeding with the segmentation of the remaining characters, is considered superior to both the Viterbi and Triangular Matrix algorithms, particularly in detecting unknown words. Used together with the Good-Turing smoothing algorithm, the Word Breaker algorithm achieved an F1-score of 86.64% on a large language model.
Index Terms
hashtag segmentation, Twitter, Viterbi algorithm, Word Breaker algorithm, Triangular Matrix algorithm
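To make the idea of ranking candidate segmentations by probability concrete, the following is a minimal sketch of a Viterbi-style dynamic-programming segmenter. It is not the authors' implementation: it uses unigram probabilities only (the abstract mentions both unigram and bigram corpora), the counts in `UNIGRAM_COUNTS` are invented for illustration, and the unknown-word penalty in `word_log_prob` is an assumption chosen so that long unknown strings score poorly.

```python
import math

# Toy unigram counts, invented purely for illustration (not from the paper).
UNIGRAM_COUNTS = {
    "hash": 20, "tag": 30, "hashtag": 50, "segmentation": 10,
    "new": 80, "year": 60, "eve": 15, "a": 200, "on": 150,
}
TOTAL = sum(UNIGRAM_COUNTS.values())
MAX_WORD_LEN = 20  # assumed upper bound on the length of a single word


def word_log_prob(word):
    """Log-probability of a word under the toy unigram model."""
    count = UNIGRAM_COUNTS.get(word)
    if count is not None:
        return math.log(count / TOTAL)
    # Assumed penalty for unknown words: shrinks tenfold per character.
    return math.log(1.0 / (TOTAL * 10 ** len(word)))


def segment(hashtag):
    """Return the highest-scoring split of a hashtag into words."""
    text = hashtag.lstrip("#").lower()
    n = len(text)
    # best[i] is the best log-probability of any segmentation of text[:i].
    best = [0.0] + [-math.inf] * n
    # back[i] is the start index of the last word in that segmentation.
    back = [0] * (n + 1)
    for end in range(1, n + 1):
        for start in range(max(0, end - MAX_WORD_LEN), end):
            score = best[start] + word_log_prob(text[start:end])
            if score > best[end]:
                best[end] = score
                back[end] = start
    # Backtrack from the end of the string to recover the words.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("#newyeareve"))  # ['new', 'year', 'eve']
print(segment("#hashtag"))     # ['hashtag']
```

With these toy counts, `hashtag` stays whole because its single-word probability exceeds the product of the probabilities of `hash` and `tag`. A bigram extension would score each word conditioned on the previous one, and a smoothing method such as Good-Turing would replace the ad hoc unknown-word penalty assumed here.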
Cite: Samia F. Abd-hood and Nazlia Omar, "Hashtag Segmentation: A Comparative Study Involving the Viterbi, Triangular Matrix and Word Breaker Algorithms," Journal of Advances in Information Technology, Vol. 12, No. 4, pp. 311-318, November 2021. doi: 10.12720/jait.12.4.311-318
Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.