JAIT 2023 Vol.14(4): 863-875
doi: 10.12720/jait.14.4.863-875

PWMStem: A Corpus-Based Suffix Identification and Stripping Algorithm for Multi-lingual Stemming

Abdul Jabbar 1, Manzoor Illahi 1, Sajid Iqbal 2, Amjad Rehman Khan 3,*, Narmine ElHakim 3, and Tanzila Saba 3
1. Department of Computer Science, COMSATS University Islamabad (CUI), Main Campus, Park Road, Tarlai Kalan, Islamabad 45550, Pakistan; Email: a.jabbar73@hotmail.com (A.J.), tamimy@comsats.edu.pk (M.I.)
2. Department of Information Systems, College of Computer Science and Information Technology, King Faisal University, Saudi Arabia; Email: siqbal@kfu.edu.sa (S.I.)
3. Artificial Intelligence and Data Analytics Lab, CCIS, Prince Sultan University, Riyadh 11586, Saudi Arabia; Email: nhakim@psu.edu.sa (N.E.), tsaba@psu.edu.sa (T.S.)
*Correspondence: arkhan@psu.edu.sa (A.R.K.)

Manuscript received April 24, 2023; revised June 14, 2023; accepted July 3, 2023; published August 28, 2023.

Abstract—Stemming is a common preprocessing method that maps all variants of a word to a standard stem to aid various Natural Language Processing (NLP) tasks. This work proposes a new unsupervised corpus-based stemmer that identifies candidate suffixes using pivot word matching. Candidate suffix statistics are then used to remove the potential suffixes. Next, lexical similarity is measured to cluster morphologically related words. Finally, the shortest word in each cluster is designated as the stem. To quantify the performance of the proposed method, it is compared with two corpus-based and two linguistic knowledge-based stemmers for Urdu and English. The performance of each stemmer is evaluated on two different datasets for each language. The results show that the proposed PWMStem method outperforms the selected stemmers, achieving an accuracy of 0.876 for Urdu and 0.877 for English. Multiple evaluation metrics are used to assess different aspects of PWMStem's performance. For Urdu, the Index Compression Factor (ICF) is 73 and the Mean Number of Words per Conflation Class (MWC) is 3.7; for English, ICF = 71 and MWC = 3.5. In the Urdu dataset, PWMStem achieved the lowest Under-stemming Index (UI) of 0.026479, Over-stemming Index (OI) of 0.000021, and an Error Rate Relative to Truncation (ERRT) of 0.610. In the English dataset, the values for UI, OI, and ERRT were 0.102089, 0.000015, and 0.498, respectively.
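The four steps named in the abstract (pivot word matching, suffix statistics, clustering, shortest word as stem) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the matching rule, the frequency threshold, or the similarity measure, so everything below is an assumption. Here a "pivot" is taken to be any vocabulary word that is a proper prefix of another word, a suffix is accepted if it occurs at least `min_count` times, and exact equality of the stripped forms stands in for the lexical similarity test. The function names and the parameters `min_count` and `min_stem` are hypothetical. The helper `icf_and_mwc` uses the commonly cited definitions ICF = (N - S) / N and MWC = N / S, with N unique words and S unique stems; these are consistent with the reported figures, since 1 - 1/3.7 ≈ 0.73 and 1 - 1/3.5 ≈ 0.71.

```python
from collections import Counter, defaultdict


def candidate_suffixes(words, min_stem=3):
    """Count candidate suffixes via a simplified pivot word match.

    Assumption: if a vocabulary word (the pivot) is a proper prefix of
    another word, the remainder of that word is a candidate suffix.
    """
    vocab = set(words)
    counts = Counter()
    for word in vocab:
        for i in range(min_stem, len(word)):
            if word[:i] in vocab:
                counts[word[i:]] += 1
    return counts


def strip_suffix(word, suffixes, min_stem=3):
    """Remove the longest accepted suffix, keeping at least min_stem characters."""
    for i in range(min_stem, len(word)):
        if word[i:] in suffixes:  # longest remainder is tried first
            return word[:i]
    return word


def stem_vocabulary(words, min_count=2, min_stem=3):
    """Map each unique word to the shortest member of its cluster."""
    counts = candidate_suffixes(words, min_stem)
    # Suffix statistics: keep only suffixes seen often enough (assumed rule).
    suffixes = {s for s, c in counts.items() if c >= min_count}

    # Stand-in for lexical-similarity clustering: identical stripped forms.
    clusters = defaultdict(set)
    for word in set(words):
        clusters[strip_suffix(word, suffixes, min_stem)].add(word)

    stems = {}
    for members in clusters.values():
        stem = min(members, key=lambda m: (len(m), m))  # shortest word wins
        for word in members:
            stems[word] = stem
    return stems


def icf_and_mwc(stems):
    """Return (ICF, MWC) as fractions, using ICF = (N - S) / N and MWC = N / S."""
    n_words = len(stems)
    n_stems = len(set(stems.values()))
    return (n_words - n_stems) / n_words, n_words / n_stems


if __name__ == "__main__":
    toy_corpus = ["walk", "walks", "walked", "walking", "talk", "talks",
                  "talked", "jump", "jumps", "jumping", "sing"]
    stems = stem_vocabulary(toy_corpus)
    print(stems["walking"], stems["sing"])
    print(icf_and_mwc(stems))
```

On this toy English vocabulary the `min_stem` guard keeps "sing" intact even though "ing" is an accepted suffix. A real corpus would need a genuine similarity measure in place of the exact-match grouping, and Urdu input would need suitable tokenization and normalization first.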
 
Keywords—corpus-based stemming, morphology, natural language processing, Urdu stemmer, words inflection

Cite: Abdul Jabbar, Manzoor Illahi, Sajid Iqbal, Amjad Rehman Khan, Narmine ElHakim, and Tanzila Saba, "PWMStem: A Corpus-Based Suffix Identification and Stripping Algorithm for Multi-lingual Stemming," Journal of Advances in Information Technology, Vol. 14, No. 4, pp. 863-875, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.