JAIT 2022 Vol.13(5): 486-502
doi: 10.12720/jait.13.5.486-502

Effect of Features Extraction Techniques on Cyberstalking Detection Using Machine Learning Framework

Arvind Kumar Gautam and Abhishek Bansal
Department of Computer Science, Indira Gandhi National Tribal University, Amarkantak, M.P., India

Abstract—Cybercriminals pursue premeditated agendas to commit cybercrimes on the Internet. Cyberstalking, cyberbullying, cyber terrorism, hacking, data leakage, identity theft, phishing, and other types of cyber harassment occur continually in the virtual world. Cyberstalking and cyberbullying are closely related in content and intent: both use the same internet-based technology to harass, bully, and undermine others online. This paper implemented a cyberstalking detection model and analyzed the effect of various feature extraction techniques on different machine learning classifiers for cyberstalking detection. For feature extraction, the proposed model applied Bag of Words (BOW), TF-IDF, Word2Vec, FastText, GloVe, ELMo, and BERT. Logistic Regression (LR), Support Vector Machine (SVM), K-Nearest Neighbor (KNN), Random Forest (RF), Naive Bayes (NB), and Decision Tree (DT) were used for classification. The effect of each feature extraction method on the performance of the detection model was determined from the results of the applied classifiers with that method. Experimental results show that BOW and TF-IDF outperformed the advanced word embedding-based feature extraction methods. BOW with LR achieved the highest accuracy (95.7%), the highest precision (97.9%), and the highest F-score (97.3%). TF-IDF with NB achieved the highest recall (99.8%). The SVM classifier achieved the second-highest accuracy (95.2%) with TF-IDF. The BERT model obtained maximum accuracies of 90.9% and 90.7% with LR and SVM, respectively. The ELMo model also performed well, with maximum accuracies of 90.5% and 90.2% for LR and SVM, respectively. The SkipGram model of Word2Vec provided an accuracy of 85% for the LR classifier. GloVe provided 81.2% accuracy for the RF classifier. The SkipGram and CBOW models of FastText provided 85.7% and 82.2% accuracy, respectively, for the RF classifier.
Index Terms—features extraction, word embedding, machine learning, cyberstalking detection, cyberbullying, bag of words, TF-IDF, Word2Vec, GloVe, FastText, ELMo, BERT
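
As a rough illustration of the two count-based representations that the abstract reports as the best performers, the sketch below builds BOW and TF-IDF vectors for a toy corpus using only the Python standard library. The three sentences, the whitespace tokenizer, and the unsmoothed weighting tf × log(N/df) are assumptions chosen for illustration; they are not the paper's dataset, preprocessing, or exact TF-IDF variant.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration only (not the paper's data).
docs = [
    "i will find you",
    "see you at lunch",
    "i will find where you live",
]

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()

def build_vocab(tokenized):
    """Sorted list of every distinct token in the corpus."""
    return sorted({tok for doc in tokenized for tok in doc})

def bow_vector(tokens, vocab):
    """Bag of Words: raw count of each vocabulary term in one document."""
    counts = Counter(tokens)
    return [counts[term] for term in vocab]

def tfidf_vectors(tokenized, vocab):
    """TF-IDF with tf = count / doc length and idf = log(N / df).

    This is one common textbook variant, assumed here; other variants
    smooth or rescale the idf term.
    """
    n_docs = len(tokenized)
    df = {term: sum(1 for doc in tokenized if term in doc) for term in vocab}
    idf = {term: math.log(n_docs / df[term]) for term in vocab}
    vectors = []
    for doc in tokenized:
        counts = Counter(doc)
        vectors.append([counts[term] / len(doc) * idf[term] for term in vocab])
    return vectors

tokenized = [tokenize(d) for d in docs]
vocab = build_vocab(tokenized)
bow = [bow_vector(doc, vocab) for doc in tokenized]
tfidf = tfidf_vectors(tokenized, vocab)

for term, count, weight in zip(vocab, bow[0], tfidf[0]):
    print(f"{term:>6}  bow={count}  tfidf={weight:.3f}")
```

In this variant a term that occurs in every document, such as "you" here, receives a TF-IDF weight of zero, while BOW still counts it. Either set of vectors could then be passed to a classifier such as LR or SVM.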
Cite: Arvind Kumar Gautam and Abhishek Bansal, "Effect of Features Extraction Techniques on Cyberstalking Detection Using Machine Learning Framework," Journal of Advances in Information Technology, Vol. 13, No. 5, pp. 486-502, October 2022.

Copyright © 2022 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.