JAIT 2023 Vol.14(5): 934-940
doi: 10.12720/jait.14.5.934-940

Handling Class Imbalance in Google Cluster Dataset Using a New Hybrid Sampling Approach

Jyoti Shetty* and G. Shobha
Department of Computer Science, RV College of Engineering, Bangalore, India; Email: shobhag@rvce.edu.in (G.S.)
*Correspondence: jyothis@rvce.edu.in (J.S.)

Manuscript received September 17, 2022; revised November 12, 2022; accepted December 22, 2022; published September 18, 2023.

Abstract—Class imbalance is a classical problem in data mining, in which the classes in a dataset have disproportionate numbers of instances. Most machine learning methods fail to work properly on an imbalanced dataset. Various approaches exist to balance a dataset, but they suffer from issues such as overfitting and information loss. This manuscript proposes a novel, improved cluster-based undersampling method for handling two-class and multi-class imbalanced datasets. An ensemble learning algorithm integrated with the pre-processing technique is used to address the class imbalance problem. The proposed approach is tested on a publicly available imbalanced Google cluster dataset. For an imbalanced dataset, the F1-score of each class has to be checked. The existing approaches yielded a poor F1-score for class 0, whereas the proposed algorithm achieved balanced F1-scores of 0.97 for class 0 and 0.96 for class 1, an improvement in F1-score of about 2% over the existing technique. Similarly, for the multi-class problem, the proposed algorithm gave balanced AUC values of 0.87, 0.83 and 0.97 for class 0, class 1 and class 2, respectively.
Keywords—imbalanced dataset, hybrid sampling, Google cluster
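
The abstract notes that, for an imbalanced dataset, the F1-score has to be checked for each class. A minimal sketch of why this matters follows. It is an illustration only, not the authors' method or evaluation code: the function names `per_class_f1` and `accuracy` and the toy label counts are assumptions made for this example and do not come from the Google cluster dataset.

```python
def per_class_f1(y_true, y_pred):
    """Return {label: F1}, computed one-vs-rest for every label seen."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0.
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy labels (an assumption): 5 samples of one class, 95 of the other.
y_true = [0] * 5 + [1] * 95
# A degenerate classifier that always predicts the majority class.
y_pred = [1] * 100

acc = accuracy(y_true, y_pred)
f1 = per_class_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}")
for label, score in f1.items():
    print(f"class {label}: F1 = {score:.3f}")
```

In this toy case the overall accuracy looks high even though the rarer class is never predicted, so its F1-score is zero. Reporting the score per class exposes the failure that a single aggregate number hides.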

Cite: Jyoti Shetty and G. Shobha, "Handling Class Imbalance in Google Cluster Dataset Using a New Hybrid Sampling Approach," Journal of Advances in Information Technology, Vol. 14, No. 5, pp. 934-940, 2023.

Copyright © 2023 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.