A Study of Job Failure Prediction at Job Submit-State and Job Start-State in High-Performance Computing System: Using Decision Tree Algorithms

Anupong Banjongkan, Watthana Pongsena, Nittaya Kerdprasop, and Kittisak Kerdprasop
School of Computer Engineering, Suranaree University of Technology (SUT), Thailand

Abstract—In High-Performance Computing (HPC) systems, job failure is a major problem because it wastes computation time, resources, and power, and it significantly degrades the overall efficiency of the system. In this paper, we propose two sets of models that predict job failure at two points: the job submit-state and the job start-state. The models can serve as guiding tools that help HPC users make efficient decisions when managing their job submissions, and thus improve the efficiency of the HPC system at the job level. In the evaluation stage, we conduct a comparative study of job failure predictive models built with three decision-tree induction techniques: C5.0, Classification and Regression Tree (CART), and Chi-square Automatic Interaction Detector (CHAID). The models are trained and tested on two workload logs, collected from the HPC systems at the National Electronics and Computer Technology Center (NECTEC), Thailand, and the Los Alamos National Laboratory (LANL), USA. For prediction at both the job submit-state and the job start-state, the results show that models built with the C5.0 algorithm provide the highest prediction accuracy (around 85% for the NECTEC dataset and 87% for the LANL dataset). Comparing the two job states, failure prediction at the job start-state is slightly more accurate than prediction at the job submit-state (an accuracy improvement of around 1.45% for the NECTEC dataset and 0.46% for the LANL dataset). However, when both model performance and the overhead of job waiting time are considered, job failure prediction at the job submit-state provides the best efficiency.
 
Index Terms—decision tree, high-performance computing, job failure prediction, workload log
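
As an illustration of the comparison described in the abstract, the following minimal sketch picks the best single split for a job-failure label using the Gini impurity criterion associated with CART. It is not the authors' implementation. The feature names (`cores`, `req_hours`, `wait_min`), the six toy jobs, and the assumption that waiting time only becomes available at the start-state are all hypothetical and are not taken from the NECTEC or LANL logs. C5.0 and CHAID use different split criteria (information-gain-based and chi-square-based, respectively) and are not shown.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means a pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(rows, labels, features):
    """Return (feature, threshold, weighted_gini) for the best binary split.

    Candidate thresholds are midpoints between consecutive distinct values.
    If no split improves on the unsplit node, feature and threshold are None.
    """
    n = len(rows)
    best = (None, None, gini(labels))
    for feature in features:
        values = sorted({row[feature] for row in rows})
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
            right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / n
            if score < best[2]:
                best = (feature, threshold, score)
    return best

# Hypothetical toy jobs, invented for illustration only.
jobs = [
    {"cores": 4, "req_hours": 1, "wait_min": 2},
    {"cores": 8, "req_hours": 2, "wait_min": 5},
    {"cores": 16, "req_hours": 4, "wait_min": 40},
    {"cores": 32, "req_hours": 8, "wait_min": 3},
    {"cores": 64, "req_hours": 12, "wait_min": 60},
    {"cores": 128, "req_hours": 24, "wait_min": 90},
]
labels = ["ok", "ok", "failed", "ok", "failed", "failed"]

# Assumed feature sets: the start-state adds information that is only
# known once the job leaves the queue (here, the waiting time).
SUBMIT_FEATURES = ["cores", "req_hours"]
START_FEATURES = SUBMIT_FEATURES + ["wait_min"]

submit_split = best_split(jobs, labels, SUBMIT_FEATURES)
start_split = best_split(jobs, labels, START_FEATURES)
print("submit-state best split:", submit_split)
print("start-state best split:", start_split)
```

The toy data is constructed so that the extra start-state feature yields a purer split. That outcome comes from the made-up values and says nothing about the size of the effect reported in the paper.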

Cite: Anupong Banjongkan, Watthana Pongsena, Nittaya Kerdprasop, and Kittisak Kerdprasop, "A Study of Job Failure Prediction at Job Submit-State and Job Start-State in High-Performance Computing System: Using Decision Tree Algorithms," Journal of Advances in Information Technology, Vol. 12, No. 2, pp. 84-92, May 2021. doi: 10.12720/jait.12.2.84-92

Copyright © 2021 by the authors. This is an open access article distributed under the Creative Commons Attribution License (CC BY-NC-ND 4.0), which permits use, distribution and reproduction in any medium, provided that the article is properly cited, the use is non-commercial and no modifications or adaptations are made.