Mitigating Concept Drift in Job Execution Time Prediction Models

Abstract


This work investigates how to mitigate Concept Drift in machine learning models that predict job execution time in High Performance Computing (HPC) systems. Using SLURM logs from a Petrobras supercomputer, the study assesses the impact of periodic evaluation with retraining on such a model. Our experiments show that Concept Drift substantially degrades model reliability over time. A strategy of daily model evaluation, with retraining triggered when the Mean Absolute Percentage Error (MAPE) exceeds 150%, effectively mitigates the drift and significantly reduces the average MAPE across different periods. The results underscore the need for daily evaluation to maintain acceptable predictive performance in such dynamic settings. The study thus offers a practical way to mitigate Concept Drift, improving the applicability of machine learning models in HPC systems and showing their potential to optimize processes in critical sectors such as oil and gas.
Keywords: Concept Drift, Machine Learning, Classification Tree, C4.5
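
The daily evaluate-and-retrain strategy described in the abstract can be illustrated with a minimal sketch. Only the 150% MAPE threshold and the daily cadence come from the abstract. Everything else is an assumption made for illustration: the `MeanRuntimeModel` stand-in (the study's keywords point to a C4.5 classification tree, which is not reproduced here), the choice to retrain on the most recent day only, and the toy runtimes.

```python
from statistics import mean

MAPE_THRESHOLD = 150.0  # percent; the retraining trigger stated in the abstract


def mape(actual, predicted):
    """Mean Absolute Percentage Error in percent. Assumes every actual runtime is > 0."""
    return 100.0 * mean(abs(a - p) / a for a, p in zip(actual, predicted))


class MeanRuntimeModel:
    """Hypothetical stand-in predictor: always predicts the mean training runtime."""

    def __init__(self):
        self.value = 0.0

    def fit(self, runtimes):
        self.value = mean(runtimes)

    def predict(self, n):
        return [self.value] * n


def run_daily_evaluation(days, model, threshold=MAPE_THRESHOLD):
    """Evaluate once per day and retrain when that day's MAPE exceeds the threshold.

    `days` holds one list of observed runtimes (seconds) per day.
    Returns a list of (day_index, mape_percent, retrained) tuples.
    """
    history = []
    for i, runtimes in enumerate(days):
        error = mape(runtimes, model.predict(len(runtimes)))
        retrained = error > threshold
        if retrained:
            # Assumption: retrain on the most recent day only.
            model.fit(runtimes)
        history.append((i, error, retrained))
    return history


model = MeanRuntimeModel()
model.fit([100.0, 120.0, 80.0])   # initial training window (toy data)
days = [
    [100.0, 110.0, 90.0],         # same regime as the training data
    [10.0, 20.0, 40.0],           # drift: jobs become much shorter
    [10.0, 20.0, 40.0],           # same regime, now after retraining
]
history = run_daily_evaluation(days, model)
for day, error, retrained in history:
    print(f"day {day}: MAPE = {error:.1f}%, retrained = {retrained}")
```

In this toy run the error stays low on the first day, jumps above the threshold when the workload shifts on the second day (which triggers retraining), and falls back below the threshold on the third. A real deployment would replace the stand-in with the actual predictor and choose the retraining window deliberately.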

Published
29/09/2025
GALLO, Bernardo; DRUMMOND, Lúcia; VITERBO, José. Mitigating Concept Drift in Job Execution Time Prediction Models. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 13., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 81-88. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2025.247724.