Statistical Process Control for Supporting OS-level Failure Prediction

  • João R. Campos University of Coimbra
  • Rodrigo Pato Nogueira University of Coimbra

Abstract

Software systems execute critical tasks on a daily basis, and their failures can easily lead to significant losses or even loss of life. Online Failure Prediction (OFP) attempts to predict incoming failures from the current state of the system. It relies on the premise that symptoms (i.e., some misbehavior of the system) precede a failure. However, characterizing the (mis)behavior of a complex system remains an open issue. How can we know whether failure predictors are actually modeling the symptoms rather than just identifying correlations in the data? In this work, we explore the use of Statistical Process Control (SPC) to characterize the stability and instability of the Linux Operating System (OS).
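
As a rough illustration of what an SPC-style stability check can look like, the sketch below builds a basic Shewhart-type control chart: control limits are estimated from a baseline window assumed to be stable, and later samples falling outside those limits are flagged as out of control. This is not the authors' method. The function names, the metric values, and the choice of three standard deviations around the sample mean are assumptions made for the example only (individuals charts often estimate the spread from moving ranges instead).

```python
from statistics import mean, stdev


def control_limits(baseline, k=3.0):
    """Estimate (lower limit, center line, upper limit) from baseline data.

    Assumption: the baseline was collected while the system was stable,
    and the sample standard deviation is an acceptable spread estimate.
    """
    center = mean(baseline)
    sigma = stdev(baseline)
    return center - k * sigma, center, center + k * sigma


def out_of_control(samples, lcl, ucl):
    """Return the indices of samples that fall outside the control limits."""
    return [i for i, x in enumerate(samples) if x < lcl or x > ucl]


# Hypothetical values of a single OS-level metric (e.g., a utilization figure).
baseline = [50.0, 52.0, 49.0, 51.0, 50.0, 48.0, 51.0, 49.0]
lcl, center, ucl = control_limits(baseline)

# Hypothetical samples observed later, during monitoring.
monitored = [50.0, 51.0, 49.0, 58.0, 61.0]
flagged = out_of_control(monitored, lcl, ucl)
print(f"limits=({lcl:.2f}, {ucl:.2f}) flagged indices={flagged}")
```

Here the last two monitored samples exceed the upper limit, so they would be treated as signs of instability under this simplified rule.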

Published
24/05/2024
CAMPOS, João R.; NOGUEIRA, Rodrigo Pato. Statistical Process Control for Supporting OS-level Failure Prediction. In: WORKSHOP DE TESTES E TOLERÂNCIA A FALHAS (WTF), 25., 2024, Niterói/RJ. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 99-103. ISSN 2595-2684. DOI: https://doi.org/10.5753/wtf.2024.2912.