ABSTRACT
Given the complexity of modern software systems, it is no longer possible to detect every fault before deployment. Such faults can eventually lead to failures at runtime, compromising business processes and causing significant risk or losses. Online Failure Prediction (OFP) is a complementary fault-tolerance technique that attempts to predict failures in the near future using past data and the current state of the system. However, modern systems are composed of many components, and thus a proper characterization of their state requires hundreds of system metrics. As the system evolves over time, these data can be seen as multivariate time series, where the value of a system metric at a given time is related to its previous values. Although various techniques exist for leveraging this autocorrelation, they are often either simplistic (e.g., sliding windows) or too complex (e.g., Long Short-Term Memory (LSTM) networks). In this paper, we propose the use of numerical differentiation, computing the first and second derivatives, as a means of extracting information about the underlying function of each system metric to support the development of predictive models for OFP. We conduct a comprehensive case study using a Linux failure dataset that was generated through fault injection. Results suggest that numerical differentiation can be a promising approach to improve the performance of Machine Learning (ML) models for dependability-related problems with similar sequential characteristics (e.g., intrusion detection).
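To make the idea of derivative-based features concrete, the following is a minimal sketch of how first and second derivatives could be appended to a matrix of system metrics. It is not the implementation used in the paper: the function name `add_derivative_features`, the uniform sampling interval `dt`, and the choice of backward finite differences (which use only past samples, as an online setting would require) are all assumptions made for illustration.

```python
import numpy as np


def add_derivative_features(metrics, dt=1.0):
    """Append backward-difference derivatives to a (time, metric) matrix.

    Hypothetical helper for illustration; assumes samples are evenly
    spaced `dt` time units apart.
    """
    metrics = np.asarray(metrics, dtype=float)

    # First derivative: (x[t] - x[t-1]) / dt. Prepending the first row
    # makes the difference at t=0 equal to zero and keeps the length.
    d1 = np.diff(metrics, axis=0, prepend=metrics[:1]) / dt

    # Second derivative: backward difference of the first derivative.
    # The first two rows are edge values, not true second derivatives.
    d2 = np.diff(d1, axis=0, prepend=d1[:1]) / dt

    # Resulting columns: original metrics, then first, then second derivatives.
    return np.hstack([metrics, d1, d2])


if __name__ == "__main__":
    t = np.arange(5.0)
    # Two toy "metrics": one quadratic and one linear in time.
    sample = np.column_stack([t ** 2, 2.0 * t])
    print(add_derivative_features(sample))
```

Each original metric thus yields two extra columns that a conventional ML model can consume alongside the raw value, without requiring a sequence-aware architecture.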
Index Terms
- Leveraging Time Series Autocorrelation Through Numerical Differentiation for Improving Failure Prediction