Memory Error Driven Server Failure Detection

  • Rafael A. Silva UFC
  • Francisco Lucas F. Pereira UFC
  • Victor A. E. Farias UFC
  • Felipe T. Brito UFC
  • Javam C. Machado UFC

Abstract


The correct functioning of Dynamic Random Access Memory (DRAM) is fundamental to the operation of servers in data centers. Detecting server failures caused by memory errors is therefore a prerequisite for developing prediction methods that can avoid such failures and thus ensure the continuous availability of hosted services. In recent years, many authors have proposed machine learning-based methods to predict server failure from the occurrence of DRAM errors. However, previous work shows that this is a challenging task because of the lack of data and the irregularity with which memory errors occur. In this work, we use feature engineering to improve the classification accuracy of recurrent neural networks on irregularly sampled data, with the goal of more accurately identifying servers that are nearing a failure state.
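
To illustrate what feature engineering for irregularly sampled data can look like, the following is a minimal sketch that attaches the elapsed time since the previous event to each memory-error record, so that a recurrent model can see the gaps between observations. The abstract does not specify which features the authors engineer; the function name `add_time_gap_feature`, the `(timestamp, error_count)` record layout, and the sample values are all assumptions made for illustration only.

```python
from typing import List, Tuple

# Hypothetical record layout: (timestamp in seconds, error count in that log entry).
Event = Tuple[float, int]


def add_time_gap_feature(events: List[Event]) -> List[Tuple[float, int, float]]:
    """Return events sorted by time, each extended with the gap to the previous event.

    The first event has no predecessor, so its gap is set to 0.0 (an assumption;
    other conventions are possible).
    """
    ordered = sorted(events, key=lambda e: e[0])
    enriched = []
    prev_ts = None
    for ts, count in ordered:
        gap = 0.0 if prev_ts is None else ts - prev_ts
        enriched.append((ts, count, gap))
        prev_ts = ts
    return enriched


# Illustrative, made-up error log for one server: events arrive at uneven intervals.
sample = [(0.0, 1), (60.0, 2), (3660.0, 1)]
features = add_time_gap_feature(sample)
```

Each output tuple could then be fed as one time step to a sequence model such as an LSTM, with the gap acting as an explicit signal of sampling irregularity.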

Keywords: Classification, LSTM, Failure Detection, Memory

Published
17/11/2024
SILVA, Rafael A.; PEREIRA, Francisco Lucas F.; FARIAS, Victor A. E.; BRITO, Felipe T.; MACHADO, Javam C. Memory Error Driven Server Failure Detection. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 12., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 161-168. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2024.244764.