A Recovery Module for an Incident Management Assistant with Missing and Imbalanced Data
Abstract
In Information Technology (IT) Service Management, modern approaches leveraging Artificial Intelligence for IT Operations (AIOps) typically rely on abundant, high-quality observability data. However, many real-world environments operate under severe data constraints, such as scarce historical records and incomplete information. This paper introduces AIMA+, an incident management assistant designed specifically for these challenging conditions. Given an incident’s textual description as input, AIMA+ outputs a ranked list of predicted recovery actions, retrieves similar past incidents, and generates a synthesized textual summary to guide operators. Our core contribution is a methodology that addresses data incompleteness through a Large Language Model (LLM)-based data augmentation strategy and data scarcity with an interpretable, multi-pronged framework. Our experiments show that augmenting the dataset with expert-guided LLM classifications dramatically improved predictive performance, increasing the macro F1-score from a baseline of 0.2 to 0.6. This work presents a pragmatic blueprint for developing effective AIOps decision support tools in realistic, imperfect industrial settings.
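The macro F1-score reported above averages the per-label F1 without weighting by label frequency, so rarely seen recovery actions count as much as common ones. The following is a minimal sketch of that metric for multilabel predictions, not the authors' evaluation code; the function name `macro_f1` and the action labels `restart` and `rollback` are hypothetical and chosen only for illustration.

```python
from typing import List, Set


def macro_f1(y_true: List[Set[str]], y_pred: List[Set[str]]) -> float:
    """Unweighted mean of per-label F1 scores for multilabel predictions."""
    # Collect every label that appears in the ground truth or the predictions.
    labels = sorted(set().union(*y_true, *y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if label in t and label in p)
        fp = sum(1 for t, p in pairs if label not in t and label in p)
        fn = sum(1 for t, p in pairs if label in t and label not in p)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical, imbalanced toy data: "rollback" is rare and never predicted.
y_true = [{"restart"}, {"restart"}, {"restart"}, {"rollback"}]
y_pred = [{"restart"}, {"restart"}, {"restart"}, {"restart"}]

score = macro_f1(y_true, y_pred)
print(round(score, 4))  # 0.4286
```

In this toy case three of the four predictions are correct, yet the macro F1 is only about 0.43 because the missed rare label contributes an F1 of zero. This is why the metric suits imbalanced settings like the one described here.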
Keywords:
Incident Management, AIOps, Recovery Actions, Imbalanced Data, Large Language Models, Multilabel Classification
References
Chen, Z., et al. (2023). Chat-based Incident Triage with Retrieval Augmented Generation. In ESEC/FSE.
Du, M., et al. (2017). DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning. In CCS.
Nguyen, T. T., et al. (2012). Recommending Similar Bug Reports and Their Fixes. In ICSE.
Wang, X., et al. (2019). Learning to Predict and Mitigate Push-Caused Production Failures. In OSDI.
Yuan, Z., et al. (2021). NetSieve: A Scalable and Robust Causal Inference Framework for Incident Analysis. In SOSP.
Zhang, X., et al. (2019). LogRobust: A Robust Online System Log Anomaly Detection System. In USENIX Security Symposium.
Published
12/11/2025
How to Cite
BONATO, Alisson N.; MARQUES, Jade S. Hatanaka; KUMAR, Rajnish. A Recovery Module for an Incident Management Assistant with Missing and Imbalanced Data. In: ESCOLA REGIONAL DE APRENDIZADO DE MÁQUINA E INTELIGÊNCIA ARTIFICIAL DA REGIÃO SUL (ERAMIA-RS), 1., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 376-379. DOI: https://doi.org/10.5753/eramiars.2025.16747.