Categorization of Security Incidents Using Prompt Engineering in LLMs
Abstract
The growing complexity and volume of cybersecurity incidents have produced large amounts of unstructured data, making triage by human teams difficult. This work proposes applying Prompt Engineering to LLMs to categorize these incidents automatically. The methodology was tested on a real, anonymized dataset, evaluating the consistency of the classifications across different scenarios: free categorization and taxonomy-guided categorization (NIST), each with and without progressive refinement of the prompts. The results indicate that combining Progressive-hint Prompting (PHP) with a structured taxonomy favors semantic normalization, reduces ambiguity, and improves the reliability of the classifications, with a high degree of accuracy in incident categorization.
References
AXELOS (2025). What is IT service management. [link].
Cer, D., Yang, Y., yi Kong, S., Hua, N., Limtiaco, N., John, R. S., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., and Kurzweil, R. (2018). Universal sentence encoder.
CERT.br (2025). Incidentes notificados ao cert.br. [link].
Chen, J., Pan, X., Yu, D., Song, K., Wang, X., Yu, D., and Chen, J. (2023). Skills-in-context: Unlocking compositionality in large language models. arXiv preprint arXiv:2308.00304.
Chen, J., Tian, J., and Jin, Y. (2024). Self-hint prompting improves zero-shot reasoning in large language models via reflective cycle. In Proceedings of the 46th Annual Conference of the Cognitive Science Society.
Cichonski, P., Millar, T., Grance, T., and Scarfone, K. (2012). Computer security incident handling guide. Technical Report NIST Special Publication 800-61 Revision 2.
ENISA (2018). Reference incident classification taxonomy. [link].
F5 Networks (2024). Generative ai for threat modeling and incident response. [link].
FIRST (2025). First csirt services framework. [link].
Google (2025). Gemini models. [link].
Grispos, G. (2016). Cybercrime and Organizational Response: Exploring the Roles of Digital Forensics Investigations and Information Security Policy. PhD thesis.
Grispos, G., Glisson, W. B., and Storer, T. (2019). How good is your data? investigating the quality of data generated during security incident response investigations.
IBM (2024). O que é engenharia de prompt? [link].
Ibrishimova, M. D. (2019). Cyber incident classification: Issues and challenges. In Xhafa, F., Leu, F.-Y., Ficco, M., and Yang, C.-T., editors, Advances on P2P, Parallel, Grid, Cloud and Internet Computing.
Ji, H., Yang, J., Chai, L., Wei, C., Yang, L., Duan, Y., Wang, Y., Sun, T., Guo, H., Li, T., Ren, C., and Li, Z. (2024). Sevenllm: Benchmarking, eliciting, and enhancing abilities of large language models in cyber threat intelligence. arXiv preprint arXiv:2405.03446.
Kim, J.-y. and Kwon, H.-Y. (2022). Threat classification model for security information event management focusing on model efficiency. Computers & Security, 120:102789.
Li, Y., Tian, J., He, H., and Jin, Y. (2024). Hypothesis testing prompting improves deductive reasoning in large language models. arXiv preprint arXiv:2405.06707.
Lin, C.-Y. (2004). ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81.
Lukwaro, E., Kalegele, K., and Nyambo, D. (2024). A review on nlp techniques and associated challenges in extracting features from education data. International Journal of Computing and Digital Systems, 16:2210–142.
Meta (2025). LLAMA Models. [link].
Ming, Y., Yin, H., and Li, Y. (2021). On the impact of spurious correlation for out-of-distribution detection.
MITRE (2025). Att&ck matrix for enterprise. [link].
Molleti, R., Goje, V., Luthra, P., and Raghavan, P. (2024). Automated threat detection and response using llm agents. Journal of Advanced Research and Reviews, 24(2).
Nelson, A., Rekhi, S., Souppaya, M., and Scarfone, K. (2025). Incident response recommendations and considerations for cybersecurity risk management: A csf 2.0 community profile. Technical Report NIST SP 800-61r3.
OASIS (2025). Sharing threat intelligence just got a lot easier! [link].
Ogundairo, O. and Broklyn, P. (2024). Natural language processing for cybersecurity incident analysis. Journal of Cyber Security.
OpenAI (2024). GPT-4o mini: advancing cost-efficient intelligence. [link].
Patel, K., Shafiq, Z., Nogueira, M., Menasché, D., Lovat, E., Kashif, T., Woiwood, A., and Martins, M. (2024). Harnessing ti feeds for exploitation detection. In IEEE CSR.
Rastenis, J., Ramanauskaitė, S., Suzdalev, I., Tunaitytė, K., Janulevičius, J., and Čenys, A. (2021). Multi-Language Spam/Phishing Classification by Email Body Text: Toward Automated Security Incident Investigation. Electronics, 10(668).
Siegel, S. and Castellan Jr., N. J. (2006). Estatística não-paramétrica para ciências do comportamento. Artmed, Porto Alegre, 2nd edition.
Silva, G. C. and Westphall, C. B. (2024). A survey of large language models in cybersecurity. arXiv preprint arXiv:2402.16968.
VERIS (2025). Veris: The vocabulary for event recording and incident sharing. [link].
Wu, Z., Jiang, M., and Shen, C. (2024). Get an a in math: Progressive rectification prompting. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence.
X (2024). Grok-2 Beta Release. [link].
Zhao, H., Chen, H., Ruggles, T. A., Feng, Y., Singh, D., and Yoon, H.-J. (2024). Improving text classification with large language model-based data augmentation. Electronics, (2535).
Zheng, Liu, X. et al. (2023). Progressive-hint prompting improves reasoning in large language models.
Zhou, K., Ethayarajh, K., Card, D., and Jurafsky, D. (2022). Problems with cosine as a measure of embedding similarity for high frequency words.
Published
2025-09-01
How to Cite
SEVERO, Alex Sandre Pinheiro; LAUTERT, Douglas Paim; KREUTZ, Diego; BERTHOLDO, Leandro Márcio; POHLMANN, Marcio; QUINCOZES, Silvio Ereno. Categorization of Security Incidents Using Prompt Engineering in LLMs. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 256-272. DOI: https://doi.org/10.5753/sbseg.2025.11399.
