Simplifying forensic log analysis using Large Language Models with the RAG technique
Abstract
In the digital era, the growing complexity of computer systems and the sophistication of cyberattacks significantly increase the volume of logs generated, challenging cybersecurity professionals. Detecting and interpreting attacks or problems in these records is essential for a rapid response to security incidents. In this context, Large Language Models (LLMs) stand out as fundamental tools for understanding and generating natural language. This study presents an approach for analyzing system and network logs, aiming to detect, correlate, and interpret anomalies, using the Retrieval-Augmented Generation (RAG) technique with LLMs and interactions through targeted questions. The results demonstrate the effectiveness of the proposed approach in generating relevant information and simplifying forensic analysis for professionals in the field.
References
Adnan, K. and Akbar, R. (2019). An analytical study of information extraction from unstructured and multidimensional big data. Journal of Big Data, 6(1):1–38.
Ahmad, R., Alsmadi, I., Alhamdani, W., and Tawalbeh, L. (2023). Zero-day attack detection: a systematic literature review. Artificial Intelligence Review, 56(10):10733–10811.
Da Silva, E. H. M., dos Santos, E. M. F., de Barros Monteiro, M. L., Bezerra, S. L., and de Miranda, S. C. (2024). Chattcu: Inteligência artificial como assistente do auditor. Revista do TCU, 153:19–45.
Fan, H. and Qin, Y. (2018). Research on text classification based on improved tf-idf algorithm. In Proceedings of the 2018 International Conference on Network, Communication, Computer Engineering (NCCE 2018), pages 501–506. Atlantis Press.
Finardi, P., Avila, L., Castaldoni, R., Gengo, P., Larcher, C., Piau, M., Costa, P., and Caridá, V. (2024). The chronicles of rag: The retriever, the chunk and the generator.
Gao, Y., Xiong, Y., Gao, X., Jia, K., Pan, J., Bi, Y., Dai, Y., Sun, J., Wang, M., and Wang, H. (2024). Retrieval-augmented generation for large language models: A survey.
Ge, T., Jing, H., Wang, L., Wang, X., Chen, S.-Q., and Wei, F. (2024). In-context autoencoder for context compression in a large language model. In The Twelfth International Conference on Learning Representations.
Hadi, M. U., Qureshi, R., Shah, A., Irfan, M., Zafar, A., Shaikh, M. B., Akhtar, N., Wu, J., Mirjalili, S., et al. (2023). Large language models: a comprehensive survey of its applications, challenges, limitations, and future prospects. Authorea Preprints.
Hu, J., Zhou, Y., and Wang, J. (2024). Intrinsic evaluation of rag systems for deep-logic questions.
IBM (2023). What is langchain? Available at: [link]. Accessed: January 19, 2025.
International Telecommunication Union (ITU) (2023). Global offline population steadily declines to 2.6 billion people in 2023. Available at: [link]. Accessed: January 3, 2025.
Jeong, C. (2023). A study on the implementation of generative ai services using an enterprise data-based llm application architecture. Advances in Artificial Intelligence and Machine Learning, 3(4):1588–1618.
Jiang, H., Wu, Q., Lin, C.-Y., Yang, Y., and Qiu, L. (2023). LLMLingua: Compressing prompts for accelerated inference of large language models. In Bouamor, H., Pino, J., and Bali, K., editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13358–13376, Singapore. Association for Computational Linguistics.
Juvekar, K. and Purwar, A. (2024). Introducing a new hyper-parameter for rag: Context window utilization.
Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., and Iwasawa, Y. (2022). Large language models are zero-shot reasoners. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 22199–22213. Curran Associates, Inc.
Kwon, D., Kim, H., Kim, J., Suh, S. C., Kim, I., and Kim, K. J. (2019). A survey of deep learning-based network anomaly detection. Cluster Computing, 22:949–961.
LangChain (2024). Select by similarity. Available at: [link]. Accessed: January 19, 2025.
Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. (2020). Retrieval-augmented generation for knowledge-intensive nlp tasks. In Proceedings of the 34th International Conference on Neural Information Processing Systems, NIPS ’20, Red Hook, NY, USA. Curran Associates Inc.
Li, X., Tang, H., Chen, S., Wang, Z., Maravi, A., and Abram, M. (2023). Context matters: Data-efficient augmentation of large language models for scientific applications.
Liu, F., Kang, Z., and Han, X. (2024). Optimizing rag techniques for automotive industry pdf chatbots: A case study with locally deployed ollama models.
Liu, Y., Iter, D., Xu, Y., Wang, S., Xu, R., and Zhu, C. (2023). G-eval: Nlg evaluation using gpt-4 with better human alignment.
Maryamah, M., Irfani, M. M., Tri Raharjo, E. B., Rahmi, N. A., Ghani, M., and Raharjana, I. K. (2024). Chatbots in academia: A retrieval-augmented generation approach for improved efficient information access. In 2024 16th International Conference on Knowledge and Smart Technology (KST), pages 259–264.
Melz, E. (2023). Enhancing llm intelligence with arm-rag: Auxiliary rationale memory for retrieval augmented generation.
Naveed, H., Khan, A. U., Qiu, S., Saqib, M., Anwar, S., Usman, M., Akhtar, N., Barnes, N., and Mian, A. (2024). A comprehensive overview of large language models.
Nayerifard, T., Amintoosi, H., Bafghi, A. G., and Dehghantanha, A. (2023). Machine learning in digital forensics: A systematic literature review.
Oliner, A., Ganapathi, A., and Xu, W. (2012). Advances and challenges in log analysis. Communications of the ACM, 55(2):55–61.
Padilha, R., Theóphilo, A., Andaló, F. A., Vega-Oliveros, D. A., Cardenuto, J. P., Bertocco, G., Nascimento, J., Yang, J., and Rocha, A. (2021). A inteligência artificial e os desafios da ciência forense digital no século xxi. Estudos Avançados, 35(101):113–138.
Petukhova, A., Matos-Carvalho, J. P., and Fachada, N. (2025). Text clustering with large language model embeddings. International Journal of Cognitive Computing in Engineering, 6:100–108.
Rahutomo, F., Kitasuka, T., Aritsugi, M., et al. (2012). Semantic cosine similarity. In The 7th international student conference on advanced science and technology ICAST, volume 4, page 1. University of Seoul South Korea.
Rau, D., Wang, S., Déjean, H., and Clinchant, S. (2024). Context embeddings for efficient answer generation in rag.
Sawarkar, K., Mangal, A., and Solanki, S. R. (2024). Blended rag: Improving rag (retriever-augmented generation) accuracy with semantic search and hybrid query-based retrievers.
Silva, E. M. D. and Avanço, L. (2024). Visibilidade em cibersegurança: Uma pesquisa exploratória. In 20th CONTECSI-INTERNATIONAL CONFERENCE ON INFORMATION SYSTEMS AND TECHNOLOGY MANAGEMENT VIRTUAL.
Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. (2023). Llama: Open and efficient foundation language models.
Vazquez, F. J. B. (2024). Política de resposta a incidentes cibernéticos e estratégias de aderência à legislação brasileira. Dataset Reports, 3(1):114–119.
Wang, Z., Liu, J., Zhang, S., and Yang, Y. (2024). Poisoned langchain: Jailbreak llms by langchain.
Wei, J., Wang, X., Schuurmans, D., Bosma, M., Ichter, B., Xia, F., Chi, E., Le, Q. V., and Zhou, D. (2022). Chain-of-thought prompting elicits reasoning in large language models. In Koyejo, S., Mohamed, S., Agarwal, A., Belgrave, D., Cho, K., and Oh, A., editors, Advances in Neural Information Processing Systems, volume 35, pages 24824–24837. Curran Associates, Inc.
Yang, H., Zhang, M., Wei, D., and Guo, J. (2024). Srag: Speech retrieval augmented generation for spoken language understanding. In 2024 IEEE 2nd International Conference on Control, Electronics and Computer Technology (ICCECT), pages 370–374.
Yin, S., Fu, C., Zhao, S., Li, K., Sun, X., Xu, T., and Chen, E. (2024). A survey on multimodal large language models. National Science Review, 11(12).
Zhao, P., Zhang, H., Yu, Q., Wang, Z., Geng, Y., Fu, F., Yang, L., Zhang, W., Jiang, J., and Cui, B. (2024). Retrieval-augmented generation for ai-generated content: A survey.
Şakar, T. and Emekci, H. (2025). Maximizing rag efficiency: A comparative analysis of rag methods. Natural Language Processing, 31(1):1–25.
Published
September 1, 2025
How to Cite
BARROS, Carlos G. L.; LIMA, João P. A.; ARRUDA, Alexandre; SOUSA, Rubens Abraão da Silva; BANDEIRA, Alan Portela. Simplificação da análise forense de logs utilizando Grandes Modelos de Linguagem com a técnica RAG. In: SIMPÓSIO BRASILEIRO DE CIBERSEGURANÇA (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 839-854. DOI: https://doi.org/10.5753/sbseg.2025.10682.
