Cluster-Based Masking and LLM Labeling for Sharing Network Incident Datasets (original title: Mascaramento por Agrupamento e Rotulagem com LLMs para Compartilhamento de Datasets de Incidentes em Redes)

Breno Valente Manhães; Guilherme A. Thomaz; Miguel Elias M. Campista

doi:10.5753/sbrc.2026.19403

Breno Valente Manhães (UFRJ)
Guilherme A. Thomaz (UFRJ)
Miguel Elias M. Campista (UFRJ)

Abstract

The dissemination of network security datasets is often limited by sensitive attributes present in the textual logs of tools such as OpenVAS and Nessus. This work proposes the MECAL algorithm (Mascaramento por Clusterização de Embeddings e Rotulagem Automática, i.e., masking by embedding clustering and automatic labeling) to anonymize these attributes while preserving their utility. The method uses Transformers to group incident descriptions semantically and employs LLMs (Large Language Models) to generate a generic, high-level label for each group. The results show that replacing the original texts with the generated labels improves data quality, as evidenced by higher F1-Score and Mutual Information values, enabling the secure sharing of cyber defense information.
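The pipeline outlined in the abstract (embed each description, group similar ones, replace every member of a group with one generic label) can be illustrated with a minimal sketch. This is not the MECAL implementation: the bag-of-words `embed` function stands in for Transformer embeddings, the greedy threshold clustering stands in for whatever clustering the authors use, and `label_cluster` is a hypothetical placeholder for the LLM call that simply keeps the tokens shared by all members. All function names, the `0.5` threshold, and the sample log lines are assumptions made for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a Transformer sentence embedding: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def cluster(texts, threshold=0.5):
    """Greedy clustering: join the first cluster whose seed is similar enough,
    otherwise start a new cluster. The threshold is an assumed value."""
    seeds, assignment = [], []
    for text in texts:
        vec = embed(text)
        for cid, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment.append(cid)
                break
        else:
            seeds.append(vec)
            assignment.append(len(seeds) - 1)
    return assignment


def label_cluster(members):
    """Hypothetical placeholder for the LLM labeling step: keep only the
    tokens that appear in every member, in the order of the first member."""
    shared = set(members[0].lower().split())
    for member in members[1:]:
        shared &= set(member.lower().split())
    ordered = [t for t in members[0].lower().split() if t in shared]
    return " ".join(ordered) or "unlabeled group"


def mask(texts, threshold=0.5):
    """Replace each text with the generic label of its cluster."""
    assignment = cluster(texts, threshold)
    groups = {}
    for text, cid in zip(texts, assignment):
        groups.setdefault(cid, []).append(text)
    labels = {cid: label_cluster(members) for cid, members in groups.items()}
    return [labels[cid] for cid in assignment]


# Made-up log descriptions containing host-specific details.
sample = [
    "ssh weak cipher on host 10.0.0.5",
    "ssh weak cipher on host 10.0.0.9",
    "outdated apache version on server web01",
    "outdated apache version on server web02",
]
print(mask(sample))
```

With the sample above, the two SSH findings collapse to `ssh weak cipher on host` and the two Apache findings to `outdated apache version on server`, so the host-specific tokens no longer appear in the shared column. In a real setting, the shared-token heuristic would be replaced by an LLM prompt over the cluster members, which is what lets the label be a genuinely higher-level description rather than a token intersection.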

