Clustering-Based Masking and LLM Labeling for Sharing Network Incident Datasets
Abstract
The dissemination of network security datasets is often limited by sensitive attributes present in the textual logs of tools such as OpenVAS and Nessus. This work proposes the MECAL algorithm (Mascaramento por Clusterização de Embeddings e Rotulagem Automática; masking by embedding clustering and automatic labeling) to anonymize these attributes while preserving their utility. The method uses Transformers to group incident descriptions semantically and employs Large Language Models (LLMs) to generate generic, high-level labels for each group. The results show that replacing the original texts with the generated labels improves data quality, as evidenced by higher F1-Score and Mutual Information values, enabling the secure sharing of cyber defense information.
References
Anthropic (2024). Clio: Privacy-preserving insights into real-world ai use. Technical report, Anthropic Research.
Aufschläger, R., Wilhelm, S., Heigl, M., and Schramm, M. (2024). Clustem4ano: Clustering text embeddings of nominal textual attributes for microdata anonymization. In Proceedings of the 28th International Database Engineering & Applications Symposium (IDEAS). arXiv:2412.12649.
Cerda, P., Varoquaux, G., and Kégl, B. (2018). Similarity encoding for learning with dirty categorical variables. Machine Learning, 107(8):1477–1494.
Deußer, T., Sparrenberg, L., Berger, A., Hahnbück, M., Bauckhage, C., and Sifa, R. (2025). A survey on current trends and recent advances in text anonymization. arXiv preprint arXiv:2508.21587.
Fung, B. C., Wang, K., Chen, R., and Yu, P. S. (2010). Privacy-preserving data publishing: A survey of recent developments. ACM Computing Surveys (CSUR), 42(4):1–53.
Garg, S. and Torra, V. (2023). K-anonymous privacy preserving manifold learning. In Proceedings of the 15th International Joint Conference on Knowledge Discovery, Knowledge Engineering and Knowledge Management (IC3K). Umeå University, SciTePress.
Grootendorst, M. (2025). Summaries as centroids for interpretable and scalable text clustering. arXiv preprint arXiv:2502.09667.
Hugging Face (2021). Hugging face model hub: all-minilm-l6-v2. [link].
Machado, B., Lautert, D., Kapelinski, C., and Kreutz, D. (2025). Structured extraction of vulnerabilities in openvas and tenable was reports using llms. arXiv preprint arXiv:2511.15745.
Machanavajjhala, A., Kifer, D., Gehrke, J., and Venkitasubramaniam, M. (2007). l-diversity: Privacy beyond k-anonymity. ACM Transactions on Knowledge Discovery from Data (TKDD), 1(1):3:1–3:52.
Pilán, I., Manzanares-Salor, B., Sánchez, D., and Lison, P. (2025). Truthful text sanitization guided by inference attacks. arXiv preprint arXiv:2412.12928.
Reimers, N. and Gurevych, I. (2019). Sentence-bert: Sentence embeddings using siamese bert-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing.
Ring, M., Wunderlich, S., Scheuring, D., Landes, D., and Hotho, A. (2019). A survey of network-based intrusion detection data sets. Computers & Security, 86:147–167.
Rose, S., Engel, D., Cramer, N., and Cowley, W. (2010). Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory, pages 1–20. John Wiley & Sons, Ltd.
Shannon, C. E. (1948). A mathematical theory of communication. The Bell system technical journal, 27(3):379–423.
Sweeney, L. (2002). k-anonymity: A model for protecting privacy. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 10(05):557–570.
Wagner, T. D., Mahbub, K., Palomar, E., and Abdallah, A. E. (2019). Cyber threat intelligence sharing: Survey and research directions. Computers & Security, 87:101589.
Zhang, Y. and Li, X. (2025). Sdlog: A deep learning framework for detecting sensitive information in software logs. arXiv preprint arXiv:2505.14976.
Published
25 May 2026
How to Cite
MANHÃES, Breno Valente; THOMAZ, Guilherme A.; CAMPISTA, Miguel Elias M. Mascaramento por Agrupamento e Rotulagem com LLMs para Compartilhamento de Datasets de Incidentes em Redes. In: SIMPÓSIO BRASILEIRO DE REDES DE COMPUTADORES E SISTEMAS DISTRIBUÍDOS (SBRC), 44., 2026, Praia do Forte/BA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 954-967. ISSN 2177-9384. DOI: https://doi.org/10.5753/sbrc.2026.19403.
