Masking Through Clustering and Labeling with LLMs for Sharing Network Incident Datasets

Breno Valente Manhães (UFRJ)
Guilherme A. Thomaz (UFRJ)
Miguel Elias M. Campista (UFRJ)

DOI: https://doi.org/10.5753/sbrc.2026.19403

Abstract

Sharing network security datasets is often restricted by sensitive attributes in the textual logs produced by tools such as OpenVAS and Nessus. This paper proposes MECAL (Masking via Embedding Clustering and Automated Labeling), an algorithm that anonymizes these attributes while preserving their utility. The method uses Transformers to cluster incident descriptions semantically and Large Language Models (LLMs) to generate a generic, high-level label for each cluster. Results show that replacing the original texts with the generated labels improves data quality, as evidenced by gains in F1-score and Mutual Information, which enables the secure sharing of cyber defense information.
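The cluster-then-label idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: to keep it runnable with the standard library alone, a bag-of-words vector stands in for the Transformer embedding, a greedy similarity threshold stands in for the clustering step, and a shared-token rule stands in for the LLM that writes each label. All function names, the threshold value, and the sample reports are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a Transformer sentence embedding (assumption):
    a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(texts, threshold=0.3):
    """Greedy single-pass clustering: a text joins the first cluster whose
    seed vector is similar enough, otherwise it starts a new cluster."""
    seeds, assignment = [], []
    for text in texts:
        vec = embed(text)
        for cid, seed in enumerate(seeds):
            if cosine(vec, seed) >= threshold:
                assignment.append(cid)
                break
        else:
            seeds.append(vec)
            assignment.append(len(seeds) - 1)
    return assignment


def label_cluster(members):
    """Hypothetical stand-in for the LLM labeling call: keep only the tokens
    shared by every member, so member-specific details drop out."""
    if len(members) < 2:
        # A singleton has nothing to generalize against; avoid echoing it.
        return "generic incident"
    shared = set.intersection(*(set(m.lower().split()) for m in members))
    return " ".join(t for t in members[0].lower().split() if t in shared)


def mask(texts, threshold=0.3):
    """Replace each text with the label of the cluster it belongs to."""
    assignment = cluster(texts, threshold)
    groups = {}
    for text, cid in zip(texts, assignment):
        groups.setdefault(cid, []).append(text)
    labels = {cid: label_cluster(members) for cid, members in groups.items()}
    return [labels[cid] for cid in assignment]


# Invented sample descriptions, loosely in the style of a scanner report.
reports = [
    "SSH weak cipher enabled on host 10.0.0.5",
    "SSH weak cipher enabled on host 10.0.0.9",
    "Apache outdated version detected on srv-web01",
    "Apache outdated version detected on srv-web02",
]
masked = mask(reports)
print(masked)
```

In this toy run the two SSH findings collapse into one label and the two Apache findings into another, and the host-specific tokens disappear from the masked column. A real pipeline would replace `embed` with a sentence-embedding model and `label_cluster` with a prompt to an LLM.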
