AnonLFI 2.0: Extensible Architecture for PII Pseudonymization in CSIRTs with OCR and Technical Recognizers

  • Cristhian Kapelinski UNIPAMPA
  • Douglas Lautert UNIPAMPA
  • Beatriz Machado UNIPAMPA
  • Diego Kreutz UNIPAMPA

Abstract


This work presents AnonLFI 2.0, a modular pseudonymization framework for CSIRTs that employs HMAC-SHA256 to generate strong and reversible pseudonyms, preserves the structural integrity of XML and JSON documents, and integrates OCR as well as specialized technical recognizers for PII and security-related artifacts. In two case studies involving OCR applied to PDF files and the pseudonymization of OpenVAS XML reports, the system achieved 100% precision and F1 scores of 76.5% and 92.13%, demonstrating its effectiveness for the secure preparation of complex cybersecurity datasets.
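As an illustration of the idea described in the abstract, the following minimal Python sketch derives keyed pseudonyms with HMAC-SHA256 and applies them to the string values of a JSON-like document while leaving keys, nesting, and non-string values untouched. It is not the AnonLFI 2.0 implementation. The class name `Pseudonymizer`, the token format, the IPv4 pattern, and the example key are all hypothetical. Because an HMAC digest is one-way, the sketch assumes that reversibility comes from a separately stored token-to-value table; the abstract does not say how the framework achieves it.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration only; a real deployment would load it from a secret store.
SECRET_KEY = b"example-key-do-not-use"

# Simplified IPv4 pattern standing in for a "technical recognizer" (assumption).
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Pseudonymizer:
    """Keyed, deterministic pseudonyms with an assumed lookup table for re-identification."""

    def __init__(self, key: bytes):
        self._key = key
        self._reverse = {}  # token -> original value; must be stored as securely as the key

    def pseudonym(self, value: str, label: str) -> str:
        # HMAC-SHA256 over the value: the same input and key always yield the same token.
        digest = hmac.new(self._key, value.encode("utf-8"), hashlib.sha256).hexdigest()
        token = f"[{label}_{digest[:16]}]"
        self._reverse[token] = value
        return token

    def reidentify(self, token: str) -> str:
        # Reversal is a table lookup; the HMAC itself cannot be inverted.
        return self._reverse[token]

    def walk(self, node):
        # Recurse through dicts and lists so the document structure is preserved.
        if isinstance(node, dict):
            return {k: self.walk(v) for k, v in node.items()}
        if isinstance(node, list):
            return [self.walk(v) for v in node]
        if isinstance(node, str):
            return IPV4.sub(lambda m: self.pseudonym(m.group(0), "IP"), node)
        return node


if __name__ == "__main__":
    p = Pseudonymizer(SECRET_KEY)
    report = {"host": "10.0.0.5", "ports": [22, 80], "note": "scan of 10.0.0.5 done"}
    print(p.walk(report))
```

Because the pseudonym is deterministic for a given key, repeated occurrences of the same address map to the same token, so correlations across a report survive pseudonymization.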

Published
08/12/2025
KAPELINSKI, Cristhian; LAUTERT, Douglas; MACHADO, Beatriz; KREUTZ, Diego. AnonLFI 2.0: Extensible Architecture for PII Pseudonymization in CSIRTs with OCR and Technical Recognizers. In: ESCOLA REGIONAL DE REDES DE COMPUTADORES (ERRC), 22., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 81-87. DOI: https://doi.org/10.5753/errc.2025.17784.