Evaluating Named Entity Recognition for Personal Data Discovery in Audio Transcripts
Abstract
The exponential growth in the volume of personal data has driven legislation such as Brazil's General Data Protection Law (LGPD). However, the scarcity of publicly available annotated data and the unstructured nature of sources such as audio transcripts make it challenging both to train named entity recognition (NER) models and to identify personal data reliably. To address these challenges, this work presents the results of applying the BERTimbau model, fine-tuned on a corpus built from synthetic data, to identify four entity types (Name, CPF, RG, and Address) and to extract the relations "Resides at" (Reside em) and "Holds document" (Possui documento). The fine-tuned model achieved an F1-score of 0.98 on the NER task and 0.29 on relation extraction.
References
ABNT (2023). ABNT NBR ISO/IEC 27005:2023 - Tecnologia da Informação – Técnicas de Segurança – Gestão de riscos de segurança da informação. International Organization for Standardization. Accessed: 20 May 2024.
Aggarwal, C. C. and Zhai, C., editors (2012). Mining Text Data. Springer, New York, NY, 1st edition.
Bannour, N., Wajsbürt, P., Rance, B., Tannier, X., and Névéol, A. (2022). Privacy-preserving mimic models for clinical named entity recognition in French. Journal of Biomedical Informatics, 130:104073.
Brasil (2018). Lei nº 13.709, de 14 de agosto de 2018. Diário Oficial [da] República Federativa do Brasil.
Campesato, O. (2020). Artificial Intelligence, Machine Learning, and Deep Learning. Mercury Learning and Information, Berlin, Boston.
Catelli, R., Gargiulo, F., Casola, V., De Pietro, G., Fujita, H., and Esposito, M. (2020). Crosslingual named entity recognition for clinical de-identification applied to a COVID-19 Italian data set. Applied Soft Computing, 97:106779.
Catelli, R., Gargiulo, F., Damiano, E., Esposito, M., and De Pietro, G. (2021). Clinical de-identification using sub-document analysis and ELECTRA. In 2021 IEEE International Conference on Digital Health (ICDH), pages 266–275.
Eisenstein, J. (2019). Introduction to natural language processing. The MIT Press.
Gartner (2019). Gartner predicts 2019 for the future of privacy. Accessed: 17 Nov. 2024.
Gultiaev, A. A. and Domashova, J. V. (2022). Developing a named entity recognition model for text documents in Russian to detect personal data using machine learning methods. Procedia Computer Science, 213:127–135. 2022 Annual International Conference on Brain-Inspired Cognitive Architectures for Artificial Intelligence: The 13th Annual Meeting of the BICA Society.
Herwanto, G. B., Quirchmayr, G., and Tjoa, A. M. (2021). A named entity recognition based approach for privacy requirements engineering. In 2021 IEEE 29th International Requirements Engineering Conference Workshops (REW), pages 406–411.
Hu, Y., Li, R., Wang, S., Tao, F., and Sun, Z. (2022). SpeechHide: A hybrid privacy-preserving mechanism for speech content and voiceprint in speech data sharing. In 2022 7th IEEE International Conference on Data Science in Cyberspace (DSC), pages 345–352.
Ignaczak, L., Martins, M. G., da Costa, C. A., Donida, B., and da Silva, M. C. P. (2023). An evaluation of NERC learning-based approaches to discover personal data in Brazilian Portuguese documents. Discover Data, 1(1).
Moussaoui, T. E., Chakir, L., and Boumhidi, J. (2023). Preserving privacy in Arabic judgments: AI-powered anonymization for enhanced legal data privacy. IEEE Access, 11:117851–117864.
Neves, M. (2022). O que são dados e por que eles são importantes? Nubank Blog. Accessed: 20 May 2024.
Nikolenko, S. I. (2019). Synthetic data for deep learning. arXiv preprint arXiv:1909.11512.
Silva, P., Gonçalves, C., Godinho, C., Antunes, N., and Curado, M. (2020). Using NLP and machine learning to detect data privacy violations. In IEEE INFOCOM 2020 - IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), pages 972–977.
Wongvises, C., Khurat, A., and Noraset, T. (2022). Thai privacy notice analysis based on named-entity recognition technique. In 2022 26th International Computer Science and Engineering Conference (ICSEC), pages 257–262.
Zhang, B., Yao, X., Li, H., and Aini, M. (2023). Chinese medical named entity recognition based on expert knowledge and fine-tuning BERT. In 2023 IEEE International Conference on Knowledge Graph (ICKG), pages 84–90. IEEE.
Published
1 September 2025
How to Cite
MUNHOS, Carlos André Misiuk; IGNACZAK, Luciano. Avaliação do Reconhecimento de Entidades Nomeadas para Descoberta de Dados Pessoais em Transcrições de Áudio. In: WORKSHOP DE TRABALHOS DE INICIAÇÃO CIENTÍFICA E DE GRADUAÇÃO - SIMPÓSIO BRASILEIRO DE CIBERSEGURANÇA (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 192-203. DOI: https://doi.org/10.5753/sbseg_estendido.2025.10728.
