Sensitive Data Protection in Police Reports: De-identification Techniques and Applications in Machine Learning
Abstract
This work proposes a methodology for identifying and de-identifying sensitive data in police reports using named entity recognition (NER) techniques. Two models are compared: BERTimbau, a transformer-based model pretrained on Brazilian Portuguese, and a BiLSTM, a traditional recurrent architecture. The results indicate that BERTimbau achieved a higher macro F1-score and more effective de-identification, especially for minority entities. The study reinforces the need for contextual models and robust metrics to ensure privacy in public security data.
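To make the de-identification step concrete, the following is a minimal sketch of how token-level NER predictions could be turned into redacted text. It assumes BIO-style tags and a hypothetical label set (`PER`, `LOC`). The function name `deidentify` and the sample sentence are illustrative and not taken from the study, which may use a different tagging scheme or replacement strategy.

```python
from typing import List, Optional


def deidentify(tokens: List[str], tags: List[str]) -> str:
    """Replace each tagged entity span with a single [LABEL] placeholder.

    Assumes BIO tags such as "B-PER", "I-PER", and "O" (an assumption,
    not the tagging scheme confirmed by the paper).
    """
    if len(tokens) != len(tags):
        raise ValueError("tokens and tags must have the same length")

    out: List[str] = []
    prev_label: Optional[str] = None  # label of the entity span currently open

    for token, tag in zip(tokens, tags):
        if tag == "O":
            # Non-sensitive token: keep it and close any open span.
            out.append(token)
            prev_label = None
            continue

        prefix, _, label = tag.partition("-")
        # Emit a placeholder on B-, or on an I- that does not continue the open span.
        if prefix == "B" or label != prev_label:
            out.append(f"[{label}]")
        prev_label = label

    return " ".join(out)


# Hypothetical sentence and model output, for illustration only.
tokens = ["A", "vítima", "Maria", "Silva", "reside", "em", "Belém", "."]
tags = ["O", "O", "B-PER", "I-PER", "O", "O", "B-LOC", "O"]
print(deidentify(tokens, tags))
```

Collapsing a multi-token span into one placeholder avoids leaking the number of tokens in a name. A pseudonymization variant would substitute a surrogate value instead of the label.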
