Exploratory Analysis of Textual Attributes in Databases for Identification of Sensitive Fields

  • Bruno H. Labres UFPR
  • André Grégio UFPR
  • Fabiano Silva UFPR

Abstract


The imminent implementation of the Brazilian General Law for the Protection of Personal Data makes automated techniques for database anonymization an urgent need. Existing tools depend on a specialist to adjust the data manually. In this work, we propose applying classification algorithms to attributes commonly found in databases. We aim to improve the automated classification of database attributes, which can be used in the development of new software or as a component that runs before the anonymization process. The experimental evaluation of the proposed digram frequency representation shows that simple machine learning models, such as random forests and neural networks, can classify people's names, addresses, and textual descriptions with 97% accuracy using 676 features.
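
As an illustration of what a digram frequency representation with 676 features might look like, the following minimal Python sketch maps a text value to a vector of relative frequencies over all ordered pairs of the 26 lowercase Latin letters (26 × 26 = 676). The abstract does not describe the preprocessing, so the accent stripping, the lowercase folding, the choice to skip pairs containing non-letters, and the names `normalize` and `digram_frequencies` are all assumptions made for this example, not the authors' implementation.

```python
from collections import Counter
from itertools import product
import string
import unicodedata

# All ordered pairs of lowercase Latin letters: 26 * 26 = 676 digrams.
DIGRAMS = ["".join(pair) for pair in product(string.ascii_lowercase, repeat=2)]
INDEX = {digram: position for position, digram in enumerate(DIGRAMS)}


def normalize(text):
    """Lowercase the text and strip diacritics (assumed preprocessing)."""
    decomposed = unicodedata.normalize("NFKD", text.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def digram_frequencies(text):
    """Return a 676-dimensional vector of relative digram frequencies.

    Pairs that contain anything other than a-z (digits, spaces,
    punctuation) are skipped, which is an assumption of this sketch.
    """
    cleaned = normalize(text)
    counts = Counter(
        cleaned[i:i + 2]
        for i in range(len(cleaned) - 1)
        if cleaned[i:i + 2] in INDEX
    )
    vector = [0.0] * len(DIGRAMS)
    total = sum(counts.values())
    if total == 0:
        return vector  # no letter pairs found, e.g. a purely numeric value
    for digram, count in counts.items():
        vector[INDEX[digram]] = count / total
    return vector


if __name__ == "__main__":
    features = digram_frequencies("João da Silva")
    print(len(features))  # one feature per possible digram
```

Vectors built this way have a fixed length regardless of the length of the input value, so they could be passed directly as feature rows to a classifier such as a random forest.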

Published
2021-10-04
LABRES, Bruno H.; GRÉGIO, André; SILVA, Fabiano. Exploratory Analysis of Textual Attributes in Databases for Identification of Sensitive Fields. In: WORKSHOP ON SCIENTIFIC INITIATION AND UNDERGRADUATE WORKS - BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 21., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 98-109. DOI: https://doi.org/10.5753/sbseg_estendido.2021.17365.
