Exploratory Analysis of Textual Attributes in Databases for Identification of Sensitive Fields
Abstract
The imminent entry into force of the Brazilian General Law for the Protection of Personal Data (LGPD) makes automated techniques for database anonymization urgent. Existing tools depend on a specialist to adjust the data manually. In this work, we propose applying classification algorithms to attributes commonly found in databases. We aim to improve the automated classification of database attributes, so that it can be used in the development of new software or as a component that runs before the anonymization process. The experimental evaluation of the proposed digram frequency representation shows that simple machine learning models, such as random forests and neural networks, can classify people's names, addresses, and textual descriptions with 97% accuracy using 676 features.
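As an illustration of what a digram frequency representation might look like, the following minimal sketch maps a text value to a 676-dimensional vector. It assumes that the 676 features correspond to the 26 × 26 ordered pairs of the letters a to z, which the abstract does not state explicitly. The names `DIGRAMS`, `INDEX`, and `digram_frequencies` are hypothetical, and the normalization to relative frequencies is likewise an assumption rather than the authors' confirmed method.

```python
from itertools import product
from string import ascii_lowercase

# Assumption: the feature space is every ordered pair of the letters a-z,
# which gives 26 * 26 = 676 digrams in a fixed order.
DIGRAMS = ["".join(pair) for pair in product(ascii_lowercase, repeat=2)]
INDEX = {digram: i for i, digram in enumerate(DIGRAMS)}


def digram_frequencies(text):
    """Return a 676-element list of relative digram frequencies for text."""
    counts = [0] * len(DIGRAMS)
    lowered = text.lower()
    # Slide over adjacent character pairs; pairs containing anything outside
    # a-z (spaces, digits, accented letters) are skipped in this sketch.
    for first, second in zip(lowered, lowered[1:]):
        i = INDEX.get(first + second)
        if i is not None:
            counts[i] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(DIGRAMS)
    return [count / total for count in counts]


vec = digram_frequencies("Maria Silva")
print(len(vec))  # 676
```

A vector of this kind could then be passed as a feature row to a classifier such as a random forest. Accented letters, which are common in Portuguese names and addresses, would need to be folded to their base letters or given their own features; this sketch simply ignores them.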
