Classification of Sensitive Public Administration Documents Using CBIR
Abstract
Public organizations face difficulties in classifying their documents and in making them transparent. Correct classification is critical to prevent public access to sensitive information and to protect individuals and organizations from its malicious use. This paper presents ongoing research that proposes approaches for classifying sensitive documents with machine learning techniques. Real data from the Electronic Information System (SEI) of UFG were used, and preliminary results demonstrate the potential and viability of the project, which has already achieved an accuracy of 87% in the classification of public documents.
