Automatic Classification of Access Levels in Documents from the Brazilian Electronic Information System
Abstract
The Brazilian Electronic Information System (SEI) is used to manage documents and administrative processes in public institutions. Users often need to upload external documents and manually assign an access level: public, restricted, or confidential. To reduce human error in this process, we propose an automatic classification approach. A labeled corpus was constructed from public SEI documents and artificial samples generated by filling SEI form templates with fictional content using ChatGPT. We trained and evaluated four classification models: SVM, LSTM, BiLSTM, and BERT. Experiments used stratified k-fold cross-validation, and the results showed that SVM and BiLSTM performed best, each achieving a macro F1-score of 0.96.
References
Aldeen, Y. A. A. S., Salleh, M., and Razzaque, M. A. (2015). A comprehensive review on privacy preserving data mining. SpringerPlus, 4(1):694.
Alparslan, E., Karahoca, A., and Bahşi, H. (2011). Classification of confidential documents by using adaptive neurofuzzy inference systems. Procedia Computer Science, 3:1412–1417. World Conference on Information Technology.
Bayer, M., Kaufhold, M.-A., and Reuter, C. (2022). A survey on data augmentation for text classification. ACM Computing Surveys, 55(7):1–39.
Cai, X., Xiao, M., Ning, Z., and Zhou, Y. (2023). Resolving the imbalance issue in hierarchical disciplinary topic inference via LLM-based data augmentation. In 2023 IEEE International Conference on Data Mining Workshops (ICDMW), pages 1424–1429. IEEE.
Camarinha, A., Abreu, A., Júnior, M., and Cardoso, I. (2023). Users’ perception of satisfaction of the eletronic information system – SEI in the Instituto Federal de Rondônia. Journal of Information Systems Engineering and Management, 8:18354.
Cloutier, N. A. and Japkowicz, N. (2023). Fine-tuned generative LLM oversampling can improve performance over traditional techniques on multiclass imbalanced text classification. In 2023 IEEE International Conference on Big Data (BigData), pages 5181–5186. IEEE.
Coelho, G. M., Ramos, A. C., de Sousa, J., Cavaliere, M., de Lima, M. J., Mangeth, A., Frajhof, I. Z., Cury, C., and Casanova, M. A. (2022). Text classification in the Brazilian legal domain. In ICEIS (1), pages 355–363.
Csányi, G. M., Nagy, D., Vági, R., Vadász, J. P., and Orosz, T. (2021). Challenges and open problems of legal document anonymization. Symmetry, 13(8):1490.
De Araujo, P. H. L., de Campos, T. E., Braz, F. A., and da Silva, N. C. (2020). VICTOR: a dataset for Brazilian legal documents classification. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 1449–1458.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. (2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.
Ghann, P., Tetteh, E. D., Asare Obeng, K., and Elias, M. (2022). Preserving the privacy of sensitive data using bit-coded-sensitive algorithm (BCSA). International Journal of Recent Contributions from Engineering, Science & IT (iJES), 10(04):4–16.
Gottschalg-Duque, C. (2024). Towards the use of blockchain technology in SEI, a Brazilian electronic document and process management tool. In Anais do II Colóquio em Blockchain e Web Descentralizada, pages 26–31, Porto Alegre, RS, Brasil. SBC.
Graves, A. (2012). Supervised Sequence Labelling with Recurrent Neural Networks. Studies in Computational Intelligence. Springer Berlin Heidelberg.
Hochreiter, S. and Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8):1735–1780.
Juez-Hernandez, R., Quijano-Sánchez, L., Liberatore, F., and Gómez, J. (2023). Agora: An intelligent system for the anonymization, information extraction and automatic mapping of sensitive documents. Applied Soft Computing, 145:110540.
Kingma, D. and Ba, J. (2014). Adam: A method for stochastic optimization. International Conference on Learning Representations.
Long, L., Wang, R., Xiao, R., Zhao, J., Ding, X., Chen, G., and Wang, H. (2024). On LLMs-driven synthetic data generation, curation, and evaluation: A survey. arXiv preprint arXiv:2406.15126.
Sangaroonsilp, P., Choetkiertikul, M., Dam, H. K., and Ghose, A. (2023a). An empirical study of automated privacy requirements classification in issue reports. Automated Software Engineering, 30(2):20.
Sangaroonsilp, P., Dam, H. K., Choetkiertikul, M., Ragkhitwetsagul, C., and Ghose, A. (2023b). A taxonomy for mining and classifying privacy requirements in issue reports. Information and Software Technology, 157:107162.
Smith, R. (2007). An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), volume 2, pages 629–633. IEEE.
Sousa, S. and Kern, R. (2023). How to keep text private? A systematic review of deep learning methods for privacy-preserving natural language processing. Artificial Intelligence Review, 56(2):1427–1492.
Sulavko, A., Varkentin, Y., Panfilova, I., and Samotuga, A. (2024). Automatic classification of text messages by confidentiality level based on ensemble of artificial neural networks. In 2024 2nd International Conference on Foundation and Large Language Models (FLLM), pages 484–492. IEEE.
Wu, J. M.-T., Srivastava, G., Jolfaei, A., Fournier-Viger, P., and Lin, J. C.-W. (2021). Hiding sensitive information in eHealth datasets. Future Generation Computer Systems, 117:169–180.
Published
20/07/2025
How to Cite
BORGES, Ana Clara B.; MARINHO, Mayara C.; NOGUEIRA, Rodrigo de Freitas; BORDIM, Jacir L.; BORGES, Vinicius R. P. Automatic Classification of Access Levels in Documents from the Brazilian Electronic Information System. In: LATIN AMERICAN SYMPOSIUM ON DIGITAL GOVERNMENT (LASDIGOV), 12., 2025, Maceió/AL. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 203-214. ISSN 2763-8723. DOI: https://doi.org/10.5753/lasdigov.2025.9117.
