Segmentation and Semantic Classification of Official Documents

  • Kattiana Constantino Federal University of Minas Gerais (UFMG) http://orcid.org/0000-0003-4511-7504
  • Victor Augusto L. Cruz Federal University of Minas Gerais (UFMG)
  • Otávio M. M. Zucheratto Federal University of Minas Gerais (UFMG)
  • Celso França Federal University of Minas Gerais (UFMG)
  • Marcos Carvalho Federal University of Minas Gerais (UFMG)
  • Thiago H. P. Silva Federal University of Technology – Paraná (UTFPR)
  • Alberto H. F. Laender Federal University of Minas Gerais (UFMG)
  • Marcos André Gonçalves Federal University of Minas Gerais (UFMG)

Abstract


Unrestricted, monitorable access to laws and regulations is a prerequisite of democracy. It enables, for example, the detection of illicit acts and the monitoring of fraud in public actions (e.g., bids). However, each federated entity follows its own criteria for the templates and formats in which it publishes this information, for example, in municipal, state, and Union official journals. In this context, our objective is to minimize the effort of extracting text from these sources by proposing a structure-oriented heuristic that segments excerpts from public documents, notably official journals. We then semantically classify the extracted snippets with an active learning strategy that minimizes manual labeling effort. As a result, we developed an annotation prototype integrated into the classification process, achieving 100% accuracy in extraction and 85% in classification with very little labeling effort.
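
The abstract does not say which active learning strategy is used. As an illustration only, the sketch below shows one common option, least-confidence uncertainty sampling: from a pool of unlabeled snippets, the annotator is asked to label those for which the current classifier is least sure. The function name `least_confident`, the `pool_probs` values, and the batch size are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence


def least_confident(probabilities: Sequence[Sequence[float]], batch_size: int) -> List[int]:
    """Return indices of the `batch_size` pool items with the lowest top-class probability."""
    # Confidence of each item is the probability of its most likely class.
    confidence = [max(p) for p in probabilities]
    # Rank pool indices from least to most confident (stable sort keeps pool order on ties).
    ranked = sorted(range(len(confidence)), key=lambda i: confidence[i])
    return ranked[:batch_size]


# Hypothetical class probabilities predicted for four unlabeled snippets (three classes each).
pool_probs = [
    [0.90, 0.05, 0.05],
    [0.40, 0.35, 0.25],
    [0.55, 0.30, 0.15],
    [0.34, 0.33, 0.33],
]

# Indices of the two snippets that would be sent to the annotator next.
to_label = least_confident(pool_probs, batch_size=2)
print(to_label)
```

In a full loop, the newly labeled snippets would be added to the training set, the classifier retrained, and the selection repeated until the labeling budget is spent.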

Keywords: segmentation, semantic classification, official documents, active learning

Published: 2022-09-19

CONSTANTINO, Kattiana; CRUZ, Victor Augusto L.; ZUCHERATTO, Otávio M. M.; FRANÇA, Celso; CARVALHO, Marcos; SILVA, Thiago H. P.; LAENDER, Alberto H. F.; GONÇALVES, Marcos André. Segmentation and Semantic Classification of Official Documents. In: BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 37., 2022, Búzios. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 304-316. ISSN 2763-8979. DOI: https://doi.org/10.5753/sbbd.2022.224656.