A benchmarking for public information by Machine Learning and Regular Language

  • Fernando Antonio Dantas Gomes Pinto PUC-Rio
  • Jefferson de Barros Santos FGV
  • Sérgio Lifschitz PUC-Rio
  • Edward Hermann Haeusler PUC-Rio


Technologies such as Big Data and Transfer Learning have attracted the interest of industry and academia over the last 15 years. As a result, there is an almost unanimous preference for solutions based on statistical models, which are revolutionizing the information extraction process. In this research, we question whether statistical models are always the best solution for extracting information from documents. We compare machine learning (ML) and rule-based approaches on the task of recognizing legal entities in the official gazette. We built an annotated dataset with 100 examples of legal documents and evaluated the resulting model in IBM Watson Knowledge Studio (WKS). We show that, when documents follow a formal structure, rule-based information extraction systems remain a lower-cost, simpler, and more efficient solution.
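
To make the rule-based side of this comparison concrete, the following is a minimal sketch of how a regular expression might pull labeled fields out of a formally structured gazette header. The sample sentence, the `ACT_RULE` pattern, the field names, and the `extract_acts` helper are all hypothetical illustrations; they are not the grammar or the entity types used in the paper.

```python
import re

# Hypothetical gazette-style sentence; not taken from the paper's dataset.
SAMPLE = (
    "PORTARIA Nº 123, DE 4 DE MAIO DE 2021. "
    "Nomeia MARIA DA SILVA para o cargo de Analista."
)

# Illustrative rule (an assumption): an act type, its number, and its date
# always appear in the same fixed order in the header of an act.
ACT_RULE = re.compile(
    r"(?P<act_type>PORTARIA|DECRETO|RESOLUÇÃO)\s+Nº\s+(?P<number>\d+),\s+"
    r"DE\s+(?P<date>\d{1,2}\s+DE\s+[A-ZÇ]+\s+DE\s+\d{4})"
)


def extract_acts(text):
    """Return one dict of labeled fields per act header matched in text."""
    return [match.groupdict() for match in ACT_RULE.finditer(text)]


acts = extract_acts(SAMPLE)
print(acts)
```

A rule like this needs no annotated training data, which is one plausible reading of the low-cost argument, but it only works as long as the documents keep to the expected structure.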

Keywords: Benchmarking, Machine Learning, Regular Language


How to Cite

PINTO, Fernando Antonio Dantas Gomes; SANTOS, Jefferson de Barros; LIFSCHITZ, Sérgio; HAEUSLER, Edward Hermann. A benchmarking for public information by Machine Learning and Regular Language. In: WORKSHOP DE COMPUTAÇÃO APLICADA EM GOVERNO ELETRÔNICO (WCGE), 11., 2023, João Pessoa/PB. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 60-71. ISSN 2763-8723. DOI: https://doi.org/10.5753/wcge.2023.229975.
