Could you tell me the process ID? Structuring Text Documents from the Brazilian Electronic Information System Using a Named Entity Recognition Approach
Abstract
Context: As open data, transparency, and efficiency in public services gain importance, there is a growing demand for studies that improve government systems. Some public agencies use the Brazilian Electronic Information System (Sistema Eletrônico de Informações, SEI), a procedural management system that centralizes electronic processes and promotes administrative efficiency. Problem: Although SEI has contributed to advances in public administration, information search and retrieval remain difficult because the only search currently available is an inefficient keyword-based one. These difficulties are compounded by the large number of documents generated daily, which are written in natural language and vary in category, writing style, and structure. As a result, searching for relevant documents in SEI is time-consuming, which leads users to open unnecessary processes and to reach resolutions that are inconsistent with those of previously completed processes. Solution: A Natural Language Processing (NLP) pipeline was proposed to extract information from SEI documents using Named Entity Recognition (NER) models. IS Theory: Organizational Information Processing. Method: This research adopts a descriptive approach. Public SEI documents were collected using a web crawler, and trained annotators built a corpus to enable the training of state-of-the-art NER models. The models' performance was compared and analyzed quantitatively. Summary of Results: A labeled Brazilian Portuguese corpus of SEI documents for NER was curated and validated, leading to an NLP pipeline for information extraction. Contributions and Impact in the IS area: This research provides a baseline for structuring data from Electronic Information Systems, enabling more effective strategies for search and retrieval tasks.