Evaluation of Named Entity Recognition using Ensemble in Transformers Models for Brazilian Public Texts

Eutino Júnior Vieira Sirqueira (UnB / IFPI); Flávio de Barros Vidal (UnB)

DOI: https://doi.org/10.5753/eniac.2024.245227

Abstract

Natural Language Processing (NLP) has advanced significantly, driven mainly by the development of deep learning models based on Transformers. In the Brazilian context, the analysis of open data, such as official documents published in the Official Federal Gazette (DOU), is crucial for transparency and access to information. In this work, we evaluate ensembles of Transformer models applied to the Named Entity Recognition (NER) task in Brazilian public texts. The evaluation covers a set of models based on variations of Bidirectional Encoder Representations from Transformers (BERT), combined through several ensemble strategies. On the proposed corpus, the ensembles reach improvements of up to 11% over classic NER approaches that use only a single BERT-based model.

Keywords: Named Entity Recognition, Transformer Models, Ensemble Learning, Brazilian Public Texts
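
The abstract does not say which ensemble strategies are combined. As a purely illustrative sketch, and not the authors' method, the snippet below shows one common strategy, token-level majority voting, over the label sequences produced by several NER models. The function name `majority_vote`, the tie-breaking rule (the earliest model in the list wins), and the sample labels are all assumptions made for this example.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-token label sequences from several models by majority vote.

    predictions: list of label sequences, one per model, all of equal length.
    Ties are broken in favor of the earliest model in the list (an assumption).
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all models must label the same tokens")

    combined = []
    for labels in zip(*predictions):  # labels proposed for one token
        counts = Counter(labels)
        top = max(counts.values())
        # Pick the first label, in model order, that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == top))
    return combined


# Hypothetical outputs of three NER models for the same four tokens.
model_a = ["B-ORG", "I-ORG", "O", "B-PER"]
model_b = ["B-ORG", "O", "O", "B-PER"]
model_c = ["B-ORG", "I-ORG", "O", "O"]

print(majority_vote([model_a, model_b, model_c]))
# → ['B-ORG', 'I-ORG', 'O', 'B-PER']
```

Voting on individual tokens can yield label sequences that break the BIO scheme (for example, an `I-ORG` with no preceding `B-ORG`), so a practical system would likely add a repair step or vote on whole entity spans instead.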
