Evaluation of a Hybrid Approach to Legal Entity Recognition

Fernando Hurias Lopes; Luís Paulo Faina Garcia

doi:10.5753/eniac.2025.14243

Fernando Hurias Lopes UnB
Luís Paulo Faina Garcia UnB

Abstract

The extensive volume of documents and the technical language of the Brazilian judiciary system call for efficient methods of automating legal text analysis, a setting in which Legal Entity Recognition (LER) presents a significant challenge. This study evaluates the performance of LER models within the Brazilian legal domain through a comprehensive assessment across all publicly available Portuguese legal datasets: LeNER-BR, CDJur-BR, and UlyssesNER-BR. Eleven models were evaluated, encompassing classical-based, transformer-based, and hybrid approaches. Measured by Precision, Recall, and F1-Score, hybrid approaches consistently outperformed both classical-based and standalone transformer-based approaches in legal entity extraction tasks.
