CDJUR-BR - A Golden Collection of the Brazilian Judiciary with Refined Named Entities

Maurício Brito; Vládia Pinheiro; Vasco Furtado; João Araújo Monteiro Neto; Francisco das Chagas Jucá Bomfim; André Câmara Ferreira da Costa; Raquel Silveira

doi:10.5753/stil.2023.234217

Maurício Brito UNIFOR http://orcid.org/0009-0004-3639-8257
Vládia Pinheiro UNIFOR https://orcid.org/0000-0002-9851-8304
Vasco Furtado UNIFOR / ETICE https://orcid.org/0000-0001-8721-4308
João Araújo Monteiro Neto UNIFOR https://orcid.org/0000-0002-0690-2449
Francisco das Chagas Jucá Bomfim UNIFOR http://orcid.org/0000-0001-6160-7832
André Câmara Ferreira da Costa UNIFOR / Centro Universitário Christus https://orcid.org/0000-0001-8465-7031
Raquel Silveira IFCE https://orcid.org/0000-0001-7445-605X


Abstract

This paper presents the development of the Golden Collection of the Brazilian Judiciary (CDJUR-BR), a corpus of legal documents annotated with 21 specific entities. CDJUR-BR aims to provide a comprehensive and robust corpus for Named Entity Recognition (NER), comprising 44,526 annotations. In addition, a BERT-based NER model was developed that achieved an average macro-F1 of 0.58. These results indicate the importance and usefulness of CDJUR-BR.

Keywords: Golden Collection, Corpus Annotation, Named Entity Recognition, Legal AI
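
For readers unfamiliar with the metric named in the abstract, macro-F1 is the unweighted mean of the per-class F1 scores, so rare entity classes weigh as much as frequent ones. The sketch below is a minimal token-level illustration only: the function name `macro_f1`, the label names, and the toy sequences are hypothetical and are not taken from CDJUR-BR, and the evaluation protocol actually used by the authors (for example entity-level matching, or excluding the `O` class) is not stated here.

```python
from collections import Counter


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-label F1, computed over token-level labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1          # correct prediction for label g
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[g] += 1          # g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the label never occurs.
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data (assumption, for illustration only).
gold = ["PERSON", "PERSON", "LAW", "LAW", "O", "O"]
pred = ["PERSON", "O", "LAW", "LAW", "O", "LAW"]
labels = ["PERSON", "LAW", "O"]

score = macro_f1(gold, pred, labels)
print(f"macro-F1 = {score:.3f}")
```

Because every class contributes equally, a model that does well on frequent entities but poorly on rare ones can still end up with a modest macro-F1.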
