A comparative analysis of text embedding approach to extract named entities in Portuguese legal documents

Hyan H. N. Batista; André C. A. Nascimento; Rafael Ferreira Melo; Péricles B. C. Miranda; Isabel W. S. Maldonado; José L. M. Coelho Filho

doi:10.5753/eniac.2021.18255

Hyan H. N. Batista (UFRPE)
André C. A. Nascimento (UFRPE)
Rafael Ferreira Melo (UFRPE)
Péricles B. C. Miranda (UFRPE)
Isabel W. S. Maldonado (NESS Law)
José L. M. Coelho Filho (NESS Law)

Abstract

The initial petition (petição inicial) is one of the most important components of a civil lawsuit, so automating the analysis of these documents can reduce the time needed to complete the pleading phase. The party qualification section, in turn, is the part of the document that presents information about the entities involved in the case. This paper proposes applying named entity extraction techniques to the problem of identifying and extracting information from initial petitions. To that end, a dataset was built from the party qualification sections of initial petitions taken from cases in Brazilian courts. Seven BiLSTM-CRF models, each using a distinct combination of word vector representations, were trained, evaluated, and compared in order to investigate how these representations affect the performance of an algorithm with this architecture and, in this way, to improve the recognition of legal entities in legal texts. Unlike other BiLSTM-CRF-based work on NER in the legal domain, this research emphasizes not the architectures employed but the text representation methods used. Experiments on the developed corpus show that stacking character embeddings, word embeddings, and pooled FLAIR embeddings is the preferable combination for obtaining the best possible performance from hybrid BiLSTM-CRF models.
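To make the idea of "stacking" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that stacking means concatenating, for each token, the vectors produced by several embedders, and that a pooled contextual embedding concatenates the current contextual vector with the mean of all contextual vectors seen so far for the same word. The three toy embedders (`char_embedding`, `word_embedding`, `contextual_embedding`) and the class `PooledContextual` are hypothetical stand-ins for trained models, and the BiLSTM-CRF tagger that would consume the stacked vectors is omitted.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[List[str], int], Vector]


def char_embedding(tokens: List[str], i: int) -> Vector:
    """Toy character-level features (hypothetical stand-in for a learned character model)."""
    word = tokens[i]
    return [
        float(len(word)),
        float(word[:1].isupper()),
        float(any(c.isdigit() for c in word)),
    ]


def word_embedding(tokens: List[str], i: int) -> Vector:
    """Toy static word vector: the same word gets the same vector in every context."""
    word = tokens[i].lower()
    return [float(sum(map(ord, word)) % 7), float(len(set(word)))]


def contextual_embedding(tokens: List[str], i: int) -> Vector:
    """Toy contextual vector: depends on the neighbouring tokens, not only on the word."""
    left = tokens[i - 1] if i > 0 else ""
    right = tokens[i + 1] if i + 1 < len(tokens) else ""
    return [float(len(left)), float(len(right))]


class PooledContextual:
    """Remembers every contextual vector seen for a word and mean-pools them."""

    def __init__(self, base: Embedder) -> None:
        self.base = base
        self.memory: Dict[str, List[Vector]] = defaultdict(list)

    def __call__(self, tokens: List[str], i: int) -> Vector:
        current = self.base(tokens, i)
        history = self.memory[tokens[i]]
        history.append(current)
        # Mean over all occurrences of this word seen so far, per dimension.
        pooled = [sum(column) / len(history) for column in zip(*history)]
        # Current contextual vector followed by its pooled counterpart.
        return current + pooled


def stack(embedders: List[Embedder], tokens: List[str]) -> List[Vector]:
    """Concatenate the outputs of all embedders for each token."""
    return [
        [value for embed in embedders for value in embed(tokens, i)]
        for i in range(len(tokens))
    ]


# Illustrative input: a fragment such as might appear in a party qualification section.
tokens = ["Maria", "Silva", "brasileira"]
pooled = PooledContextual(contextual_embedding)
vectors = stack([char_embedding, word_embedding, pooled], tokens)
print(len(vectors), len(vectors[0]))
```

Each token ends up with one vector whose length is the sum of the individual embedding sizes. In a full system this stacked vector would be the per-token input to the BiLSTM-CRF sequence tagger.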

