A comparative analysis of text embedding approach to extract named entities in Portuguese legal documents
Abstract
The initial petition is one of the most important components of a civil litigation process, and automating its analysis might reduce the time needed to conclude the postulatory phase. The parties' qualification section is the part of the petition that presents information about the entities involved in the process. This paper proposes applying named entity recognition techniques to the problem of information extraction from initial petitions. To this end, a corpus of parties' qualification sections was built from documents extracted from Brazilian courts. Seven BiLSTM-CRF models with distinct combinations of word vector representations were trained, evaluated, and compared to investigate how these representations affect the performance of this architecture and, in this way, improve the recognition of entities in legal texts. Unlike other works based on BiLSTM-CRF for NER tasks in the legal domain, this research emphasizes not the architectures employed but the text representation methods used. The experiments performed on the developed corpus show that stacking character, word, and pooled FLAIR embeddings is the combination that yields the best performance from BiLSTM-CRF hybrid models among those evaluated.
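To make the notion of stacking concrete, the sketch below concatenates three per-token vectors (character-level, word-level, and a pooled contextual vector) into a single representation that a BiLSTM-CRF tagger could consume. It is only an illustration under stated assumptions and not the implementation used in this work: the toy embedders, their dimensions, the lookup values, and the choice of mean pooling are all hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]
Embedder = Callable[[List[str], int], Vector]


def char_embedding(tokens: List[str], i: int) -> Vector:
    # Hypothetical character-level features: token length and a capitalization flag.
    tok = tokens[i]
    return [float(len(tok)), 1.0 if tok[:1].isupper() else 0.0]


def word_embedding(tokens: List[str], i: int) -> Vector:
    # Hypothetical static lookup table; unknown words map to a zero vector.
    table: Dict[str, Vector] = {"autor": [0.1, 0.9], "réu": [0.8, 0.2]}
    return table.get(tokens[i].lower(), [0.0, 0.0])


class PooledContextEmbedding:
    """Toy pooled contextual embedding (assumption: mean pooling).

    Keeps every contextual vector seen so far for a surface form and
    appends their mean to the current contextual vector.
    """

    def __init__(self) -> None:
        self.memory: Dict[str, List[Vector]] = {}

    def __call__(self, tokens: List[str], i: int) -> Vector:
        # Hypothetical "contextual" vector: relative position and sentence length.
        current = [i / len(tokens), float(len(tokens))]
        history = self.memory.setdefault(tokens[i], [])
        history.append(current)
        pooled = [sum(col) / len(history) for col in zip(*history)]
        return current + pooled


def stack(embedders: List[Embedder], tokens: List[str]) -> List[Vector]:
    # Stacking here means concatenating the outputs of all embedders per token.
    return [[x for emb in embedders for x in emb(tokens, i)]
            for i in range(len(tokens))]


tokens = ["Autor", "João", "Silva"]
vectors = stack([char_embedding, word_embedding, PooledContextEmbedding()], tokens)
```

Each token ends up with one vector whose length is the sum of the individual embedding sizes (2 + 2 + 4 in this toy setup), so adding or removing an embedding type changes only the input dimension of the tagger.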
