Named Entity Recognition Approaches Applied to Legal Document Segmentation
Abstract
Document Segmentation divides a document into smaller parts, known as segments, whose shared characteristics allow machines to distinguish between them. It is often useful to classify these segments as well, which makes the problem a two-step one: (I) extracting the segments; and (II) annotating them. Named Entity Recognition (NER) aims to identify and classify entities within a text, so it involves the same two steps: extraction and classification. In this study, we tackle Document Segmentation and the annotation of the resulting segments through NER approaches, using CRF, CNN-CNN-LSTM and CNN-biLSTM-CRF models. The study focuses on Brazilian legal documents and proposes a data set of 127 annotated Portuguese texts from the Official Gazette of the Federal District, published between 2001 and 2015. The experiments were conducted with word-based and sentence-based models, and the sentence-based CRF model showed the best results.
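To make the framing of segmentation as a sequence-labeling task concrete, the sketch below shows one common way per-sentence tags can be turned into labeled segments. It is only an illustration under stated assumptions: the BIO tagging scheme, the function name `bio_to_segments`, and the segment labels `DECREE` and `ORDINANCE` are hypothetical and are not taken from the study.

```python
from typing import List, Optional, Tuple

# (label, start index, end index exclusive), indices refer to sentences
Segment = Tuple[str, int, int]


def bio_to_segments(tags: List[str]) -> List[Segment]:
    """Group per-sentence BIO tags into labeled, contiguous segments.

    Assumes tags look like "B-X", "I-X" or "O" (an assumed scheme).
    A stray "I-X" that does not continue an open "X" segment is
    treated as the start of a new segment.
    """
    segments: List[Segment] = []
    label: Optional[str] = None
    start = 0
    for i, tag in enumerate(tags):
        prefix, _, name = tag.partition("-")
        continues = prefix == "I" and name == label
        if not continues:
            # Close the segment that was open, if any.
            if label is not None:
                segments.append((label, start, i))
            # Open a new segment on B-/I- tags; "O" leaves none open.
            if prefix in ("B", "I"):
                label, start = name, i
            else:
                label = None
    if label is not None:
        segments.append((label, start, len(tags)))
    return segments


# Hypothetical tags for six consecutive sentences of one document.
tags = ["B-DECREE", "I-DECREE", "O",
        "B-ORDINANCE", "I-ORDINANCE", "I-ORDINANCE"]
print(bio_to_segments(tags))
# [('DECREE', 0, 2), ('ORDINANCE', 3, 6)]
```

In this view, a sentence-based model predicts one tag per sentence and a word-based model one tag per token; the same grouping step recovers both the boundaries (step I) and the labels (step II) of the segments.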