verBERT: Automating Brazilian Case Law Document Multi-label Categorization Using BERT

Felipe R. Serras; Marcelo Finger

doi:10.5753/stil.2021.17803

Felipe R. Serras USP
Marcelo Finger USP

Abstract

In this work, we study the use of attention-based algorithms to automate the categorization of Brazilian case law documents. We used data from the Kollemata Project to produce two distinct datasets with adequate class systems. We then implemented a multi-class, multi-label version of BERT and fine-tuned different BERT models on these datasets. We evaluated several metrics, adopting the micro-averaged F1-score as our main metric, for which we obtained 〈F₁〉_micro = 0.72, a gain of 30 percentage points over the tested statistical baseline.
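As an illustration of the main metric, the following minimal sketch computes a micro-averaged F1-score for multi-label predictions by pooling true positives, false positives, and false negatives over all documents and labels before computing a single F1 value. The function name `micro_f1` and the toy label matrices are assumptions introduced here for illustration only; they are not taken from the paper's implementation or data.

```python
from typing import Sequence


def micro_f1(y_true: Sequence[Sequence[int]], y_pred: Sequence[Sequence[int]]) -> float:
    """Micro-averaged F1 for binary indicator matrices.

    Rows are documents, columns are labels; each entry is 0 or 1.
    Counts are pooled over every (document, label) pair.
    """
    tp = fp = fn = 0
    for true_row, pred_row in zip(y_true, y_pred):
        for t, p in zip(true_row, pred_row):
            if t == 1 and p == 1:
                tp += 1  # label correctly assigned
            elif t == 0 and p == 1:
                fp += 1  # label assigned but not in the gold set
            elif t == 1 and p == 0:
                fn += 1  # gold label that was missed
    denom = 2 * tp + fp + fn
    # Convention assumed here: return 0.0 when there is nothing to score.
    return 2 * tp / denom if denom else 0.0


# Hypothetical toy example: 3 documents, 3 labels.
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
y_pred = [[1, 0, 0], [0, 1, 1], [1, 0, 0]]
score = micro_f1(y_true, y_pred)
print(f"micro-averaged F1 = {score:.2f}")
```

Because the counts are pooled before the ratio is taken, frequent labels weigh more heavily in the micro average than rare ones, which is the usual reason to report it alongside other averages in multi-label settings.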