Syntax in court: presenting and exploring a Portuguese legal corpus annotated according to the Universal Dependencies model
Abstract
Legal texts pose a challenge for Natural Language Processing because of their complexity and characteristic style. In this paper, we contribute to this area for Brazilian Portuguese on two fronts. First, we present PortJur, a new legal corpus syntactically annotated according to the Universal Dependencies model; this yields both a relevant linguistic study and a novel resource for Portuguese. Then, we explore the annotated corpus to create specialized lexical resources, such as lists of content words, verb forms, abbreviations and loanwords, and a gazetteer of named entities.
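As a rough illustration of the second front, the sketch below shows how lists of content words and verb forms might be derived from a corpus in the CoNLL-U format used by Universal Dependencies. It is not the authors' pipeline: the sample sentence is invented, the function names are hypothetical, and treating NOUN, VERB, ADJ and ADV as the content-word classes is an assumption made for this example.

```python
from collections import Counter

# Assumption: content words are tokens tagged with these UPOS classes.
CONTENT_UPOS = {"NOUN", "VERB", "ADJ", "ADV"}

# Invented sample in CoNLL-U format (10 tab-separated columns per token).
SAMPLE_CONLLU = """\
# text = O juiz julgou o recurso.
1\tO\to\tDET\t_\t_\t2\tdet\t_\t_
2\tjuiz\tjuiz\tNOUN\t_\t_\t3\tnsubj\t_\t_
3\tjulgou\tjulgar\tVERB\t_\t_\t0\troot\t_\t_
4\to\to\tDET\t_\t_\t5\tdet\t_\t_
5\trecurso\trecurso\tNOUN\t_\t_\t3\tobj\t_\t_
6\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_
"""


def iter_tokens(conllu_text):
    """Yield (form, lemma, upos) for each syntactic word in a CoNLL-U string."""
    for line in conllu_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment lines
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # skip malformed lines
        if "-" in cols[0] or "." in cols[0]:
            continue  # skip multiword-token ranges and empty nodes
        yield cols[1], cols[2], cols[3]


def content_lemmas(conllu_text):
    """Count lowercased lemmas of tokens whose UPOS is a content class."""
    return Counter(
        lemma.lower()
        for _, lemma, upos in iter_tokens(conllu_text)
        if upos in CONTENT_UPOS
    )


def verb_forms(conllu_text):
    """Map each verb lemma to the set of inflected forms seen in the corpus."""
    forms = {}
    for form, lemma, upos in iter_tokens(conllu_text):
        if upos == "VERB":
            forms.setdefault(lemma.lower(), set()).add(form.lower())
    return forms


if __name__ == "__main__":
    print(content_lemmas(SAMPLE_CONLLU).most_common())
    print(verb_forms(SAMPLE_CONLLU))
```

Abbreviations, loanwords and named entities would need further signals (for example the FEATS or MISC columns, or PROPN spans), which this sketch leaves out.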
Published
2025-09-29
How to Cite
LOPES, Lucelene; NUNES, Maria das Graças V.; DURAN, Magali S.; PARDO, Thiago A. S. Syntax in court: presenting and exploring a Portuguese legal corpus annotated according to the Universal Dependencies model. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 220-232. DOI: https://doi.org/10.5753/stil.2025.37827.
