Verifica-UD: a Verifier for Universal Dependencies Annotation for Portuguese

Abstract


This paper presents Verifica-UD, a web-based tool that detects problems in Portuguese sentences annotated according to the Universal Dependencies (UD) standard and stored as a CoNLL-U file. The tool verifies each sentence at three levels: structural (CoNLL-U compliance), morphosyntactic (part-of-speech tagging), and syntactic (parsing information). Verifica-UD also provides detailed help on the Portuguese UD annotation guidelines. An experiment illustrates the benefits of the tool for reviewing annotated corpora.
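To make the three verification levels concrete, the sketch below runs one very simple check per level on a single CoNLL-U sentence. It is an illustrative assumption, not the rule set implemented by Verifica-UD: the function name `verify_sentence` is hypothetical, and the checks (ten tab-separated fields, a UPOS tag from the 17 universal tags, exactly one root) come from the general CoNLL-U and UD conventions rather than from the paper.

```python
# The 17 universal part-of-speech tags defined by Universal Dependencies.
UPOS_TAGS = {
    "ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X",
}


def verify_sentence(lines):
    """Return a list of problem messages for one sentence (hypothetical helper)."""
    problems = []
    roots = 0
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        fields = line.split("\t")
        # Structural level: a token line has exactly 10 tab-separated fields.
        if len(fields) != 10:
            problems.append(f"structural: expected 10 fields, found {len(fields)}")
            continue
        token_id, upos, head, deprel = fields[0], fields[3], fields[6], fields[7]
        if not token_id.isdigit():
            continue  # multiword ranges (e.g. 1-2) and empty nodes (e.g. 1.1) are skipped here
        # Morphosyntactic level: the UPOS tag must be one of the universal tags.
        if upos not in UPOS_TAGS:
            problems.append(f"morphosyntactic: token {token_id} has unknown UPOS '{upos}'")
        # Syntactic level: HEAD 0 marks the root, which must carry the 'root' relation.
        if head == "0":
            roots += 1
            if deprel != "root":
                problems.append(f"syntactic: token {token_id} has HEAD 0 but DEPREL '{deprel}'")
    if roots != 1:
        problems.append(f"syntactic: expected exactly one root, found {roots}")
    return problems


# A small well-formed sentence ("Ela dorme.") built field by field.
good = [
    "# text = Ela dorme.",
    "\t".join(["1", "Ela", "ela", "PRON", "_", "_", "2", "nsubj", "_", "_"]),
    "\t".join(["2", "dorme", "dormir", "VERB", "_", "_", "0", "root", "_", "_"]),
    "\t".join(["3", ".", ".", "PUNCT", "_", "_", "2", "punct", "_", "_"]),
]
print(verify_sentence(good))
```

A real verifier would add language-specific rules on top of these generic ones, for example checking lemma and feature combinations against a Portuguese lexicon.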

Keywords: NLP resources for Portuguese, UD validation tool, Universal Dependencies, annotation verifier, Portuguese language

Published
25/09/2023
LOPES, Lucelene; DURAN, Magali Sanches; PARDO, Thiago Alexandre Salgueiro. Verifica-UD: a Verifier for Universal Dependencies Annotation for Portuguese. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 443-452. DOI: https://doi.org/10.5753/stil.2023.25485.