Exploring corpus revision by comparing grammatical rules in syntactic patterns

  • Wellington José Leite da Silva FGV
  • Alexandre Rademaker FGV / IBM Research
  • Leonel Figueiredo de Alencar FGV / UFC

Abstract


Language resources such as corpora are fundamental to the development of text processing tools. One resource currently considered essential for Portuguese NLP is the UD Bosque corpus, part of the corpus collection of the Universal Dependencies (UD) project. Although UD Bosque originated from a manually revised (gold-standard) corpus, its current version contains several annotation consistency problems. In this work, we present a methodology for correcting problems in the corpus's morphological annotation; in particular, we correct the morphological agreement of adjectives, determiners, and nouns. We discuss the errors, exceptions, and non-trivial cases, the corrections we made, and the impact of the changes to the corpus on the training of statistical parsers.
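
As an illustration of the kind of agreement check the abstract describes, the following is a minimal sketch and not the authors' implementation. It assumes CoNLL-U input, restricts itself to `det` and `amod` dependents of a `NOUN` head, and compares only the `Gender` and `Number` features. The helper names (`parse_feats`, `parse_sentence`, `find_disagreements`) and the sample sentence with its deliberately wrong annotation are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumption: only these features and relations are checked in this sketch.
AGREEMENT_FEATURES = ("Gender", "Number")
DEPENDENT_RELATIONS = {"det", "amod"}


def parse_feats(feats: str) -> Dict[str, str]:
    """Turn a CoNLL-U FEATS string such as 'Gender=Fem|Number=Sing' into a dict."""
    if not feats or feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_sentence(block: str) -> List[dict]:
    """Parse one CoNLL-U sentence, skipping comments, multiword ranges and empty nodes."""
    tokens = []
    for line in block.strip().splitlines():
        if line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        tokens.append({
            "id": int(cols[0]),
            "form": cols[1],
            "upos": cols[3],
            "feats": parse_feats(cols[5]),
            "head": int(cols[6]) if cols[6].isdigit() else 0,
            "deprel": cols[7],
        })
    return tokens


def find_disagreements(tokens: List[dict]) -> List[Tuple[str, str, str, str, str]]:
    """Return (dependent, head, feature, dependent value, head value) for each mismatch."""
    by_id = {tok["id"]: tok for tok in tokens}
    issues = []
    for tok in tokens:
        if tok["deprel"] not in DEPENDENT_RELATIONS:
            continue
        head = by_id.get(tok["head"])
        if head is None or head["upos"] != "NOUN":
            continue
        for feat in AGREEMENT_FEATURES:
            dep_val = tok["feats"].get(feat)
            head_val = head["feats"].get(feat)
            if dep_val is None or head_val is None:
                continue  # a missing feature is not treated as a mismatch here
            # Multi-valued features (e.g. 'Fem,Masc') agree if they share any value.
            if set(dep_val.split(",")).isdisjoint(head_val.split(",")):
                issues.append((tok["form"], head["form"], feat, dep_val, head_val))
    return issues


# Hypothetical sentence: 'bonitas' is deliberately mis-annotated as Number=Sing.
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "as", "o", "DET", "_", "Definite=Def|Gender=Fem|Number=Plur|PronType=Art", "2", "det", "_", "_"),
    ("2", "casas", "casa", "NOUN", "_", "Gender=Fem|Number=Plur", "0", "root", "_", "_"),
    ("3", "bonitas", "bonito", "ADJ", "_", "Gender=Fem|Number=Sing", "2", "amod", "_", "_"),
])

if __name__ == "__main__":
    for issue in find_disagreements(parse_sentence(SAMPLE)):
        print(issue)
```

A flagged pair is only a candidate for revision: as the abstract notes, exceptions and non-trivial cases exist, so each hit would still need to be inspected before the annotation is changed.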

Published
2021-11-29
SILVA, Wellington José Leite da; RADEMAKER, Alexandre; ALENCAR, Leonel Figueiredo de. Explorando a revisão de corpora por meio da comparação de regras gramaticais em padrões sintáticos. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 13., 2021, Evento Online. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 19-28. DOI: https://doi.org/10.5753/stil.2021.17780.