PetroGold – A gold standard corpus for the oil domain
Abstract
This paper describes the creation of PetroGold, a gold standard treebank for the oil & gas domain. It is composed of theses, dissertations and monographs, contains 9,127 sentences (253,640 tokens) and has morphosyntactic annotation of dependencies according to the Universal Dependencies approach. We detail some of the linguistic challenges that the domain poses for syntactic annotation and assess the quality of the corpus through an intrinsic evaluation: a model trained on the corpus with the UDPipe tool achieves 90.65%, 88.53% and 82.88% on the UAS, LAS and CLAS measures, respectively.
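The three measures reported above can be illustrated with a minimal sketch. UAS is the share of tokens whose predicted head is correct, LAS additionally requires the correct dependency relation, and CLAS restricts the labeled score to content words. The code below is a simplified illustration, not the evaluation procedure used for PetroGold: it assumes gold tokenization, treats the set `NON_CONTENT` as the function-word and punctuation relations (an assumption based on the common CLAS definition), and uses a made-up five-token sentence. All function and variable names are hypothetical.

```python
from typing import Dict, List, Tuple

# A token is reduced to (head index, dependency relation); 0 marks the root.
Token = Tuple[int, str]

# Relations excluded from CLAS in this sketch (function words and punctuation).
# This set is an assumption, not taken from the paper.
NON_CONTENT = {"aux", "case", "cc", "clf", "cop", "det", "mark", "punct"}


def base(rel: str) -> str:
    """Drop a language-specific subtype, e.g. 'nsubj:pass' -> 'nsubj'."""
    return rel.split(":")[0]


def is_content(rel: str) -> bool:
    return base(rel) not in NON_CONTENT


def attachment_scores(gold: List[Token], pred: List[Token]) -> Dict[str, float]:
    """Compute UAS, LAS and a simplified CLAS (F1 over content words)."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and equally long")

    def labeled_match(g: Token, p: Token) -> bool:
        return g[0] == p[0] and base(g[1]) == base(p[1])

    n = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, pred))
    las_hits = sum(labeled_match(g, p) for g, p in zip(gold, pred))

    # CLAS: count content words on each side, and labeled matches on gold content words.
    gold_content = sum(is_content(g[1]) for g in gold)
    pred_content = sum(is_content(p[1]) for p in pred)
    clas_hits = sum(
        is_content(g[1]) and labeled_match(g, p) for g, p in zip(gold, pred)
    )
    precision = clas_hits / pred_content if pred_content else 0.0
    recall = clas_hits / gold_content if gold_content else 0.0
    denom = precision + recall
    clas = 2 * precision * recall / denom if denom else 0.0

    return {"UAS": uas_hits / n, "LAS": las_hits / n, "CLAS": clas}


# Invented example: "O poço produz óleo ." (not taken from the corpus).
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "obj"), (3, "punct")]
# The parser mislabels "óleo" and attaches the final period to the wrong head.
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "nmod"), (2, "punct")]

scores = attachment_scores(gold, pred)
print(scores)
```

In this toy sentence the punctuation error lowers UAS and LAS but leaves CLAS untouched, while the mislabeled object lowers LAS and CLAS but not UAS. This is why CLAS is usually the lowest of the three figures: it ignores the comparatively easy function-word and punctuation attachments.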
