Disambiguation of lemma and morphological attributes in the annotation of the Porttinari-base corpus
Abstract
This paper reports the process of disambiguating lemmas and morphological features in a Portuguese corpus annotated with the Universal Dependencies tagset. We explain the strategies adopted to simplify and reduce the annotators' workload. These strategies help improve the accuracy of linguistic annotation, which is fundamental for various Natural Language Processing tasks.
References
Branco, A., Silva, J. R., Gomes, L., and António Rodrigues, J. (2022). Universal grammatical dependencies for Portuguese with CINTIL data, LX processing and CLARIN support. In Calzolari, N., Béchet, F., Blache, P., Choukri, K., Cieri, C., Declerck, T., Goggi, S., Isahara, H., Maegaard, B., Mariani, J., Mazo, H., Odijk, J., and Piperidis, S., editors, Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 5617–5626, Marseille, France. European Language Resources Association.
de Marneffe, M.-C., Manning, C. D., Nivre, J., and Zeman, D. (2021). Universal Dependencies. Computational Linguistics, 47(2):255–308.
Duran, M., Lopes, L., das Graças Nunes, M., and Pardo, T. (2023). The dawn of the Porttinari multigenre treebank: Introducing its journalistic portion. In Anais do XIV Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 115–124, Porto Alegre, RS, Brasil. SBC.
Duran, M., Lopes, L., and Pardo, T. (2021). Descrição de numerais segundo modelo Universal Dependencies e sua anotação no português. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 344–352, Porto Alegre, RS, Brasil. SBC.
Duran, M. S., Oliveira, H., and Scandarolli, C. (2022). Que simples que nada: a anotação da palavra que em corpus de UD. In Pardo, T. A. S., Di-Felippo, A., and Roman, N. T., editors, Proceedings of the Universal Dependencies Brazilian Festival, pages 1–11, Fortaleza, Brazil. Association for Computational Linguistics.
Gamba, F. and Zeman, D. (2023). Universalising Latin Universal Dependencies: a harmonisation of Latin treebanks in UD. In Grobol, L. and Tyers, F., editors, Proceedings of the Sixth Workshop on Universal Dependencies (UDW, GURT/SyntaxFest 2023), pages 7–16, Washington, D.C. Association for Computational Linguistics.
Goldberg, Y. (2015). A primer on neural network models for natural language processing. CoRR, abs/1510.00726.
Lopes, L., Duran, M., Fernandes, P., and Pardo, T. (2022). PortiLexicon-UD: a Portuguese lexical resource according to Universal Dependencies model. In Proceedings of the Language Resources and Evaluation Conference, pages 6635–6643, Marseille, France. European Language Resources Association.
Lopes, L., Fernandes, P., Inacio, M. L., Duran, M. S., and Pardo, T. A. S. (2023). Disambiguation of Universal Dependencies part-of-speech tags of closed class words in Portuguese. In Naldi, M. C. and Bianchi, R. A. C., editors, Intelligent Systems, pages 241–255, Cham. Springer Nature Switzerland.
Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., Petrov, S., Pyysalo, S., Silveira, N., Tsarfaty, R., and Zeman, D. (2016). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 1659–1666, Portoroz, Slovenia. ELRA.
Rademaker, A., Chalub, F., Real, L., Freitas, C., Bick, E., and de Paiva, V. (2017). Universal Dependencies for Portuguese. In Montemagni, S. and Nivre, J., editors, Proceedings of the Fourth International Conference on Dependency Linguistics (Depling 2017), pages 197–206, Pisa, Italy. Linköping University Electronic Press.
Souza, E., Silveira, A., Cavalcanti, T., Castro, M., and Freitas, C. (2021). Petrogold – corpus padrão ouro para o domínio do petróleo. In Anais do XIII Simpósio Brasileiro de Tecnologia da Informação e da Linguagem Humana, pages 29–38, Porto Alegre, RS, Brasil. SBC.
Universal Dependencies (2023). CoNLL-U format - UD version 2. Accessed: 2021-06-14.
Published
2024-11-17
How to Cite
LOPES, Lucelene; DURAN, Magali S.; PARDO, Thiago Alexandre Salgueiro. Disambiguation of lemma and morphological attributes in the annotation of the Porttinari-base corpus. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 15., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 336-345. DOI: https://doi.org/10.5753/stil.2024.245213.
