Annotation of fixed Multiword Expressions (MWEs) in a Portuguese Universal Dependencies (UD) treebank: Gathering candidates from three different sources


Delimiting and correctly annotating multiword expressions (MWEs) is an important task in constructing a gold-standard treebank. In this paper, we apply three methods to the PetroGold corpus to identify MWE candidates: (1) reusing expressions previously identified by the PALAVRAS annotator, (2) statistical analysis of collocations in Petrolês, a larger non-annotated corpus, and (3) a curated list of co-occurring words from the POeTiSA project. After extensive filtering and alignment with the Universal Dependencies (UD) guidelines, we revised the annotation of 2,467 MWEs in the PetroGold corpus, tested a new part-of-speech (POS) annotation for the words that make up MWEs, and provide two machine-readable resources to assist other annotators.
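
As a rough illustration of method (2), the sketch below ranks adjacent word pairs in a tokenized corpus by pointwise mutual information (PMI). The abstract does not say which association measure was used, so PMI is an assumption here, and the function name `pmi_bigrams`, the `min_count` threshold, and the toy sentence are hypothetical.

```python
import math
from collections import Counter


def pmi_bigrams(tokens, min_count=2):
    """Rank adjacent word pairs by pointwise mutual information (PMI).

    Assumption: PMI stands in for whatever collocation statistic was
    actually applied to the larger corpus; this is only a sketch.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = len(tokens)
    n_bi = max(len(tokens) - 1, 1)

    scores = {}
    for (w1, w2), count in bigrams.items():
        # Skip rare pairs: PMI is unreliable for low-frequency events.
        if count < min_count:
            continue
        p_pair = count / n_bi
        p_w1 = unigrams[w1] / n_uni
        p_w2 = unigrams[w2] / n_uni
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))

    # Highest-scoring pairs first; these are only MWE *candidates*.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy input (hypothetical); a real run would use the full tokenized corpus.
toy_tokens = (
    "de acordo com o plano e de acordo com a norma o poço a bacia".split()
)
candidates = pmi_bigrams(toy_tokens, min_count=2)
for pair, score in candidates:
    print(" ".join(pair), round(score, 2))
```

Candidates ranked this way would still need the manual filtering and alignment with the UD guidelines described above, since a high association score alone does not make a word pair a fixed expression.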

Keywords: Multiword Expressions (MWEs), Universal Dependencies (UD), Corpus annotation, Natural Language Processing, Collocation analysis


DE SOUZA, Elvis; FREITAS, Cláudia. Annotation of fixed Multiword Expressions (MWEs) in a Portuguese Universal Dependencies (UD) treebank: Gathering candidates from three different sources. In: SIMPÓSIO BRASILEIRO DE TECNOLOGIA DA INFORMAÇÃO E DA LINGUAGEM HUMANA (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 434-442. DOI: