Typology of orthographic and lexical phenomena in UCG: the case of stock market tweets

Abstract


Twitter is an attractive source of information for several Natural Language Processing (NLP) applications, especially sentiment analysis and opinion mining. In this paper, we present a systematic description of orthographic and lexical phenomena in a corpus of Portuguese tweets from the stock market domain. Based on this description, we propose a typology of the phenomena that could support two goals: the definition of annotation guidelines for treating them within the Universal Dependencies framework of syntactic analysis, and the development of NLP applications that perform term disambiguation or probabilistic ranking of options, as spelling checkers do when presenting suggestions to users.

Keywords: corpus, tweet, linguistic phenomenon

Published
2023-09-25
SCANDAROLLI, Clarissa Lenina; DI FELIPPO, Ariani; ROMAN, Norton Trevisan; PARDO, Thiago A. S. Typology of orthographic and lexical phenomena in UCG: the case of stock market tweets. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 14., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 240-248. DOI: https://doi.org/10.5753/stil.2023.233948.