Named Entities in Stock Market Tweets: A Fine-Grained and Linguistically-Motivated Annotation

  • Laís Piai UFSCar
  • Ariani Di-Felippo UFSCar
  • Norton Trevisan Roman USP

Abstract


This work provides a second look at the Named Entity annotation of DANTEStocks, a corpus of stock market tweets in Portuguese, offering an in-depth analysis of the linguistic decisions involved in creating a gold-standard annotation tailored to this genre and domain. Our methodology builds on the guidelines of the Second HAREM evaluation, extending and reinterpreting them for this genre and domain. The article then analyzes the linguistic phenomena that pose challenges to this task, proposes specific strategies for entity delimitation and classification, and presents a linguistic characterization of the corpus based on the class distribution that resulted from the annotation.

Published
2025-09-29
PIAI, Laís; DI-FELIPPO, Ariani; ROMAN, Norton Trevisan. Named Entities in Stock Market Tweets: A Fine-Grained and Linguistically-Motivated Annotation. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 654-663. DOI: https://doi.org/10.5753/stil.2025.37868.