The Zé Lensky Dataset: A Brazilian Portuguese Twitter Corpus for Russo-Ukraine War Stance and Sentiment Analysis

  • Andreis G. M. Purim (UNICAMP)
  • Kārlis Kuškēvics (UCCA)

Abstract


This paper presents a work-in-progress corpus of over 200,000 Brazilian Portuguese tweets related to the Russia-Ukraine war, collected between 2022 and 2025. The dataset includes metadata and annotations for stance and irony, with particular attention to partisan and culturally specific expressions such as “zé lensky”, which we hypothesize indicate processes of sociolinguistic enregisterment in partisan speech. This resource enables further exploration of political discourse in Brazilian social media and supports future studies on narrative dynamics, user behavior, and informal political language. The dataset is available on GitHub.

Published
2025-09-29
PURIM, Andreis G. M.; KUŠKĒVICS, Kārlis. The Zé Lensky Dataset: A Brazilian Portuguese Twitter Corpus for Russo-Ukraine War Stance and Sentiment Analysis. In: BRAZILIAN SYMPOSIUM IN INFORMATION AND HUMAN LANGUAGE TECHNOLOGY (STIL), 16., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 564-571. DOI: https://doi.org/10.5753/stil.2025.37858.