Sentiment analysis of content shared in Brazilian Reddit communities: Evaluation of a human-annotated dataset
Abstract
The soaring use of social media and its impact on society have raised ethical concerns about the content these platforms disseminate, particularly from the perspective of responsible AI, given the need to mitigate the propagation of bias and the spread of toxic language. Sentiment Analysis of the language used in these communities is challenging because it requires high-quality datasets suitable for supervised model training. The social network Reddit comprises smaller sub-communities centered on specific topics, called Subreddits. Through manual annotation of posts in Subreddits related to Brazilian content and communities, we developed a dataset for Sentiment Analysis in Brazilian Portuguese. We report the results of our annotation process and characterize the language of the posts. Our dataset is meant to support Sentiment Analysis tasks for social media language in Brazilian Portuguese.