Natural Language Processing and Social Media: a systematic mapping on Brazilian leading events
Abstract
The number of social media platforms has increased significantly, as has the number of active users. More than 18.2 million text messages are transmitted every minute on these platforms. Given this volume, several researchers have used Natural Language Processing (NLP) techniques to analyze the large amount of unstructured data available. Thus, it is essential to understand the main trends and challenges of social media analysis. From this perspective, this study presents a systematic mapping of NLP for social media analysis considering papers published in five well-established Brazilian academic events: BRACIS, BraSNAM, ENIAC, STIL, and PROPOR. The study aims to identify the main tools and techniques used, tasks performed, data sources, and evaluation measures. For this purpose, 186 studies were carefully selected from the 654 papers published in these events over three years (2020 to 2022) and then analyzed. The results offer a glimpse of the current scenario on the subject and point out areas that can be improved in future research with techniques for tasks such as text classification, sentiment analysis, and named-entity recognition. Therefore, this work can be helpful for academics interested in exploring the potential of NLP for social media analysis and in gaining a clear view of gaps, challenges, and research opportunities in this area. Furthermore, it can guide the productive sector in this knowledge transfer, reducing the gap between the state of the art and practice and consequently increasing the competitiveness and innovation of social media analysis tools.