An Approach for Detecting Relevant Topics in Online Social Networks
Abstract
The popularization of social networks has increased the amount of information shared daily, making these networks a source of information about a wide range of events. However, this information is difficult to interpret: the events span diverse contexts, and the high processing cost of removing noise makes recognizing relevant information challenging. In this context, this work proposes an approach to characterize information relevant to events by extracting topics from data shared on Twitter, using the phases of the Lava Jato operation in 2016 as a study scenario. We evaluated three machine learning methods (K-means, LDA, and NMF) and compared pre-processing techniques for cleaning texts to observe whether they improve the algorithms' performance. In addition, we use the Silhouette technique to find the best number of clusters, eliminating the need to rank relevant topics. Our results demonstrate that the approach is able to monitor social networks to characterize events when NMF is combined with named entity recognition.
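As an illustration of the Silhouette step mentioned in the abstract, the following is a minimal sketch of choosing the number of clusters by maximizing the mean silhouette score. It is not the implementation used in this work: the toy 2-D `points` stand in for vectorized tweets, and the helper names (`kmeans`, `silhouette`, `best_k`) and the deterministic farthest-point initialization are assumptions made only to keep the example small and reproducible.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, iters=50):
    """Plain K-means; returns one cluster label per point."""
    # Assumption: deterministic farthest-point initialization for reproducibility.
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center.
        labels = [min(range(k), key=lambda j: dist(p, centers[j])) for p in points]
        new_centers = []
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        if new_centers == centers:
            break
        centers = new_centers
    return labels

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton cluster or only one cluster: score 0
            continue
        a = sum(own) / len(own)                                  # mean intra-cluster distance
        b = min(sum(d) / len(d) for d in by_cluster.values())   # nearest other cluster
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

def best_k(points, candidates):
    """Return the candidate k with the highest mean silhouette score."""
    return max(candidates, key=lambda k: silhouette(points, kmeans(points, k)))

# Hypothetical toy data: three well-separated groups.
points = [(0, 0), (0, 1), (1, 0),
          (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

if __name__ == "__main__":
    print(best_k(points, range(2, 6)))
```

In practice the points would be document vectors (for example TF-IDF rows), and a library implementation of K-means and the silhouette score would likely replace these hand-written helpers.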
