A Multi-label Classification System to Distinguish among Fake, Satirical, Objective and Legitimate News in Brazilian Portuguese
Keywords: Fake News, Decision Support System, Text Mining, Multi-Label
The diffusion of fake news has increased significantly worldwide, especially in politics, where misinformation is propagated and surfaces in election debates around the world. However, news with a recreational purpose, such as satirical news, is often confused with objective fake news. In this work, we address the differences between the objectivity and the legitimacy of news documents, treating each article as belonging to two conceptual classes: objective/satirical and legitimate/fake. We therefore propose a DSS (Decision Support System) based on a Text Mining (TM) pipeline with a set of novel textual features, using multi-label methods to classify news articles on these two domains. To this end, a set of multi-label methods was evaluated with combinations of different base classifiers and then compared with a multi-class approach. A set of real-life news data was collected from several Brazilian news portals for these experiments. The results show that our DSS is adequate (0.80 F1-score) when addressing the scenario of misleading news from the challenging multi-label perspective, where the multi-class methods (0.01 F1-score) were outperformed by the proposed method. Moreover, we analyzed how each group of stylometric features used in the experiments influences the result, aiming to discover whether a particular group is more relevant than the others. This analysis indicated that the complexity group of features could be more relevant than the others.
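To make the two-domain labelling concrete, the following minimal sketch shows how an article's pair of binary labels (objective/satirical and legitimate/fake) could be represented, how the same pair maps to a single class in a multi-class formulation, and how an F1-score averaged over the two labels could be computed. The class names, the 0/1 encoding, the macro averaging, and the toy predictions are all assumptions made for illustration; they are not taken from the evaluated system.

```python
from typing import List, Tuple

# Each article carries two binary labels (encoding is an assumption):
#   position 0: 1 = satirical, 0 = objective
#   position 1: 1 = fake,      0 = legitimate
Label = Tuple[int, int]

# Hypothetical multi-class view: one class per label combination.
CLASS_NAMES = {
    (0, 0): "objective-legitimate",
    (0, 1): "objective-fake",
    (1, 0): "satirical-legitimate",
    (1, 1): "satirical-fake",
}


def to_multiclass(label: Label) -> str:
    """Collapse the two binary labels into a single combined class name."""
    return CLASS_NAMES[label]


def f1_binary(y_true: List[int], y_pred: List[int]) -> float:
    """F1-score of the positive class for one binary label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1_per_label(y_true: List[Label], y_pred: List[Label]) -> float:
    """Average the per-label F1-scores over the two label positions."""
    scores = [
        f1_binary([y[j] for y in y_true], [y[j] for y in y_pred])
        for j in range(2)
    ]
    return sum(scores) / len(scores)


# Toy ground truth and predictions (invented for illustration only).
y_true = [(0, 0), (0, 1), (1, 1), (1, 0)]
y_pred = [(0, 0), (0, 1), (1, 0), (1, 0)]

print(to_multiclass((1, 1)))
print(round(macro_f1_per_label(y_true, y_pred), 4))
```

In the multi-label view each label position is scored separately, so a prediction that gets the satirical/objective dimension right but the fake/legitimate dimension wrong still earns partial credit, whereas the combined multi-class view would count it as a plain error.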