Automatic Credibility Inference of Websites: Domain Features and Geolocation to Combat Fake News

  • Marcos Paulo Cezar de Mendonça (UFF)
  • Igor Monteiro Moraes (UFF)
  • Diogo Menezes Ferrazani Mattos (UFF)

Abstract


Evaluating the credibility of websites that publish news is critical to combating disinformation. Low-reliability websites are sometimes identified as the origin of fake news that is then propagated and amplified on social networks. This article proposes a method for automatically evaluating website credibility without scanning the site’s full content. Unlike previous works, which focus on social networks, it uses publicly available website features, such as domain characteristics, geolocation, and TLS certificates, to distinguish reliable from unreliable websites with supervised machine learning techniques. The article proposes a supervised learning model and consolidates a dataset of reliable and unreliable sites. The model was trained and evaluated on disjoint data and identified reliable and unreliable websites with an accuracy greater than 75%, contributing to the fight against the spread of fake news and disinformation.
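As a rough illustration of the kind of pipeline the abstract describes, the sketch below turns publicly observable site attributes into a numeric feature vector and splits labeled sites into disjoint training and test sets. The abstract does not list the exact features or the classifier used, so everything here is an assumption made for illustration: the feature names, the `valid_tls` and `country` fields, and the helper functions `domain_features`, `to_vector`, `disjoint_split`, and `accuracy` are all hypothetical. Only the Python standard library is used.

```python
import random
from urllib.parse import urlparse


def domain_features(url):
    """Lexical features of the host name (hypothetical feature set)."""
    host = urlparse(url).hostname or ""
    labels = host.split(".") if host else []
    return {
        "host_length": len(host),
        "num_labels": len(labels),
        "num_digits": sum(ch.isdigit() for ch in host),
        "num_hyphens": host.count("-"),
        "tld": labels[-1] if labels else "",
    }


def to_vector(site, countries):
    """Build a numeric vector from a site record.

    `site` is assumed to be a dict with keys "url", "valid_tls" (bool)
    and "country" (geolocation code); these field names are illustrative.
    The country is one-hot encoded against the `countries` list.
    """
    feats = domain_features(site["url"])
    vec = [
        float(feats["host_length"]),
        float(feats["num_labels"]),
        float(feats["num_digits"]),
        float(feats["num_hyphens"]),
        1.0 if site["valid_tls"] else 0.0,
    ]
    vec += [1.0 if site["country"] == c else 0.0 for c in countries]
    return vec


def disjoint_split(items, test_fraction=0.25, seed=0):
    """Shuffle and split so that no item appears in both sets."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

The vectors produced by `to_vector` could be passed to any supervised classifier trained on the first part of the split and scored with `accuracy` on the second. Which model and which features yield the reported accuracy above 75% is a matter for the full paper, not this sketch.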

Published
2024-07-21
MENDONÇA, Marcos Paulo Cezar de; MORAES, Igor Monteiro; MATTOS, Diogo Menezes Ferrazani. Automatic Credibility Inference of Websites: Domain Features and Geolocation to Combat Fake News. In: WORKSHOP ON PERFORMANCE OF COMPUTER AND COMMUNICATION SYSTEMS (WPERFORMANCE), 23., 2024, Brasília/DF. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 61-72. ISSN 2595-6167. DOI: https://doi.org/10.5753/wperformance.2024.2722.