Phishing Guardian: Detection of Phishing Sites Using Machine Learning
Abstract
Phishing remains one of the cyber threats with the greatest financial and social impact. This paper investigates the effectiveness of machine learning techniques in detecting malicious URLs, addressing gaps in prior work related to incomplete datasets and the lack of systematic comparisons between algorithms. It uses a dataset of 50,261 URLs (55.5% malicious) collected from public sources and active scanning. Random Forest, XGBoost, and SVM classifiers are trained with cross-validation, with XGBoost achieving 99.51% accuracy. The resulting tool combines the classifier with a browser extension that displays non-intrusive alerts, so that warnings do not disrupt the user experience.
References
Abu-Nimeh, S., Nappa, D., Wang, X., and Nair, S. (2007). A comparison of machine learning techniques for phishing detection. In Proceedings of the Anti-Phishing Working Groups 2nd Annual ECrime Researchers Summit, pages 60–69.
Ahmad, R., Alsmadi, I., Alhamdani, W., and Tawalbeh, L. (2023). Zero-day attack detection: a systematic literature review. Artificial Intelligence Review, 56(10):10733–10811.
Al Saidat, M. R., Yerima, S. Y., and Shaalan, K. (2024). Advancements of SMS spam detection: A comprehensive survey of NLP and ML techniques. Procedia Computer Science, 244:248–259.
Alanezi, M. (2021). Phishing detection methods: A review. Technium.
Alawida, M., Omolara, A. E., Abiodun, O. I., and Al-Rajab, M. (2022). A deeper look into cybersecurity issues in the wake of Covid-19: A survey. Journal of King Saud University-Computer and Information Sciences, 34(10):8176–8206.
Bhattacharya, T., Veeramalla, S., and Tanniru, V. (2023). A survey on retrieving confidential data using phishing attack. In 2023 Congress in Computer Science, Computer Engineering, & Applied Computing (CSCE), pages 2528–2535. IEEE.
Castaño, F., Fernández, E. F., Alaiz-Rodríguez, R., and Alegre, E. (2023). Phikita: Phishing kit attacks dataset for phishing websites identification. IEEE Access, 11:40779–40789.
Cisco Systems, Inc. PhishTank. Available at [link]. Accessed May 2024.
Correia, P. H. B. and Pedrini, H. (2020). Detecção de domínios maliciosos baseada em técnicas de aprendizado de máquina. Undergraduate thesis (Bachelor's degree in Computer Science). Universidade Estadual de Campinas.
Fajar, A., Yazid, S., and Budi, I. (2024). Enhancing phishing detection through feature importance analysis and explainable AI: A comparative study of CatBoost, XGBoost, and EBM models. arXiv.
Guarizi, B. D. and Mascarenhas, D. M. (2024). Identificação de ataques de phishing através de machine learning. In Anais Estendidos do XXIV Simpósio Brasileiro em Segurança da Informação e de Sistemas Computacionais. SBC.
Guarizi, B. D. and Mascarenhas, D. M. (2025). Phishing guardian. Available at [link]. Accessed July 2025.
Guo, Y. (2023). A review of machine learning-based zero-day attack detection: Challenges and future directions. Computer communications, 198:175–185.
Kamal, A. H. A., Yen, C. C. Y., Ping, M. H., and Zahra, F. (2020). Cybersecurity issues and challenges during Covid-19 pandemic. Preprints.org.
Kim, J., Kim, J., Wi, S., Kim, Y., and Son, S. (2022). Hearmeout: detecting voice phishing activities in Android. In Proceedings of the 20th Annual International Conference on Mobile Systems, Applications and Services, pages 422–435.
Kyriazoglou, J. (2024). Information security and breach definitions and obligations. In Information Security Incident and Data Breach Management: A Step-by-Step Approach, pages 1–14. Springer.
Mohammad, R. M., Thabtah, F., and McCluskey, L. (2012). An assessment of features related to phishing websites using an automated technique. In 2012 international conference for internet technology and secured transactions, pages 492–497. IEEE.
OpenPhish. OpenPhish. Available at [link]. Accessed May 2024.
Pranggono, B. and Arabo, A. (2021). Covid-19 pandemic cybersecurity issues. Internet Technology Letters, 4(2):e247.
Sadaf, K. (2023). Phishing website detection using XGBoost and Catboost classifiers. In 2023 International Conference on Smart Computing and Application (ICSCA). IEEE.
Safi, A. and Singh, S. (2023). A systematic literature review on phishing website detection techniques. In Journal of King Saud University-Computer and Information Sciences, pages 590–611.
Salloum, S., Gaber, T., Vadera, S., and Shaalan, K. (2022). A systematic literature review on phishing email detection using natural language processing techniques. IEEE Access, 10:65703–65727.
Sharevski, F., Devine, A., Pieroni, E., and Jachim, P. (2022). Phishing with malicious QR codes. In Proceedings of the 2022 European Symposium on Usable Security, pages 160–171.
Sheng, S., Wardman, B., Warner, G., Cranor, L., Hong, J., and Zhang, C. (2009). An empirical analysis of phishing blacklists.
Singh, T., Kumar, M., and Kumar, S. (2024). Walkthrough phishing detection techniques. Computers and Electrical Engineering, 118:109374.
Souza, J. A. and Mascarenhas, D. M. (2023). Detecção de ataques de phishing em tempo real utilizando algoritmos de aprendizado de máquina. In Anais Estendidos do XXIII Simpósio Brasileiro em Segurança da Informação e de Sistemas Computacionais. SBC.
Sullivan, B. Phishing costs have tripled since 2015. Available at [link]. Accessed January 2025.
Published
2025-09-01
How to Cite
GUARIZI, Bianca Domingos; MASCARENHAS, Dalbert Matos; MORAES, Igor Monteiro. Phishing Guardian: Detection of Phishing Sites Using Machine Learning. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 693-709. DOI: https://doi.org/10.5753/sbseg.2025.11491.
