Analysis and Classification of SPAM Emails Using Machine Learning
Abstract
This paper proposes the development of a Machine Learning model to classify e-mails as SPAM or HAM (not SPAM). Considering the growing relevance of unwanted e-mails in global data traffic, the research aims to develop a solution that optimizes computational resources and supports digital marketing companies. Using Python, the study applies Logistic Regression to analyze e-mail content and compares its performance with the well-established Naïve Bayes classifier. The study seeks to promote more sustainable digital marketing practices by reducing unwanted communications and preserving sender reputation.
References
Ali, A., Bin Faheem, Z., Waseem, M., Draz, U., Safdar, Z., Hussain, S., and Yaseen, S. (2020). Systematic review: A state of art ML based clustering algorithms for data mining. In 2020 IEEE 23rd International Multitopic Conference (INMIC), pages 1–6.
Almeida, T. and Hidalgo, J. (2012). SMS spam collection dataset. Last accessed: 10/02/2025.
Cappy, P. (2024). Email sending reputation: How does domain reputation work? Last accessed: 26/10/2024.
Dada, E. G., Bassi, J. S., Chiroma, H., Adetunmbi, A. O., Ajibuwa, O. E., et al. (2019). Machine learning for email spam filtering: review, approaches and open research problems. Heliyon, 5(6).
Dossetto, F. (2022). Domain reputation, explained. Last accessed: 03/01/2025.
Garnepudi, V. (2019). Spam mails dataset. Last accessed: 10/02/2025.
Goldschmidt, R. (2015). Data Mining. GEN LTC, Rio de Janeiro, RJ, BRA, 2nd edition.
Hastie, T., Tibshirani, R., and Friedman, J. (2001). The Elements of Statistical Learning. Springer Series in Statistics. Springer New York Inc., New York, NY, USA.
James, G., Witten, D., Hastie, T., and Tibshirani, R. (2013). An Introduction to Statistical Learning: with Applications in R. Springer.
Jayapandian, N. et al. (2023). Machine learning based spam e-mail detection using logistic regression algorithm. In 2023 IEEE International Conference on ICT in Business Industry & Government (ICTBIG), pages 1–6. IEEE.
Jurafsky, D. and Martin, J. H. (2009). Speech and Language Processing (2nd Edition). Prentice-Hall, Inc., USA.
Kosinski, M. (2024). What is phishing? Last accessed: 27/10/2024.
Kuchipudi, B., Nannapaneni, R. T., and Liao, Q. (2020). Adversarial machine learning for spam filters. In Proceedings of the 15th International Conference on Availability, Reliability and Security, pages 1–6.
Mariano, D. C. B., Marques, L. T., and Silva, M. S. (2021). Data Mining. SAGAH, Porto Alegre, RS, BRA.
Martins, J. S., Lenz, M. L., Silva, M. B. F. d., Oliveira, R. A. d., Pichetti, R. F., Mariano, D. C. B., Martins, J. V., Rodrigues, S. M. A. F., and Bezerra, W. R. (2020). Processamentos de linguagem natural. SAGAH, Porto Alegre.
Pronnus (2024). Segurança digital: Você sabe a diferença entre phishing e spoofing? Last accessed: 27/10/2024.
Sarica, S. and Luo, J. (2021). Stopwords in technical language processing. PLOS ONE, 16(8):1–13.
Sherwin, R. (2023). Report spam, misclassified, viral email messages. Last accessed: 27/10/2024.
Statista (2024a). Daily number of emails sent worldwide as of April 2024 by country. Last accessed: 26/10/2024.
Statista (2024b). Daily number of spam emails sent worldwide as of August 2024, by country. Last accessed: 26/10/2024.
Tan, P.-N., Steinbach, M., and Kumar, V. (2009). Introdução ao Data Mining - Mineração de Dados. Editora Ciência Moderna Ltda., Rio de Janeiro, RJ, 1st edition.
Yaseen, Q. et al. (2021). Spam email detection using deep learning techniques. Procedia Computer Science, 184:853–858.
Published
2025-04-23
How to Cite
BATISTELLA, João Vítor; VIEIRA, Andrws Aires. Analysis and Classification of SPAM Emails Using Machine Learning. In: REGIONAL DATABASE SCHOOL (ERBD), 20., 2025, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 30-39. ISSN 2595-413X. DOI: https://doi.org/10.5753/erbd.2025.6747.
