A New Approach for Detecting Fake SMTP Headers Using Deep Learning and Synthetic Data Generation
Abstract
This work proposes a new approach for detecting anomalous e-mail headers, focusing on phishing, spam, and legitimate messages. A Multilayer Perceptron (MLP) is used for classification, and a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) is applied to generate synthetic data. The Gumbel-Softmax function simulates features from imbalanced datasets, and statistical tests evaluate the quality of the generated data. Ray Tune is used to optimize model hyperparameters. Results show that the proposed approach improves accuracy and generalization in e-mail header threat detection.
References
AbdulNabi, I. and Yaseen, Q. (2021). Spam email detection using deep learning techniques. Procedia Computer Science, 184:853–858. The 12th International Conference on Ambient Systems, Networks and Technologies (ANT) / The 4th International Conference on Emerging Data and Industry 4.0 (EDI40) / Affiliated Workshops.
Arjovsky, M., Chintala, S., and Bottou, L. (2017). Wasserstein GAN.
Beaman, C. and Isah, H. (2022). Anomaly detection in emails using machine learning and header information.
Bountakas, P., Koutroumpouchos, K., and Xenakis, C. (2021). A comparison of natural language processing and machine learning methods for phishing email detection. In Proceedings of the 16th International Conference on Availability, Reliability and Security, ARES ’21, New York, NY, USA. Association for Computing Machinery.
Cormack, G. V. and Lynam, T. R. (2005). TREC 2007 public corpus. Permission is granted for research use only. Publishing the corpus or any part of it is prohibited.
Dhanalakshmi, R., Vijayaraghavan, N., Kumar, A., and Prathiba, B. S. B. (2024). AI-based detection and analysis of phishing domains: Leveraging machine learning for enhanced cybersecurity. In 2024 International Conference on System, Computation, Automation and Networking (ICSCAN), pages 1–6. IEEE.
Franchina, L., Ferracci, S., and Palmaro, F. (2021). Detecting phishing e-mails using text mining and features analysis. In Italian Conference on Cybersecurity.
Greco, M., Chang, R., and Galdames, P. (2024). Educational phishing: An awareness campaign to learn how to detect phishing. In 2024 43rd International Conference of the Chilean Computer Science Society (SCCC), pages 1–5. IEEE.
Gretton, A., Borgwardt, K. M., Rasch, M. J., Schölkopf, B., and Smola, A. (2012). A kernel two-sample test. Journal of Machine Learning Research, 13(25):723–773.
Guan, S. (2023). Performance analysis of convolutional neural networks and multilayer perceptron in generative adversarial networks. In 2023 IEEE 3rd International Conference on Power, Electronics and Computer Applications (ICPECA), pages 817–821.
Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. (2017). Improved training of Wasserstein GANs.
Gupta, S., Pritwani, M., Shrivastava, A., Moharir, M., AR, A. K., et al. (2024). A comprehensive analysis of social engineering attacks: From phishing to prevention-tools, techniques and strategies. In 2024 Second International Conference on Intelligent Cyber Physical Systems and Internet of Things (ICoICI), pages 1–8. IEEE.
Wodder II, J. T. (2023). headerparser: argparse for mail-style headers. Python library.
Karim, A., Azam, S., Shanmugam, B., and Kannoorpatti, K. (2020). Efficient clustering of emails into spam and ham: The foundational study of a comprehensive unsupervised framework. IEEE Access, 8:154759–154788.
Kaushik, N., Rathore, T. S., and Kumar, P. (2024). Email traceback: Securing systems from phishing and malicious link prevention. In 2024 1st International Conference on Advances in Computing, Communication and Networking (ICAC2N), pages 647–652. IEEE.
Kulkarni, M., Kumar, S., Panjwani, Y., Moharir, M., Kumar, A. A., Baskaran, E., et al. (2024). Mitigating email phishing: analytical framework, simulation models, and preventive measures. In 2024 10th international conference on communication and signal processing (ICCSP), pages 1459–1464. IEEE.
Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J. E., and Stoica, I. (2018). Tune: A research platform for distributed model selection and training. arXiv preprint arXiv:1807.05118.
Lopez-Paz, D. and Oquab, M. (2018). Revisiting classifier two-sample tests.
Luo, E., Young, L., Ho, G., Afifi, M., Schweighauser, M., Katz-Bassett, E., and Cidon, A. (2025). Characterizing the networks sending enterprise phishing emails. In International Conference on Passive and Active Network Measurement, pages 437–466. Springer.
Maddison, C. J., Mnih, A., and Teh, Y. W. (2016). The concrete distribution: A continuous relaxation of discrete random variables. arXiv preprint arXiv:1611.00712.
Nazario, J. (2006). Phishingcorpus homepage. Retrieved June 2024.
Shahila, D. F. D., Rosi, A., Stephen, V., et al. (2024). Ai based phishing discrement for immense e-maildata. In 2024 7th International Conference on Circuit Power and Computing Technologies (ICCPCT), volume 1, pages 270–277. IEEE.
Wosah, P. N., Ali Mirza, Q., and Sayers, W. (2024). Analysing the email data using stylometric method and deep learning to mitigate phishing attack. International Journal of Information Technology, pages 1–12.
Yilmaz, I., Masum, R., and Siraj, A. (2020). Addressing imbalanced data problem with generative adversarial network for intrusion detection. In 2020 IEEE 21st International Conference on Information Reuse and Integration for Data Science (IRI), pages 25–30.
Zhou, T., Wu, H.-T., Lu, H., Xu, P., and Cheung, Y.-M. (2022). Password guessing based on GAN with Gumbel-Softmax. Security and Communication Networks, 2022(1):5670629.
Published
2025-09-01
How to Cite
TAVARES, Patrick M.; MASCARENHAS, Dalbert M. A New Approach for Detecting Fake SMTP Headers Using Deep Learning and Synthetic Data Generation. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 921-937. DOI: https://doi.org/10.5753/sbseg.2025.10418.
