Data Sampling Optimization for Improved Classification of Imbalanced Phishing Datasets

  • José Maurício Silva CESAR School
  • Carlo Marcelo R. Silva UPE
  • Mateus L. S. D. Barros UFRPE
  • João Guilherme Miranda UFRPE
  • Márcio P. Basgalupp UNIFESP
  • Péricles B. C. Miranda UFRPE

Abstract


Phishing is a social engineering attack that captures information by impersonating trusted entities. To detect it, researchers commonly frame the problem as a Machine Learning classification task. However, phishing datasets are often imbalanced due to Concept Drift and the semantic nature of the attacks. Oversampling, undersampling, and hybrid techniques address this imbalance, with hybrid techniques combining both strategies to achieve better results. This study examines the impact of optimization-based sequencing of sampling algorithms on phishing data and compares it to traditional methods. The results show that optimized sequences improve classifier performance and reduce the effects of class imbalance.
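To make the idea of a sampling sequence concrete, the following is a minimal sketch, not the authors' implementation. It chains a random oversampling step and a random undersampling step on a toy binary dataset. The helper names (`random_oversample`, `random_undersample`, `apply_sequence`), the ratios, and the toy data are all assumptions made for illustration. The study searches for such sequences with an optimizer, whereas here the order is fixed by hand.

```python
import random
from collections import Counter


def _split_classes(y):
    """Return (minority_label, majority_label); assumes exactly two classes."""
    (maj, _), (mino, _) = Counter(y).most_common(2)
    return mino, maj


def random_oversample(X, y, rng, ratio=0.5):
    """Duplicate random minority rows until minority/majority reaches ratio."""
    mino, _ = _split_classes(y)
    minority_idx = [i for i, label in enumerate(y) if label == mino]
    n_major = len(y) - len(minority_idx)
    target = int(ratio * n_major)
    n_extra = max(0, target - len(minority_idx))
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(y))) + extra
    return [X[i] for i in keep], [y[i] for i in keep]


def random_undersample(X, y, rng, ratio=1.0):
    """Drop random majority rows until minority/majority reaches ratio."""
    mino, maj = _split_classes(y)
    minority_idx = [i for i, label in enumerate(y) if label == mino]
    majority_idx = [i for i, label in enumerate(y) if label == maj]
    target = min(len(majority_idx), int(len(minority_idx) / ratio))
    keep = sorted(minority_idx + rng.sample(majority_idx, target))
    return [X[i] for i in keep], [y[i] for i in keep]


def apply_sequence(X, y, steps, seed=0):
    """Apply each sampling step in order, feeding the output to the next."""
    rng = random.Random(seed)
    for step in steps:
        X, y = step(X, y, rng)
    return X, y


# Hypothetical toy data: 90 legitimate (0) and 10 phishing (1) samples.
X = [[i] for i in range(100)]
y = [0] * 90 + [1] * 10

# A hand-picked hybrid sequence: oversample first, then undersample.
X_res, y_res = apply_sequence(X, y, [random_oversample, random_undersample])
counts = Counter(y_res)
print(counts)
```

In this sketch the order of the steps matters: oversampling first raises the minority class to half the majority size, so the later undersampling step discards fewer legitimate samples than it would on the raw data. An optimization-based approach would treat the choice and order of such steps as the search space.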

Published
2025-09-01
SILVA, José Maurício; SILVA, Carlo Marcelo R.; BARROS, Mateus L. S. D.; MIRANDA, João Guilherme; BASGALUPP, Márcio P.; MIRANDA, Péricles B. C. Data Sampling Optimization for Improved Classification of Imbalanced Phishing Datasets. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1011-1018. DOI: https://doi.org/10.5753/sbseg.2025.8030.
