Techniques for Dealing with Imbalanced Data: A Systematic Literature Review

Leandro O. da Silva (USP); Daniela L. Freire (USP); Márcio P. Basgalupp (USP); André C. P. L. F. de Carvalho (USP)

DOI: https://doi.org/10.5753/stil.2025.37843

Abstract

This systematic literature review surveys techniques for handling imbalanced data. The analyzed articles explore strategies such as undersampling, oversampling, and their combinations to counter skewed class distributions. Metrics that are sensitive to class imbalance, including recall, precision, and F1 score, emerge as crucial in imbalanced contexts. The studies reveal how difficult it is to select an appropriate strategy and underscore the importance of adaptive approaches. Innovative solutions, such as adaptive combinations of techniques and integration with specific learning algorithms, are discussed. The review also highlights the ongoing need for research on the specific challenges of data imbalance.
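As an illustration of the ideas named in the abstract, the following is a minimal sketch in plain Python, not taken from any of the reviewed studies. It shows why accuracy can mislead on a skewed class distribution while precision, recall, and F1 expose the problem, and how naive random oversampling and undersampling rebalance the class counts. The function names (`precision_recall_f1`, `random_oversample`, `random_undersample`) and the toy data are assumptions made for this example only.

```python
import random
from collections import Counter


def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the given positive (minority) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 95 majority-class samples (0) and 5 minority-class samples (1).
y_true = [0] * 95 + [1] * 5
X = [[float(i)] for i in range(100)]

# A trivial classifier that always predicts the majority class.
y_pred = [0] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
metrics = precision_recall_f1(y_true, y_pred)

print("accuracy:", accuracy)
print("precision, recall, F1:", metrics)
print("after oversampling:", Counter(random_oversample(X, y_true)[1]))
print("after undersampling:", Counter(random_undersample(X, y_true)[1]))
```

The always-majority classifier scores high accuracy while never detecting the minority class, which is the situation that imbalance-sensitive metrics are meant to reveal. In practice, resampling of this kind would be applied to the training split only, so that evaluation still reflects the original class distribution.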
