Effectiveness Analysis of Oversampling Techniques By The Lens of Item Response Theory

  • Fabrício E. Corrêa UFPA
  • Lucas F. F. Cardoso UFPA / ITV
  • Vitor C. A. Santos UFPA / ITV
  • Regiane S. Kawasaki Francês UFPA
  • Ronnie C. O. Alves ITV

Abstract


It is increasingly common for sectors of society to use Machine Learning (ML) techniques to make decisions based on the data they generate. One of the most common problems a dataset can present is class imbalance. Under these conditions, the tendency is to produce biased models that favor the majority class. To mitigate this problem, data balancing algorithms can be applied, among them oversampling techniques. However, it is not a simple task to determine whether an oversampling technique actually helps the model learning process. This work analyzes the effectiveness of oversampling techniques through the lens of Item Response Theory (IRT). The experiments carried out show that IRT is able to reveal the impact of oversampling even when classical performance metrics show no variation. Furthermore, the results point to the existence of an imbalance threshold at which oversampling techniques become more effective.
Keywords: Machine Learning, Oversampling, Item Response Theory
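
As a concrete illustration of the setting described in the abstract, the sketch below applies the SMOTE oversampler from the imbalanced-learn library (Lemaître et al., 2017; Chawla et al., 2002) to a synthetic imbalanced binary classification problem, and defines the three-parameter logistic (3PL) item characteristic curve that IRT uses to model the probability of a correct response. The dataset size, class weights, and IRT parameter values are illustrative assumptions, not the paper's experimental setup.

# Minimal sketch (not the paper's experimental setup): oversampling an
# imbalanced dataset with SMOTE and the 3PL item characteristic curve of IRT.
# Dataset size, class weights, and IRT parameters below are illustrative.
from collections import Counter

import numpy as np
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE  # Lemaître et al. (2017)

# Synthetic binary problem with roughly a 9:1 imbalance ratio (illustrative).
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.9, 0.1], random_state=42)
print("Before SMOTE:", Counter(y))

# SMOTE (Chawla et al., 2002) synthesizes new minority examples by
# interpolating between a minority instance and its nearest minority neighbors.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE: ", Counter(y_res))


def p_correct_3pl(theta, a, b, c):
    """3PL item characteristic curve: probability that a respondent with
    ability `theta` answers an item with discrimination `a`, difficulty `b`,
    and guessing parameter `c` correctly."""
    return c + (1.0 - c) / (1.0 + np.exp(-a * (theta - b)))


# In the IRT-for-ML framing, classifiers play the role of respondents and
# test instances the role of items (parameter values here are illustrative).
print(p_correct_3pl(theta=0.5, a=1.2, b=0.0, c=0.2))
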

References

Alabrah, A. (2023). An improved CCF detector to handle the problem of class imbalance with outlier normalization using IQR method. Sensors, 23(9):4406.

Araujo, E. A. C. d., Andrade, D. F. d., and Bortolotti, S. L. V. (2009). Teoria da resposta ao item. Revista da Escola de Enfermagem da USP, 43:1000–1008.

Barchilon, N. and Escovedo, T. (2021). Machine learning applied to the INSS benefit request. In XVII Brazilian Symposium on Information Systems, pages 1–8.

Cardoso, L. F., de S. Ribeiro, J., Santos, V. C. A., Silva, R. L., Mota, M. P., Prudêncio, R. B., and Alves, R. C. (2022). Explanation-by-example based on item response theory. In Brazilian Conference on Intelligent Systems, pages 283–297. Springer.

Cardoso, L. F., Santos, V. C., Francês, R. S. K., Prudêncio, R. B., and Alves, R. C. (2020). Decoding machine learning benchmarks. In Brazilian Conference on Intelligent Systems, pages 412–425. Springer.

Castro, C. L. d. and Braga, A. P. (2011). Aprendizado supervisionado com conjuntos de dados desbalanceados. Sba: Controle & Automação Sociedade Brasileira de Automatica, 22:441–466.

Chawla, N. V., Bowyer, K. W., Hall, L. O., and Kegelmeyer, W. P. (2002). SMOTE: synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321–357.

dos Santos, C. N. (2005). Aprendizado de máquina na identificação de sintagmas nominais: o caso do português brasileiro. PhD thesis, Instituto Militar de Engenharia.

Dua, D., Graff, C., et al. (2019). UCI machine learning repository, 2017. URL [link].

Fernández, A., Garcia, S., Herrera, F., and Chawla, N. V. (2018). SMOTE for learning from imbalanced data: progress and challenges, marking the 15-year anniversary. Journal of Artificial Intelligence Research, 61:863–905.

Han, H., Wang, W.-Y., and Mao, B.-H. (2005). Borderline-SMOTE: a new over-sampling method in imbalanced data sets learning. In International Conference on Intelligent Computing, pages 878–887. Springer.

He, H., Bai, Y., Garcia, E. A., and Li, S. (2008). ADASYN: adaptive synthetic sampling approach for imbalanced learning. In 2008 IEEE International Joint Conference on Neural Networks (IEEE World Congress on Computational Intelligence), pages 1322–1328. IEEE.

Lemaître, G., Nogueira, F., and Aridas, C. K. (2017). Imbalanced-learn: A python toolbox to tackle the curse of imbalanced datasets in machine learning. Journal of Machine Learning Research, 18(17):1–5.

Lima, J. L. P. et al. (2020). Adversarial oversampling: um método para balanceamento baseado em redes neurais adversárias. Master's thesis, Universidade Federal de Pernambuco.

Majumder, A., Dutta, S., Kumar, S., and Behera, L. (2020). A method for handling multi-class imbalanced data by geometry based information sampling and class prioritized synthetic data generation (GICaPS). arXiv preprint arXiv:2010.05155.

MEC (2012). Teoria de resposta ao item avalia habilidade e minimiza o “chute” de candidatos. [link].

Monard, M. C. and Baranauskas, J. A. (2003). Conceitos sobre aprendizado de máquina. Sistemas inteligentes-Fundamentos e aplicações, 1(1):32.

Nguyen, H. M., Cooper, E. W., and Kamei, K. (2011). Borderline over-sampling for imbalanced data classification. International Journal of Knowledge Engineering and Soft Data Paradigms, 3(1):4–21.

Vanschoren, J., Van Rijn, J. N., Bischl, B., and Torgo, L. (2014). OpenML: networked science in machine learning. ACM SIGKDD Explorations Newsletter, 15(2):49–60.
Published
17/11/2024
CORRÊA, Fabrício E.; CARDOSO, Lucas F. F.; SANTOS, Vitor C. A.; FRANCÊS, Regiane S. Kawasaki; ALVES, Ronnie C. O. Effectiveness Analysis of Oversampling Techniques By The Lens of Item Response Theory. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 21., 2024, Belém/PA. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 882-893. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2024.245221.