SAGAD: Synthetic Data Generator for Tabular Datasets
Abstract
The accuracy of machine learning models for classification tasks depends strongly on the quality of the training dataset. This is a challenge in domains where data is scarce, such as personalized medicine, or imbalanced, as with images of plant species, where some species have very few samples while others have many. In both scenarios, the resulting models tend to perform poorly. In this paper we present two techniques to address this challenge. First, we present a data augmentation method called SAGAD, based on conditional entropy. SAGAD can balance minority classes while also increasing the overall size of the training set. In our experiments, applying SAGAD to small-data problems with different machine learning algorithms yielded significant improvements in performance. We additionally present an extension of SAGAD for iterative learning algorithms, called DABEL, which generates new samples at each epoch using an optimization approach that continuously improves the model’s performance. The adoption of SAGAD and DABEL consistently extends the training dataset towards improved target classification performance.