Hy-Synergy: synthetic generation of tabular health data guided by local diagnosis

  • Mauro Henrique Lima de Boni IFTO / UFG
  • Iwens Gervásio Sene Junior UFG
  • Ronaldo Martins da Costa UFG

Abstract


Imbalanced and sensitive clinical tabular data make it difficult to train robust predictive models. This work presents Hy-Synergy, a modular data augmentation framework built on four stages: topological diagnosis, cluster discovery, a local decision policy, and routed synthetic generation. Evaluated on the Pima Indians Diabetes and Adult Census Income datasets, the method achieved the best balance among fidelity, utility, and privacy in both domains. On Pima it reached a classifier two-sample test (C2ST) score of 0.5478, a Jensen-Shannon divergence (JSD) of 0.0312, a ROC AUC of 0.8270, and a membership inference attack (MIA) score of 0.5242. The results suggest that interventions guided by local structure are promising for expanding small, sensitive clinical datasets.
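As an illustration of the two fidelity metrics named above, the following is a minimal sketch and not the paper's evaluation protocol. The abstract does not state which classifier, binning, or aggregation the authors used, so the leave-one-out 1-NN classifier, the equal-width bins, and the function names `jsd`, `column_jsd`, and `c2st_1nn` are all assumptions made for this example. In general, a C2ST accuracy near 0.5 means a classifier cannot tell real rows from synthetic ones, and a base-2 JSD lies in [0, 1], with 0 meaning identical distributions.

```python
import math


def jsd(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    def kl(a, b):
        # Terms with a == 0 contribute nothing; b > 0 wherever a > 0 here.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def column_jsd(real, synth, bins=10):
    """JSD between one real and one synthetic numeric column.

    Assumption: equal-width bins over the pooled range of both columns.
    """
    lo = min(min(real), min(synth))
    hi = max(max(real), max(synth))
    if hi == lo:
        return 0.0  # both columns are the same constant
    width = (hi - lo) / bins

    def hist(values):
        counts = [0] * bins
        for v in values:
            idx = min(int((v - lo) / width), bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    return jsd(hist(real), hist(synth))


def c2st_1nn(real, synth):
    """Leave-one-out 1-NN accuracy at separating real from synthetic rows.

    Assumption: rows are equal-length numeric tuples on comparable scales.
    """
    rows = [(r, 0) for r in real] + [(s, 1) for s in synth]
    correct = 0
    for i, (x, label) in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: math.dist(x, rows[j][0]),
        )
        correct += rows[nearest][1] == label
    return correct / len(rows)
```

For mixed-type tables such as Adult Census Income, categorical columns would need encoding before a distance-based classifier is meaningful; that step is omitted here.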

Published
1 June 2026
BONI, Mauro Henrique Lima de; SENE JUNIOR, Iwens Gervásio; COSTA, Ronaldo Martins da. Hy-Synergy: geração sintética de dados tabulares em saúde guiada por diagnóstico local. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 26., 2026, Ouro Preto/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2026. p. 1122-1133. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2026.21649.