A Multi-Dimensional Comparative Study of Generative Adversarial Networks, Diffusion Models, and Statistical Methods for Synthetic Health Data Generation

  • Oluwatoyin Joy Omole UFMG
  • Celso França UFMG
  • Samuel N. Alves UFMG
  • Regina Bernal UFMG
  • Deborah Malta UFMG
  • Marcos André Gonçalves UFMG
  • Jussara M. Almeida UFMG

Abstract


Synthetic data is increasingly important in privacy-sensitive or data-scarce domains such as healthcare, where confidentiality requirements or limited availability constrain access to real data. This study compares three approaches to generating synthetic tabular health data: a statistical method (Gaussian Copula), an adversarial deep learning model (CTGAN), and an implementation of a diffusion-based technique called SimpleTableDiffusion. These methods represent three modeling paradigms (statistical, adversarial, and stochastic), each offering different trade-offs in interpretability, flexibility, training stability, and privacy. We assess their performance along three dimensions: (i) statistical fidelity, that is, how well they replicate real data distributions; (ii) utility, measured by the effectiveness of classifiers trained on synthetic data and evaluated on real data; and (iii) privacy preservation, using a disclosure risk metric that estimates the likelihood of sensitive attribute inference. Our results show that CTGAN achieves the best overall performance, leading in utility, privacy, and marginal distribution quality. The Gaussian Copula excels at modeling conditional dependencies but lags in predictive tasks. The diffusion-based model performs competitively across metrics but falls short of the other generative models. This work establishes a unified benchmark and points towards hybrid approaches that leverage complementary strengths.
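The utility dimension described above (train on synthetic data, evaluate on real data) can be illustrated with a minimal, standard-library-only sketch. Everything in it is an assumption made for illustration: the toy data, the per-class Gaussian generator, and the nearest-centroid classifier are hypothetical stand-ins and are not the generators, classifiers, or datasets used in the study.

```python
import random
from statistics import mean, stdev


def make_real(n, rng):
    """Toy 'real' data (assumed): two features, binary label, shifted class means."""
    rows = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(2.0 * y, 1.0), rng.gauss(-1.5 * y, 1.0)]
        rows.append((x, y))
    return rows


def fit_toy_generator(rows):
    """Placeholder generator: per-class, per-feature Gaussian mean and std.

    This stands in for a real synthesizer such as a Gaussian Copula or CTGAN.
    """
    params = {}
    for label in {y for _, y in rows}:
        columns = zip(*[x for x, y in rows if y == label])
        params[label] = [(mean(c), stdev(c)) for c in columns]
    return params


def sample_synthetic(params, n, rng):
    """Draw n synthetic rows; labels are sampled uniformly for simplicity."""
    labels = sorted(params)
    rows = []
    for _ in range(n):
        y = rng.choice(labels)
        rows.append(([rng.gauss(m, s) for m, s in params[y]], y))
    return rows


def fit_centroids(rows):
    """Nearest-centroid classifier: one mean vector per class."""
    return {
        label: [mean(c) for c in zip(*[x for x, y in rows if y == label])]
        for label in {y for _, y in rows}
    }


def predict(centroids, x):
    """Return the label whose centroid is closest in squared distance."""
    return min(
        centroids,
        key=lambda lab: sum((a - b) ** 2 for a, b in zip(x, centroids[lab])),
    )


def accuracy(centroids, rows):
    return sum(predict(centroids, x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
real_train = make_real(500, rng)
real_test = make_real(500, rng)  # held-out real data, never seen by the generator

synthetic = sample_synthetic(fit_toy_generator(real_train), 500, rng)

# TSTR: train on synthetic, test on real. TRTR: the real-data baseline.
tstr = accuracy(fit_centroids(synthetic), real_test)
trtr = accuracy(fit_centroids(real_train), real_test)
print(f"TSTR accuracy: {tstr:.3f}  TRTR baseline: {trtr:.3f}")
```

A small gap between the two scores suggests the synthetic data preserves the signal the classifier needs; a large gap suggests the generator lost it.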
Published
29/09/2025
OMOLE, Oluwatoyin Joy; FRANÇA, Celso; ALVES, Samuel N.; BERNAL, Regina; MALTA, Deborah; GONÇALVES, Marcos André; ALMEIDA, Jussara M. A Multi-Dimensional Comparative Study of Generative Adversarial Networks, Diffusion Models, and Statistical Methods for Synthetic Health Data Generation. In: BRAZILIAN CONFERENCE ON INTELLIGENT SYSTEMS (BRACIS), 35., 2025, Fortaleza/CE. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 19-34. ISSN 2643-6264.