Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen
Abstract
Evaluating the quality of synthetic data remains a persistent challenge in the Android malware domain because existing metrics are unstable and lack standardization. This work integrates a Super-Metric into MalDataGen that aggregates eight metrics across four fidelity dimensions into a single weighted score. Experiments with ten generative models and five balanced datasets show that the Super-Metric is more stable and consistent than traditional metrics and correlates more strongly with the actual performance of classifiers.
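As a rough illustration of what aggregating metrics across dimensions into a single weighted score can look like, the sketch below averages the metrics within each dimension and then takes a weighted average across dimensions. This is not the Super-Metric as implemented in MalDataGen: the function name `super_metric`, the placeholder dimension names, the two-metrics-per-dimension split, the equal weights, and the assumption that every metric is already scaled to [0, 1] with higher meaning better are all hypothetical choices made for the example.

```python
from statistics import fmean


def super_metric(scores, weights):
    """Weighted average of per-dimension mean scores (illustrative only).

    scores:  dimension name -> list of metric values, assumed to be
             pre-normalized to [0, 1] with higher = better.
    weights: dimension name -> non-negative weight for that dimension.
    """
    if scores.keys() != weights.keys():
        raise ValueError("scores and weights must cover the same dimensions")
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Average the metrics inside each dimension, then combine the
    # dimension means with a weighted average.
    return sum(weights[d] * fmean(scores[d]) for d in scores) / total


# Hypothetical values: four placeholder dimensions, two metrics each.
example_scores = {
    "dimension_a": [0.90, 0.80],
    "dimension_b": [0.70, 0.60],
    "dimension_c": [0.95, 0.85],
    "dimension_d": [0.50, 0.70],
}
example_weights = {d: 0.25 for d in example_scores}

score = super_metric(example_scores, example_weights)
print(round(score, 4))
```

Dividing by the total weight means the weights do not have to sum to one, so a single dimension can be emphasized without rescaling the others.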
Published
December 8, 2025
How to Cite
SILVA, Anna Luiza Gomes da; KREUTZ, Diego; DINIZ, Angelo; MANSILHA, Rodrigo; FONSECA, Celso Nobre da. Reducing Instability in Synthetic Data Evaluation with a Super-Metric in MalDataGen. In: ESCOLA REGIONAL DE REDES DE COMPUTADORES (ERRC), 22., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 137-143. DOI: https://doi.org/10.5753/errc.2025.17810.