Beyond Similarity: A Generalizable Super-Metric for Fidelity Evaluation in Synthetic Malware Data
Abstract
This work proposes a flexible super-metric for evaluating synthetic malware data, integrating eight key measures (such as RMSE, Jaccard, and Wasserstein) across four main dimensions: Distance, Correlation/Association, Feature Similarity, and Multivariate Distribution. The super-metric allows the weights of these dimensions to be adjusted dynamically, adapting to different scenarios and analysis goals. The proposal is not intended to replace existing metrics, but rather to offer an integrated and adaptable framework for multidimensional evaluation. Validated on four Android malware datasets using the TVAE model, the super-metric showed a strong correlation with the recall utility metric and significantly greater statistical stability than Cosine Similarity, with a standard deviation roughly three times smaller (0.026 vs. 0.083) and a narrower range (0.06 vs. 0.20) across the evaluated datasets. The results confirm the effectiveness of the super-metric in simultaneously evaluating local and global aspects of synthetic data, offering greater statistical robustness than conventional metrics.
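The weighted combination of the four dimensions described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the aggregation formula, so the sketch rests on assumptions: each dimension is already summarized as a score in [0, 1] where higher means better fidelity, and the super-metric is a normalized weighted average of those scores. The function name `super_metric`, the dimension keys, and the example scores are hypothetical placeholders, not values or definitions from the paper.

```python
from typing import Dict

# Dimension names follow the abstract; the keys themselves are hypothetical.
DIMENSIONS = (
    "distance",
    "correlation_association",
    "feature_similarity",
    "multivariate_distribution",
)


def super_metric(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Normalized weighted average of per-dimension fidelity scores.

    Assumption (not from the paper): every score is already scaled to
    [0, 1] with higher meaning closer to the real data, and weights are
    non-negative with at least one positive value.
    """
    for name in DIMENSIONS:
        if not 0.0 <= scores[name] <= 1.0:
            raise ValueError(f"score for {name!r} must lie in [0, 1]")
        if weights[name] < 0.0:
            raise ValueError(f"weight for {name!r} must be non-negative")

    total_weight = sum(weights[name] for name in DIMENSIONS)
    if total_weight <= 0.0:
        raise ValueError("at least one weight must be positive")

    weighted_sum = sum(weights[name] * scores[name] for name in DIMENSIONS)
    return weighted_sum / total_weight


# Illustrative placeholder scores, not results reported in the paper.
example_scores = {
    "distance": 0.8,
    "correlation_association": 0.6,
    "feature_similarity": 0.9,
    "multivariate_distribution": 0.7,
}

equal_weights = {name: 1.0 for name in DIMENSIONS}
# Emphasize the Distance dimension by giving it twice the weight.
distance_heavy = {**equal_weights, "distance": 2.0}

balanced = super_metric(example_scores, equal_weights)
emphasized = super_metric(example_scores, distance_heavy)
```

Normalizing by the sum of the weights keeps the result in [0, 1] however the weights are scaled, so a scenario-specific weighting changes only the relative emphasis of the dimensions.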
Published
2025-09-01
How to Cite
SILVA, Anna Luiza Gomes da; NOGUEIRA, Angelo Gaspar Diniz; KREUTZ, Diego; PAIM, Kayuã Oleques; MANSILHA, Rodrigo Brandão; FONSECA, Celso Nobre da. Beyond Similarity: A Generalizable Super-Metric for Fidelity Evaluation in Synthetic Malware Data. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 25., 2025, Foz do Iguaçu/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 955-962. DOI: https://doi.org/10.5753/sbseg.2025.11451.
