Synthetic tabular data generation for Android malware detection: a case study
Abstract
We present a study on the use of conditional generative adversarial networks (cGANs) to expand Android malware datasets. After an empirical exploration of hyperparameters, we apply our cGAN to expand four datasets and evaluate the results in terms of statistical fidelity metrics and utility for machine learning (ML) algorithms that classify Android applications. The results confirm the importance of proper hyperparameter tuning, as well as the ability of cGANs to synthesize faithful and useful datasets.
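To make the two evaluation axes concrete, the following is a minimal sketch of what a fidelity check and a utility check might look like on binary feature vectors. It is not the evaluation pipeline used in this study: the function names, the marginal-frequency gap, the 1-nearest-neighbor classifier, and the toy data are all assumptions introduced here for illustration only.

```python
# Hypothetical illustration only: names, metrics, and data are assumptions,
# not the metrics or classifiers used in the study.
from typing import List, Sequence

Row = Sequence[int]  # one app as a binary feature vector (e.g. permissions)


def feature_frequencies(rows: Sequence[Row]) -> List[float]:
    """Fraction of rows in which each feature is set to 1."""
    n = len(rows)
    return [sum(column) / n for column in zip(*rows)]


def fidelity_gap(real: Sequence[Row], synthetic: Sequence[Row]) -> float:
    """Mean absolute difference between per-feature frequencies.

    0.0 means the two datasets have identical marginal frequencies.
    """
    real_freq = feature_frequencies(real)
    syn_freq = feature_frequencies(synthetic)
    return sum(abs(a - b) for a, b in zip(real_freq, syn_freq)) / len(real_freq)


def predict_1nn(train_x: Sequence[Row], train_y: Sequence[int], sample: Row) -> int:
    """Label of the training row closest to `sample` in Hamming distance."""
    best = min(
        range(len(train_x)),
        key=lambda i: sum(a != b for a, b in zip(train_x[i], sample)),
    )
    return train_y[best]


def utility_accuracy(
    syn_x: Sequence[Row], syn_y: Sequence[int],
    real_x: Sequence[Row], real_y: Sequence[int],
) -> float:
    """Train on synthetic data, test on real data; return accuracy."""
    hits = sum(predict_1nn(syn_x, syn_y, x) == y for x, y in zip(real_x, real_y))
    return hits / len(real_x)


# Toy data (made up): label 1 = malware, 0 = benign.
real_x = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 1]]
real_y = [1, 1, 0, 0]
syn_x = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [0, 0, 1]]
syn_y = [1, 1, 0, 0]

print("fidelity gap:", fidelity_gap(real_x, syn_x))
print("utility accuracy:", utility_accuracy(syn_x, syn_y, real_x, real_y))
```

A low fidelity gap alone does not guarantee usefulness, which is why the sketch also trains on synthetic rows and scores on real ones; a real evaluation would use richer statistics and stronger classifiers.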