MalSynGen: artificial neural networks for generating synthetic tabular data for malware detection
Abstract
MalSynGen is a tool that uses artificial neural networks to generate synthetic tabular data for the Android malware domain. To evaluate its performance, two datasets were augmented and the results were assessed with statistical fidelity and utility metrics. The results indicate that MalSynGen can capture representative patterns for tabular data augmentation.
References
Amin, M. et al. (2022). Android malware detection through generative adversarial networks. TETT, 33(2).
Brown, A., Gupta, M., and Abdelsalam, M. (2024). Automated machine learning for deep learning based malware detection. Computers & Security, 137:103582.
Canbek, G., Taskaya Temizel, T., and Sagiroglu, S. (2021). BenchMetrics: A systematic benchmarking method for binary classification performance metrics. NCA, 33(21).
Casola, K. et al. (2023). DroidAugmentor: uma ferramenta de treinamento e avaliação de cGANs para geração de dados sintéticos. In SBSeg.
Choi, E. et al. (2017). Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference, pages 286–305.
Esteban, C., Hyland, S. L., and Rätsch, G. (2017). Real-valued (medical) time series generation with recurrent conditional GANs. arXiv preprint arXiv:1706.02633.
Kouliaridis, V. and Kambourakis, G. (2021). A comprehensive survey on machine learning techniques for Android malware detection. Information, 12(5):185.
Li, J., He, J., Li, W., Fang, W., Yang, G., and Li, T. (2024). SynDroid: An adaptive enhanced Android malware classification method based on CTGAN-SVM. Computers & Security, 137:103604.
Mimura, M. (2020). Using fake text vectors to improve the sensitivity of minority class for macro malware detection. JISA, 54:102600.
Nogueira, A. et al. (2024). MalSynGen. [link].
Park, N. et al. (2018). Data synthesis based on Generative Adversarial Networks. arXiv preprint arXiv:1806.03384.
Paullada, A. et al. (2021). Data and its (dis)contents: A survey of dataset development and use in machine learning research. Patterns, 2(11).
Platzer, M. and Reutterer, T. (2021). Holdout-Based Empirical Assessment of Mixed-Type Synthetic Data. Frontiers in Big Data.
Rainio, O., Teuho, J., and Klén, R. (2024). Evaluation metrics and statistical tests for machine learning. Scientific Reports, 14(1):6086.
Rajabi, A. and Garibay, O. O. (2022). TabFairGAN: Fair Tabular Data Generation with Generative Adversarial Networks. Machine Learning and Knowledge Extraction, 4(2):488.
Rocha, V. et al. (2023). AMGenerator e AMExplorer: Geração de metadados e construção de datasets Android. In Anais Estendidos do XXIII SBSeg. SBC.
Wilcoxon, F. (1945). Individual comparisons by ranking methods. Biometrics Bulletin, 1(6):80–83.
Xu, L., Skoularidou, M., Cuesta-Infante, A., and Veeramachaneni, K. (2019). Modeling Tabular Data Using Conditional GAN. Advances in Neural Information Processing Systems, 32.
Xu, L. and Veeramachaneni, K. (2018). Synthesizing Tabular Data Using Generative Adversarial Networks. arXiv preprint arXiv:1811.11264.
Published
September 16, 2024
How to Cite
NOGUEIRA, Angelo Gaspar Diniz; PAIM, Kayua Oleques; BRAGANÇA, Hendrio; MANSILHA, Rodrigo; KREUTZ, Diego. MalSynGen: redes neurais artificiais na geração de dados tabulares sintéticos para detecção de malware. In: SALÃO DE FERRAMENTAS - SIMPÓSIO BRASILEIRO DE SEGURANÇA DA INFORMAÇÃO E DE SISTEMAS COMPUTACIONAIS (SBSEG), 24., 2024, São José dos Campos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 129-136. DOI: https://doi.org/10.5753/sbseg_estendido.2024.243359.