Does Asm2Vec Reduce Drift on Malware Classification?

Rafael Rocha; Stefano de Rosa; Paolo Castagno; Idilio Drago; Lourenço Alves Pereira Junior

doi:10.5753/sbseg.2023.233605

Rafael Rocha (ITA)
Stefano de Rosa (Università di Torino)
Paolo Castagno (Università di Torino)
Idilio Drago (Università di Torino)
Lourenço Alves Pereira Junior (ITA)

Abstract

Asm2Vec is an algorithm that learns representations of binary files using word-embedding techniques. Researchers have applied it to binary analysis as well as to malware classification. However, malware classification is known to be heavily affected by drift: models built to classify malware become obsolete over time. In this paper, we therefore investigate whether representation learning approaches such as Asm2Vec help reduce the impact of drift on malware classification. To answer this question, we design an experiment using two public malware datasets and train classical machine learning models with (i) static features extracted from malware headers and (ii) features obtained using Asm2Vec. Our results show little difference between the two feature sets with respect to the effect of drift, and classifiers trained with Asm2Vec features achieve worse classification performance. As a contribution, we provide initial insights into the effects of representation learning on drift in malware classification.
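As a rough illustration of how drift can be measured in an experiment of this kind, the sketch below trains on the earliest time period and scores each later period separately. It is not the authors' pipeline: the nearest-centroid classifier stands in for any classical model, and the function names, period indices, and toy feature vectors are all hypothetical.

```python
from collections import defaultdict
from math import dist


def fit_centroids(rows):
    """Nearest-centroid 'training': mean feature vector per label."""
    grouped = defaultdict(list)
    for features, label in rows:
        grouped[label].append(features)
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest to the sample."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


def accuracy_by_period(samples, train_period):
    """Train on one period, then score every later period on its own."""
    train = [(f, y) for p, f, y in samples if p == train_period]
    centroids = fit_centroids(train)
    hits, totals = defaultdict(int), defaultdict(int)
    for period, features, label in samples:
        if period > train_period:
            totals[period] += 1
            hits[period] += predict(centroids, features) == label
    return {p: hits[p] / totals[p] for p in sorted(totals)}


# Hypothetical toy data: (period, feature vector, label).
samples = [
    (0, [0.0, 0.0], "benign"), (0, [0.2, 0.1], "benign"),
    (0, [1.0, 1.0], "malware"), (0, [0.9, 1.1], "malware"),
    (1, [0.1, 0.1], "benign"), (1, [0.8, 0.9], "malware"),
    (2, [0.1, 0.0], "benign"), (2, [0.3, 0.2], "malware"),  # drifted malware
]

scores = accuracy_by_period(samples, train_period=0)
print(scores)  # accuracy per test period; a falling trend indicates drift
```

To compare two representations, the same function would be run once per feature set (for example, header-based vectors and embedding-based vectors for the same samples) and the per-period accuracy curves compared side by side.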
