Does Asm2Vec Reduce Drift on Malware Classification?

  • Rafael Rocha ITA
  • Stefano de Rosa Università di Torino
  • Paolo Castagno Università di Torino
  • Idilio Drago Università di Torino
  • Lourenço Alves Pereira Junior ITA

Abstract


Asm2Vec is an algorithm that learns representations of binary files using word embedding techniques. Researchers have applied it to binary analysis as well as to malware classification. Malware classification, however, is known to be strongly affected by drift: models built to identify a particular malware family quickly become obsolete. We ask whether representation learning approaches such as Asm2Vec help reduce the impact of drift in malware classification. To answer this question, we design an experiment using two public malware datasets and train classic machine learning models with (i) static features extracted from malware headers and (ii) features obtained using Asm2Vec. Our results show that the two feature sets differ little in how strongly drift affects them, and that the classifiers trained with Asm2Vec features achieve worse classification performance. We provide initial insights into the effects of representation learning on drift in malware classification.
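A time-aware evaluation of the kind the abstract describes can be sketched as follows: fit a classifier on the earliest period only, then score it on each later period and watch how accuracy evolves. This is a minimal illustration and not the authors' pipeline. The data is synthetic, the drift rate is invented, and a nearest-centroid classifier stands in for the classic machine learning models; in the real experiment the feature vectors would be either header features or Asm2Vec embeddings. All function names are hypothetical.

```python
import random
from statistics import mean


def make_month(month, n=200, shift_per_month=0.4, seed=0):
    """Synthetic two-class samples for one month (assumption: not real malware data).

    Class 0 stays centered at 0; class 1 starts at 3 and moves toward
    class 0 by `shift_per_month` each month, which simulates drift.
    """
    rng = random.Random(seed + month)
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 0.0 if label == 0 else 3.0 - shift_per_month * month
        features = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        samples.append((features, label))
    return samples


def fit_centroids(samples):
    """Nearest-centroid 'training': one mean feature vector per class."""
    centroids = {}
    for label in {y for _, y in samples}:
        rows = [x for x, y in samples if y == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the label whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def accuracy(centroids, samples):
    """Fraction of samples classified correctly."""
    return mean(1.0 if predict(centroids, x) == y else 0.0 for x, y in samples)


def drift_curve(n_months=6):
    """Train on month 0 only, then report accuracy on months 1..n_months."""
    model = fit_centroids(make_month(0))
    return [accuracy(model, make_month(m)) for m in range(1, n_months + 1)]


if __name__ == "__main__":
    for month, acc in enumerate(drift_curve(), start=1):
        print(f"month {month}: accuracy {acc:.2f}")
```

Running the same loop once per feature set and comparing the two curves is one way to ask whether a representation slows the decay; the model is deliberately never retrained, so any drop over time reflects drift alone.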

Published
2023-09-18
ROCHA, Rafael; ROSA, Stefano de; CASTAGNO, Paolo; DRAGO, Idilio; PEREIRA JUNIOR, Lourenço Alves. Does Asm2Vec Reduce Drift on Malware Classification?. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 23., 2023, Juiz de Fora/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 195-208. DOI: https://doi.org/10.5753/sbseg.2023.233605.
