Function Classification in Malware Reverse Engineering: A Comparative Analysis of Machine Learning Strategies
Abstract
Malware analysis is a fundamental activity in cybersecurity, yet it is complex and error-prone. One of the main challenges for security analysts is distinguishing genuinely malicious code from legitimate library code within the pseudocode extracted from executable binaries. This paper evaluates two machine learning approaches to support this classification: one based on code embeddings and another based on software metrics. The syntax-based analysis via embeddings achieved 97% accuracy, outperforming the metrics-based approach, which was faster but less accurate. The results demonstrate the potential of embedding-based analysis to support the identification of malicious code, contributing to software reverse engineering.
