Function Classification in Malware Reverse Engineering: A Comparative Analysis of Machine Learning Strategies

Vitor Reis (UFSJ); Elverton Fazzion (UFSJ); Elder Cirilo (UFSJ)

doi:10.5753/vem.2025.14618

Abstract

Malware analysis is a fundamental activity in cybersecurity, but it is complex and error-prone. One of the main challenges for security analysts is distinguishing genuinely malicious code from legitimate library code in the pseudocode extracted from executable binaries. This paper evaluates two machine learning strategies to support this classification: one based on code embeddings and the other on software metrics. The embedding-based syntactic analysis reached 97% accuracy, outperforming the metrics-based approach, which was faster but less accurate. The results demonstrate the potential of embedding analysis to support the identification of malicious code, contributing to software reverse engineering.
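To make the metrics-based strategy concrete, the following is a minimal sketch of how a decompiled function could be reduced to a small numeric feature vector. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not list the metrics used, so the three features here (non-blank lines, a keyword-based cyclomatic estimate, and call sites) and the names `function_metrics` and `SAMPLE` are hypothetical.

```python
import re

# Assumption: C-like pseudocode as produced by a decompiler.
# These keyword sets are illustrative, not taken from the paper.
BRANCH_KEYWORDS = {"if", "for", "while", "case"}
NON_CALL_KEYWORDS = BRANCH_KEYWORDS | {"switch", "return", "sizeof"}


def function_metrics(pseudocode: str) -> dict:
    """Compute a few simple, hypothetical software metrics for one function."""
    # Count non-blank lines as a rough size measure.
    lines = [line for line in pseudocode.splitlines() if line.strip()]

    # Tokenize identifiers and keywords, then count branching keywords.
    identifiers = re.findall(r"[A-Za-z_]\w*", pseudocode)
    branches = sum(1 for token in identifiers if token in BRANCH_KEYWORDS)

    # Look for call sites only in the body (text after the first "{"),
    # so the function's own header is not counted as a call.
    body = pseudocode.partition("{")[2]
    callees = [
        name
        for name in re.findall(r"([A-Za-z_]\w*)\s*\(", body)
        if name not in NON_CALL_KEYWORDS
    ]

    return {
        "loc": len(lines),
        "cyclomatic_estimate": branches + 1,
        "call_sites": len(callees),
    }


# Hypothetical decompiler output used only to exercise the sketch.
SAMPLE = """int sub_401000(char *buf, int n) {
    int i;
    for (i = 0; i < n; i++) {
        if (buf[i] == 0)
            return i;
        buf[i] = decode(buf[i]);
    }
    return -1;
}"""

if __name__ == "__main__":
    print(function_metrics(SAMPLE))
```

A vector like this could be passed to any standard classifier to label a function as malicious or library code. Features this cheap are consistent with the speed advantage reported for the metrics approach, although the actual feature set and classifier are not specified here.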
