Use of TF-IDF in Data Comparison for Ransomware Detection

  • Augusto Parisot UFF
  • Lucila M. S. Bento UERJ
  • Raphael C. S. Machado UFF

Abstract


Ransomware attacks are among the most significant cyber threats facing users and organizations worldwide. This paper applies TF-IDF, a technique widely used in natural language processing, to data from dynamic analysis reports generated by the Cuckoo Sandbox. We compare several types of report data to determine which are most effective for detecting ransomware. The evaluation combines preprocessing methods with classic machine learning algorithms. The results indicate that Random Forest and SVM, applied to String data preprocessed with StandardScaler, achieved accuracies of up to 98%, making them the most effective of the approaches evaluated.
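As a rough illustration of the TF-IDF weighting mentioned in the abstract, the following minimal sketch scores terms across a few toy "reports". It is not the authors' implementation: the token lists are hypothetical stand-ins for strings extracted from sandbox reports, the function name `tf_idf` is invented for this example, and the formula (term frequency times `log(N / df)`) is one common variant that may differ from the weighting used in the paper.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, idf = log(N / df).
    """
    n_docs = len(documents)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical tokens standing in for strings found in sandbox reports.
reports = [
    ["CryptEncrypt", "DeleteFile", "CryptEncrypt", "README.txt"],
    ["CreateFile", "ReadFile", "README.txt"],
    ["CreateFile", "WriteFile", "README.txt"],
]

weights = tf_idf(reports)
```

In this toy data, a term that occurs in every report (`README.txt`) receives a weight of zero, while a term concentrated in a single report (`CryptEncrypt`) receives the highest weight. Vectors built this way could then be standardized and passed to a classifier such as Random Forest or SVM, in line with the pipeline the abstract describes.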

Published
2024-09-16
PARISOT, Augusto; BENTO, Lucila M. S.; MACHADO, Raphael C. S. Use of TF-IDF in Data Comparison for Ransomware Detection. In: BRAZILIAN SYMPOSIUM ON CYBERSECURITY (SBSEG), 24., 2024, São José dos Campos/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 678-693. DOI: https://doi.org/10.5753/sbseg.2024.240700.