Enhancing Malware Family Classification in the Microsoft Challenge Dataset via Transfer Learning

Marcelo Invert Palma Salas; Paulo De Geus; Marcus Botacin

Marcelo Invert Palma Salas, UNICAMP
Paulo De Geus, UNICAMP
Marcus Botacin, Texas A&M University

Abstract

In recent years, malware developers have introduced new and advanced protection techniques to defeat conventional signature-based and heuristic-based malware analysis and thereby avoid detection and removal by conventional antivirus software. With the progress of deep learning, techniques such as Convolutional Neural Networks (CNNs) are useful for capturing the global structure of code and for recognizing patterns in binaries that have been converted to RGB or grayscale images. This article takes advantage of the spatial structure of malware images by using a series of convolutional layers pre-trained on ImageNet to generate feature maps, from which the model learns to recognize and group malicious code into malware families. This research adds a customized neural network on top of eight pre-trained networks (Xception, VGG16, VGG19, ResNet50, InceptionV3, MobileNet, MobileNetV2, and DenseNet169) to classify 10868 malware samples from the Microsoft Malware Classification Challenge dataset. Through parameter adjustments and an increase in the size of the dataset, which help the model generalize and reduce the risk of overfitting on malware that uses evasion techniques against classification, the approach achieves results close to 99%.

Keywords: malware classification, transfer learning, ResNet50, Convolutional Neural Networks, Xception, MobileNetV2, InceptionV3, MobileNet, VGG16, VGG19, DenseNet169
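
The abstract refers to binaries converted to grayscale or RGB images before being passed to the pre-trained networks. As a minimal sketch of that idea, and not the authors' actual preprocessing, the snippet below maps each byte of a file to one pixel intensity and wraps the bytes into rows of a fixed width. The function names, the row width, the zero padding, and the sample bytes are all assumptions made for illustration.

```python
import math


def bytes_to_grayscale(data, width=16):
    """Map each byte to one pixel (0-255) and wrap rows at `width`.

    Assumption: the final row is padded with zeros (black pixels)
    so that the resulting grid is rectangular.
    """
    if width <= 0:
        raise ValueError("width must be positive")
    height = math.ceil(len(data) / width)
    padded = data + bytes(height * width - len(data))
    return [list(padded[r * width:(r + 1) * width]) for r in range(height)]


def grayscale_to_rgb(image):
    """Replicate the single channel three times to get 3-channel pixels,
    the input layout that ImageNet-pretrained networks expect."""
    return [[(p, p, p) for p in row] for row in image]


# Hypothetical stand-in for the raw bytes of a malware sample.
sample = bytes(range(10))
image = bytes_to_grayscale(sample, width=4)
rgb = grayscale_to_rgb(image)
```

In practice the resulting grid would still need to be resized to the input resolution of the chosen network; that step is omitted here because the abstract does not specify it.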