MH-1M: One of The Most Comprehensive and Up-to-Date Dataset for Advanced Android Malware Detection
Abstract
We introduce MH-1M, one of the most comprehensive and up-to-date datasets for advanced Android malware research. The dataset comprises 1,340,515 applications and covers diverse features and extensive metadata. For precise malware assessment, we use the VirusTotal API, integrating multiple detection methods to ensure reliable outcomes. Our GitHub repository gives users access to the processed dataset and its associated metadata, totaling over 400 GB, including the complete outputs of the feature extraction process and the VirusTotal metadata files. Our findings underscore the role of MH-1M as a valuable resource for understanding the evolving landscape of malware.
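As an illustration of how VirusTotal metadata can be turned into a label, the sketch below counts the engines that flagged a sample and applies a threshold. It is a hypothetical example, not the labeling procedure used for MH-1M: the function names and the `min_detections` threshold are assumptions, and the report is assumed to follow the VirusTotal API v3 file-object layout, where `last_analysis_stats` holds per-verdict engine counts.

```python
from typing import Any, Dict


def count_detections(report: Dict[str, Any]) -> int:
    """Return how many engines flagged the sample as malicious.

    Assumes a VirusTotal API v3 file report, where the per-verdict
    counts live under data.attributes.last_analysis_stats.
    """
    stats = report["data"]["attributes"]["last_analysis_stats"]
    return int(stats.get("malicious", 0))


def label_sample(report: Dict[str, Any], min_detections: int = 4) -> str:
    """Label a sample from its detection count.

    min_detections is an illustrative threshold chosen for this sketch;
    it is not a value taken from the MH-1M paper.
    """
    if count_detections(report) >= min_detections:
        return "malware"
    return "benign"


# Hand-written report fragment used only to exercise the functions.
example_report = {
    "data": {
        "attributes": {
            "last_analysis_stats": {
                "malicious": 7,
                "suspicious": 1,
                "undetected": 55,
                "harmless": 0,
            }
        }
    }
}

print(label_sample(example_report))
```

Raising the threshold trades recall for precision: a stricter `min_detections` marks fewer borderline samples as malware, which matters when engines disagree.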