Evaluative Bias and Compromised Generalization: The Impact of Identical Samples in Android Malware Datasets
Abstract
In this work, we analyzed public datasets used for Android malware detection and investigated how identical samples, and the groups they form, affect the performance and generalization ability of machine learning models. Our tests across six scenarios show that identical samples artificially inflate performance metrics, creating a misleading impression of efficacy. In datasets with few unique samples, we also observed that models struggle to generalize to new data. We conclude that accurate evaluation requires a test set whose samples do not also appear in the training data; otherwise, conclusions about classifier capabilities can be misleading.
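As an illustration of the recommendation above, the following minimal sketch groups rows with identical feature vectors and splits by group, so that no identical sample lands in both the training and test sets. It is not the procedure used in the paper: the function names, the toy permission-style data, and the choice to keep one representative per test group are all assumptions made for this example.

```python
import random
from collections import defaultdict


def group_identical(samples):
    """Group row indices by feature vector (hypothetical helper).

    `samples` is a list of (features, label) pairs; rows whose feature
    vectors are exactly equal end up in the same group.
    """
    groups = defaultdict(list)
    for idx, (features, _label) in enumerate(samples):
        groups[tuple(features)].append(idx)
    return list(groups.values())


def split_without_leakage(samples, test_fraction=0.3, seed=0):
    """Split by group so identical rows never span train and test."""
    groups = group_identical(samples)
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test_groups = max(1, round(len(groups) * test_fraction))
    # Assumption: keep a single representative of each test group, so the
    # test set holds only unique samples that are absent from training.
    test_idx = [group[0] for group in groups[:n_test_groups]]
    train_idx = [i for group in groups[n_test_groups:] for i in group]
    return train_idx, test_idx


# Toy data (invented): binary feature vectors with a malware label.
samples = [
    ([1, 0, 1], 1),
    ([1, 0, 1], 1),  # identical to the first row
    ([0, 1, 0], 0),
    ([0, 1, 0], 0),  # identical to the third row
    ([1, 1, 0], 1),
    ([0, 0, 1], 0),
]

train_idx, test_idx = split_without_leakage(samples)
train_vectors = {tuple(samples[i][0]) for i in train_idx}
test_vectors = [tuple(samples[i][0]) for i in test_idx]
print("overlap:", train_vectors & set(test_vectors))
```

Splitting at the group level rather than the row level is the design choice that matters here: a plain random row split could place one copy of a duplicated sample in training and another in testing, which is the kind of overlap the abstract identifies as inflating metrics.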
Keywords:
Android Malware, Machine Learning, Identical Samples, Unique Samples, Performance Evaluation, Generalization
References
Zhao, Y. et al. On the Impact of Sample Duplication in Machine-Learning-Based Android Malware Detection. ACM Trans. Softw. Eng. Methodol., Association for Computing Machinery, New York, NY, USA, v. 30, n. 3, May 2021. ISSN 1049-331X. DOI: 10.1145/3446905.
Gaber, M. G.; Ahmed, M.; Janicke, H. Malware detection with artificial intelligence: A systematic literature review. ACM Computing Surveys, ACM New York, NY, v. 56, n. 6, p. 1–33, 2024.
Budach, L. et al. The effects of data quality on machine learning performance. arXiv preprint arXiv:2207.14529, 2022.
Barz, B.; Denzler, J. Do we train on test data? purging cifar of near-duplicates. Journal of Imaging, MDPI, v. 6, n. 6, p. 41, 2020.
Sarracino, F.; Mikucka, M. Estimation bias due to duplicated observations: a Monte Carlo simulation, 2016.
Huang, K. et al. Learning classifiers from imbalanced data based on biased minimax probability machine. In: IEEE. PROCEEDINGS of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2004. CVPR 2004. 2004. v. 2, p. ii–ii.
Allamanis, M. The adverse effects of code duplication in machine learning models of code. In: PROCEEDINGS of the 2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software. 2019. P. 143–153.
Pendlebury, F. et al. TESSERACT: Eliminating experimental bias in malware classification across space and time. In: 28TH USENIX security symposium (USENIX Security 19). 2019. P. 729–746.
Alam, M. T.; Bhusal, D.; Rastogi, N. Revisiting Static Feature-Based Android Malware Detection. arXiv preprint arXiv:2409.07397, 2024.
Şahin, D. Ö. et al. A novel permission-based Android malware detection system using feature selection based on linear regression. Neural Computing and Applications, Springer, p. 1–16, 2023.
Mathur, A. et al. NATICUSdroid: A malware detection framework for Android using native and custom permissions. Journal of Information Security and Applications, Elsevier, v. 58, p. 102696, 2021.
Palumbo, P. et al. A pragmatic android malware detection procedure. Computers & Security, Elsevier, v. 70, p. 689–701, 2017.
Martín, A. et al. ADROIT: Android malware detection using meta-information. In: IEEE. 2016 IEEE Symposium Series on Computational Intelligence (SSCI). 2016. P. 1–8.
Sisto, A. AndroCrawl: studying alternative Android marketplaces. Politecnico di Milano, 2012.
Mahindru, A. Android permission dataset. Mendeley Data, v. 1, p. 2018, 2018.
Colaco, C. et al. Defensedroid: A modern approach to android malware detection. Strad Research, v. 8, n. 5, p. 271–282, 2021.
Yerima, S. Y.; Sezer, S. Droidfusion: A novel multilevel classifier fusion approach for android malware detection. IEEE transactions on cybernetics, IEEE, v. 49, n. 2, p. 453–466, 2018.
Guerra-Manzanares, A.; Bahsi, H.; Nõmm, S. Kronodroid: time-based hybrid-featured dataset for effective android malware detection and characterization. Computers & Security, Elsevier, v. 110, p. 102399, 2021.
Wang, W. et al. Constructing features for detecting android malicious applications: issues, taxonomy and directions. IEEE access, IEEE, v. 7, p. 67602–67631, 2019.
Powers, D. M. W. Evaluation: From precision, recall and F-measure to ROC, informedness, markedness and correlation. Journal of Machine Learning Technologies, v. 2, n. 1, p. 37–63, 2011.
Sokolova, M.; Lapalme, G. A systematic analysis of performance measures for classification tasks. Information Processing & Management, Elsevier, v. 45, n. 4, p. 427–437, 2009.
Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA)-Protein Structure, Elsevier, v. 405, n. 2, p. 442–451, 1975.
Chicco, D.; Jurman, G. The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics, BioMed Central, v. 21, n. 1, p. 6, 2020.
Published
2024-11-27
How to Cite
CANTO, Gabriel Sousa; ROCHA, Vanderson; KREUTZ, Diego; BRAGANÇA, Hendrio; FEITOSA, Eduardo. Evaluative Bias and Compromised Generalization: The Impact of Identical Samples in Android Malware Datasets. In: REGIONAL SCHOOL OF COMPUTER NETWORKS (ERRC), 21., 2024, Rio Grande/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 177-182. DOI: https://doi.org/10.5753/errc.2024.4688.