Comparative Analysis of Supervised Algorithms for Protein Cluster Classification Using k-mer Image Embeddings

Giovanna A. P. Soares; Hannah I. da S. Marques; Matheus Dalmolin; Raquel de M. Barbosa; Marcelo A. C. Fernandes

doi:10.5753/bsb.2025.14606

Giovanna A. P. Soares, Universidade Federal do Rio Grande do Norte (UFRN)
Hannah I. da S. Marques, Universidade Federal do Rio Grande do Norte (UFRN)
Matheus Dalmolin, Universidade Federal do Rio Grande do Norte (UFRN)
Raquel de M. Barbosa, Granada University
Marcelo A. C. Fernandes, Universidade Federal do Rio Grande do Norte (UFRN)

Abstract

The rapid expansion of protein sequence databases requires effective computational strategies for accurate classification and functional annotation. This work presents a supervised learning framework for protein cluster classification using vector embeddings derived from k-mer image representations. Four supervised algorithms were compared using five-fold stratified cross-validation: L2-regularized Logistic Regression, Random Forest, k-Nearest Neighbors (kNN), and XGBoost. Embeddings extracted from k-mer images served as input features, enabling alignment-free and scalable classification. On the UniRef100 dataset, all models performed strongly. Logistic Regression obtained an accuracy of 98.1% and a macro F1-score of 0.981, while Random Forest, kNN, and XGBoost achieved higher accuracies of 99.7%, 99.4%, and 99.8%, respectively. XGBoost gave the best overall results, with an accuracy of 99.85%, an F1-score of 0.9985, an AUC of 1.000, and the lowest log loss (0.0071). On UniRef90, a more heterogeneous and challenging dataset, accuracy and F1-score decreased for all methods. Logistic Regression achieved 93.1% accuracy and an F1-score of 0.931, while Random Forest, kNN, and XGBoost obtained accuracies of 99.3%, 98.6%, and 99.6%, respectively. XGBoost again gave the best results, with an accuracy of 99.56%, an F1-score of 0.9957, an AUC of 0.9999, and a log loss of 0.0194. The confusion matrices for both datasets indicate that most protein clusters were correctly classified, with only minor misclassifications among the most challenging classes. These findings demonstrate the effectiveness of embedding-based representations and tree-based ensemble methods for robust and interpretable protein cluster classification, even on more complex and diverse datasets.
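
The evaluation protocol summarized above (five-fold stratified cross-validation scored with accuracy, macro F1-score, and log loss) can be illustrated with a minimal, standard-library-only sketch. This is not the authors' pipeline: the synthetic vectors stand in for the k-mer image embeddings, the small kNN classifier stands in for the four compared models, and every function name, parameter value, and dataset size below is a hypothetical choice made for illustration.

```python
import math
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=5, seed=0):
    """Assign sample indices to folds so each fold keeps the class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(n_folds)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal the shuffled indices of this class round-robin across folds.
        for position, idx in enumerate(indices):
            folds[position % n_folds].append(idx)
    return folds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def log_loss(y_true, probas, eps=1e-15):
    """Mean negative log-probability assigned to the true class."""
    total = 0.0
    for t, proba in zip(y_true, probas):
        total -= math.log(min(max(proba.get(t, 0.0), eps), 1.0))
    return total / len(y_true)


def knn_predict_proba(train_x, train_y, query, k=5):
    """Class frequencies among the k nearest training vectors (Euclidean)."""
    pairs = sorted(zip(train_x, train_y), key=lambda pair: math.dist(pair[0], query))
    nearest = pairs[:k]
    votes = defaultdict(int)
    for _, label in nearest:
        votes[label] += 1
    return {label: count / len(nearest) for label, count in votes.items()}


def cross_validate(x, y, n_folds=5, k=5, seed=0):
    """Return one dict of metrics per held-out fold."""
    results = []
    for held_out in stratified_folds(y, n_folds, seed):
        held = set(held_out)
        train_x = [x[i] for i in range(len(x)) if i not in held]
        train_y = [y[i] for i in range(len(y)) if i not in held]
        test_y = [y[i] for i in held_out]
        probas = [knn_predict_proba(train_x, train_y, x[i], k) for i in held_out]
        preds = [max(sorted(p), key=p.get) for p in probas]
        results.append({
            "accuracy": accuracy(test_y, preds),
            "macro_f1": macro_f1(test_y, preds),
            "log_loss": log_loss(test_y, probas),
        })
    return results


def make_synthetic_embeddings(n_classes=3, per_class=30, dim=8, seed=0):
    """Hypothetical stand-in data: Gaussian blobs, one per protein cluster."""
    rng = random.Random(seed)
    x, y = [], []
    for c in range(n_classes):
        center = [rng.uniform(-5, 5) for _ in range(dim)]
        for _ in range(per_class):
            x.append([v + rng.gauss(0, 0.5) for v in center])
            y.append(c)
    return x, y


if __name__ == "__main__":
    features, targets = make_synthetic_embeddings()
    for fold_number, metrics in enumerate(cross_validate(features, targets), start=1):
        print(fold_number, metrics)
```

In practice one would more likely rely on an established library such as scikit-learn (`StratifiedKFold`, `f1_score` with `average="macro"`, `log_loss`) and the `xgboost` package; the hand-written versions here only make explicit what each reported metric measures.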

Keywords: Protein Classification, Supervised Machine Learning, Bioinformatics, K-mers, Embeddings, Vision Transformer
