Feature Selection Investigation in Machine Learning Docking Scoring Functions


The in silico evaluation of interactions between small molecules (ligands) and receptors (proteins) is of great importance, especially in drug design. It is one of the principal computational methodologies that can be incorporated into the process of proposing new drugs, with the aim of reducing the high financial cost and time involved. In this context, molecular docking is a computer simulation procedure used to predict the best conformation and orientation of a ligand in the binding site of a target protein. Docking algorithms evaluate protein-ligand interactions using scoring functions (SFs). SFs computationally quantify the binding affinity of the complex and can be divided into four categories according to the methodology applied in their development: physics-based, empirical, knowledge-based and machine learning. Machine learning (ML) scoring functions are trained on features obtained from known protein-ligand complexes together with their experimental affinities, and they rely heavily on the set of features used to train them. Thus, in this work, we use PCA, ANOVA and Random Forest to investigate how these feature selection methods affect the performance of three ML scoring functions trained with Support Vector Machines, Elastic Net regularization and Neural Networks. The results show that Neural Networks can benefit greatly from feature selection performed by Random Forest, but not from ANOVA or PCA. We conclude that feature selection can improve regression results and that, in this study, Neural Networks combined with Random Forest feature selection is the best option.
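To make the idea of filter-style feature selection concrete, the following is a minimal sketch of a univariate F-test ranking, the kind of test often labelled ANOVA feature selection in regression settings. It is not the authors' implementation: the function names (`pearson_r`, `f_scores`, `select_k_best`), the synthetic descriptors and the choice of `k` are all assumptions made for illustration, and the abstract does not state which ANOVA variant or toolkit was used.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)


def f_scores(rows, target):
    """Univariate F statistic of each feature column against the target.

    For a single predictor, F = r^2 / (1 - r^2) * (n - 2).
    """
    n = len(rows)
    scores = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        r2 = pearson_r(column, target) ** 2
        r2 = min(r2, 1.0 - 1e-12)  # guard against division by zero
        scores.append(r2 / (1.0 - r2) * (n - 2))
    return scores


def select_k_best(rows, target, k):
    """Return the (sorted) indices of the k features with the highest F."""
    scores = f_scores(rows, target)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical stand-in for protein-ligand descriptors: features 0 and 2
# drive the synthetic "affinity", features 1 and 3 are pure noise.
rng = random.Random(0)
rows = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(200)]
affinity = [2.0 * r[0] - 1.5 * r[2] + rng.gauss(0, 0.1) for r in rows]

selected = select_k_best(rows, affinity, k=2)
print(selected)
```

A filter like this scores each feature independently of the downstream regressor, so it cannot capture interactions between features. That may be one reason a model-based ranking such as Random Forest importance behaves differently in a comparison like the one described above.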

Keywords: Rational Drug Design, Molecular Docking, Machine Learning, Feature Selection, Scoring Functions


BALBON, Maurício Dorneles Caldeira; ARRUDA, Oscar Emilio; WERHLI, Adriano V.; MACHADO, Karina dos Santos. Feature Selection Investigation in Machine Learning Docking Scoring Functions. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 16., 2023, Curitiba/PR. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 58-69. ISSN 2316-1248.