Feature selection and comparison of classifiers for predicting protein class

Authors

  • Bruno C. Santos Pontifical Catholic University of Minas Gerais
  • Cora Silberschneider Pontifical Catholic University of Minas Gerais
  • Marcos W. Rodrigues Pontifical Catholic University of Minas Gerais
  • Cristiano L. N. Pinto Pontifical Catholic University of Minas Gerais
  • Cristiane N. Nobre Pontifical Catholic University of Minas Gerais
  • Luis E. Zárate Pontifical Catholic University of Minas Gerais

DOI:

https://doi.org/10.5753/jidm.2019.2034

Keywords:

Feature Selection, Classifiers, Multi-Objective Genetic Algorithm, Protein Prediction

Abstract

Knowing the function of proteins is essential for understanding many biological systems. Laboratory experiments to determine protein class are costly and time-consuming, so efficient computational models are needed to identify the class to which a protein belongs. A significant volume of information about proteins and their structure is continually being made available in public data repositories. For example, the STING_DB database holds extensive information extracted from all protein structural levels (primary, secondary, tertiary, and quaternary), which is frequently used in classification models for this type of problem. However, it is unknown which physical-chemical properties are most relevant to predicting the class, so a more suitable subset of properties needs to be identified. In this work, we propose an approach that combines a multi-objective genetic algorithm with the k-NN classifier to select the best physical-chemical properties, yielding a smaller subset of features that contribute significantly to the prediction problem. To improve prediction performance, we perform a post-enrichment process and then compare the performance of our methodology across several classifiers: ANN, SVM, Random Forest, and k-NN. Our method achieved an average F-measure of 70.22% with the Random Forest classifier. Finally, a comparative analysis with statistical significance testing shows the relevance of our approach in relation to other methodologies.
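
The idea of scoring a feature subset on two competing objectives can be illustrated with a minimal sketch. It is not the authors' implementation: the toy data, the class labels "A" and "B", the choice of k = 3, leave-one-out evaluation, and every function name are assumptions made for illustration. In place of a genetic algorithm, the sketch simply enumerates all feature masks of a four-feature toy problem and keeps the non-dominated ones; a multi-objective genetic algorithm would instead evolve a population of masks toward that front.

```python
import itertools
import random
from collections import Counter


def knn_predict(train, query, mask, k=3):
    """Majority vote among the k nearest rows, using only features where mask is 1."""
    def dist(row):
        return sum((a - b) ** 2 for a, b, m in zip(row[0], query, mask) if m)
    nearest = sorted(train, key=dist)[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def macro_f_measure(y_true, y_pred):
    """Unweighted mean of the per-class F-measures."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        scores.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(scores) / len(scores)


def objectives(data, mask, k=3):
    """Return (number of selected features, 1 - leave-one-out macro F-measure).

    Both values are to be minimized.
    """
    y_true = [label for _, label in data]
    y_pred = [
        knn_predict(data[:i] + data[i + 1:], x, mask, k)
        for i, (x, _) in enumerate(data)
    ]
    return (sum(mask), 1.0 - macro_f_measure(y_true, y_pred))


def dominates(a, b):
    """True if score a is no worse than b on every objective and differs from b."""
    return all(x <= y for x, y in zip(a, b)) and a != b


def pareto_front(scored):
    """Keep the (mask, score) pairs whose score no other score dominates."""
    return [(m, s) for m, s in scored if not any(dominates(t, s) for _, t in scored)]


# Hypothetical toy data: feature 0 separates the two classes, features 1-3 are noise.
rng = random.Random(0)
data = []
for i in range(8):
    for label, center in (("A", 0.0), ("B", 5.0)):
        noise = [rng.uniform(-5, 5) for _ in range(3)]
        data.append(([center + 0.1 * i] + noise, label))

# Exhaustive enumeration stands in for the genetic search; the empty mask is skipped.
masks = [m for m in itertools.product([0, 1], repeat=4) if any(m)]
front = pareto_front([(m, objectives(data, m)) for m in masks])
for mask, score in front:
    print(mask, score)
```

Exhaustive enumeration is only feasible for a handful of features, which is why a search heuristic becomes necessary as the number of candidate properties grows.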

Published

2019-12-30

How to Cite

Santos, B. C., Silberschneider, C., Rodrigues, M. W., Pinto, C. L. N., Nobre, C. N., & Zárate, L. E. (2019). Feature selection and comparison of classifiers for predicting protein class. Journal of Information and Data Management, 10(3), 146 –. https://doi.org/10.5753/jidm.2019.2034

Section

KDMILE 2018