Feature selection using a multiobjective Genetic Algorithm and k-NN for protein function prediction

  • Bruno C. Santos PUC-MG
  • Cora Silberschneider PUC-MG
  • Marcos W. Rodrigues PUC-MG
  • Cristiane N. Nobre PUC-MG
  • Luis E. Zárate PUC-MG

Abstract


Knowledge of a protein's function is essential in many areas, such as bioinformatics and agriculture, so efficient computational models for predicting protein function are needed. A wealth of information about proteins is currently available, including data on primary, secondary, tertiary, and quaternary structures. One repository that provides this information is Sting DB, which holds physicochemical information about proteins and has been used by several authors. Our work proposes a methodology that uses a multiobjective genetic algorithm with the non-parametric k-NN method during its genetic evolution to search for the best subset of physicochemical characteristics for identifying protein classes. We then added new variables and applied PCA to the identified subset to improve the classification process. In this step, we used an SVM because it performs better on high-dimensional data. The proposed methodology achieved an accuracy of 72.9% and an F-measure of 68.3%. Our approach was also about 90% more efficient in processing than the previous model, which makes it possible to add new attributes in future work in an attempt to improve the prediction of protein function.
Keywords: Feature Selection, k-Nearest Neighbor, Multi-Objective Genetic Algorithm, Protein Prediction
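
The wrapper idea described in the abstract, scoring candidate feature subsets with k-NN inside a multiobjective search, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration and not the authors' implementation: the two objectives (maximize leave-one-out k-NN accuracy, minimize the number of selected features), the toy data, and all function names. The genetic search itself is replaced by exhaustive enumeration of the eight possible masks, so only the fitness evaluation and Pareto dominance are shown; the PCA and SVM stage is not covered.

```python
from collections import Counter
from itertools import product

def knn_predict(train_X, train_y, x, k=3):
    """Majority vote among the k nearest training points (squared Euclidean distance)."""
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )[:k]
    return Counter(train_y[i] for i in nearest).most_common(1)[0][0]

def evaluate(mask, X, y, k=3):
    """Return (leave-one-out k-NN accuracy, number of selected features).

    Assumed objectives: maximize the first value, minimize the second.
    """
    if not any(mask):
        # An empty subset cannot classify; score it worst on both objectives.
        return 0.0, len(mask)
    Xs = [[v for v, keep in zip(row, mask) if keep] for row in X]
    hits = 0
    for i in range(len(Xs)):
        rest_X = Xs[:i] + Xs[i + 1:]
        rest_y = y[:i] + y[i + 1:]
        hits += knn_predict(rest_X, rest_y, Xs[i], k) == y[i]
    return hits / len(Xs), sum(mask)

def dominates(a, b):
    """True if score a is no worse than b on both objectives and strictly better on one."""
    return a[0] >= b[0] and a[1] <= b[1] and (a[0] > b[0] or a[1] < b[1])

def pareto_front(population, scores):
    """Keep the individuals whose score is not dominated by any other score."""
    return [ind for ind, s in zip(population, scores)
            if not any(dominates(t, s) for t in scores)]

# Hypothetical toy data: feature 0 separates the classes, features 1 and 2 are noise.
X = [
    [0.0,  5.0,  3.0], [0.1, -5.0,  3.0], [0.2,  5.0, -3.0], [0.3, -5.0, -3.0],
    [1.0,  5.0,  3.0], [1.1, -5.0,  3.0], [1.2,  5.0, -3.0], [1.3, -5.0, -3.0],
]
y = [0, 0, 0, 0, 1, 1, 1, 1]

# Exhaustive enumeration stands in for the genetic search on this tiny problem.
population = list(product([0, 1], repeat=3))
scores = [evaluate(mask, X, y) for mask in population]
front = pareto_front(population, scores)
print(front)
```

On this toy data the only non-dominated mask is the one that keeps just the informative feature. In a real setting, an evolutionary loop with selection, crossover, and mutation would replace the enumeration, since the number of possible masks grows exponentially with the number of features.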

Published
22/10/2018
SANTOS, Bruno C.; SILBERSCHNEIDER, Cora; RODRIGUES, Marcos W.; NOBRE, Cristiane N.; ZÁRATE, Luis E. Seleção de características utilizando Algoritmo Genético multiobjetivo e k-NN para predição de função de proteína. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 6., 2018, São Paulo/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 25-32. ISSN 2763-8944. DOI: https://doi.org/10.5753/kdmile.2018.27381.