Exploring conditional missing patterns for automated bacteria identification using MALDI-TOF MS data

  • J. C. F. da Rocha UEPG
  • A. Campos Jr. UEPG
  • R. M. Etto UEPG
  • C. W. Galvão UEPG
  • G. L. Fedacz UEPG
  • R. R. da Silva UEPG
  • A. S. S. Oliveira UEPG

Abstract


Learning classifiers for the automatic identification of bacteria from MALDI-TOF mass spectrometry fingerprints requires handling incomplete datasets in which values are missing conditionally on the classification hypothesis, a situation known as a conditional missing pattern (CMP). CMP is a missing-not-at-random (MNAR) pattern that itself provides evidence for classification. One strategy for handling CMP is feature stratification. Accordingly, this work evaluated the effectiveness of stratification in training naive Bayes classifiers through two experiments. The first compared the predictive performance of categorical classifiers trained on stratified data with that of Gaussian classifiers trained on previously imputed data. The second estimated the impact of class imbalance on the performance difference between the Gaussian and categorical classifiers. The ANOVA results suggest that feature stratification leads to more accurate classifiers. The correlation analysis showed that class imbalance had little influence on the difference in classifier performance.
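The comparison made in the first experiment can be illustrated with a minimal sketch. It is not the authors' pipeline. It assumes that "feature stratification" means discretizing each feature and reserving a dedicated category for missing values, and it uses synthetic data in which one feature is missing far more often in one class. The helper names (`make_toy_data`, `stratify`), the bin count, the mean-imputation baseline, and all numeric settings are hypothetical choices made for illustration only.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.naive_bayes import CategoricalNB, GaussianNB


def make_toy_data(n=400, seed=0):
    """Synthetic (hypothetical) data: feature 0 is missing mostly in class 1."""
    rng = np.random.default_rng(seed)
    y = rng.integers(0, 2, size=n)
    X = rng.normal(loc=5000.0, scale=50.0, size=(n, 2))
    # Class-conditional missingness: the absence itself carries class evidence.
    p_missing = np.where(y == 1, 0.8, 0.1)
    X[rng.random(n) < p_missing, 0] = np.nan
    return X, y


def stratify(X_train, X_test, n_bins=3):
    """Discretize each feature; category 0 is reserved for 'missing'."""
    out_train = np.zeros(X_train.shape, dtype=int)
    out_test = np.zeros(X_test.shape, dtype=int)
    for j in range(X_train.shape[1]):
        col = X_train[:, j]
        observed = col[~np.isnan(col)]
        # Interior quantile edges, estimated from observed training values only.
        edges = np.quantile(observed, np.linspace(0, 1, n_bins + 1)[1:-1])
        for src, dst in ((X_train, out_train), (X_test, out_test)):
            mask = ~np.isnan(src[:, j])
            # Observed values map to categories 1..n_bins.
            dst[mask, j] = np.digitize(src[mask, j], edges) + 1
    return out_train, out_test


X, y = make_toy_data()
split = 300
X_tr, X_te, y_tr, y_te = X[:split], X[split:], y[:split], y[split:]

# Categorical naive Bayes on stratified data (missing kept as a category).
C_tr, C_te = stratify(X_tr, X_te)
cat_nb = CategoricalNB(min_categories=4).fit(C_tr, y_tr)
acc_categorical = cat_nb.score(C_te, y_te)

# Gaussian naive Bayes on mean-imputed data (an assumed imputation baseline).
imputer = SimpleImputer(strategy="mean").fit(X_tr)
gauss_nb = GaussianNB().fit(imputer.transform(X_tr), y_tr)
acc_gaussian = gauss_nb.score(imputer.transform(X_te), y_te)

print(f"categorical NB (stratified): {acc_categorical:.3f}")
print(f"Gaussian NB (imputed):       {acc_gaussian:.3f}")
```

Keeping "missing" as its own category lets the categorical model estimate how likely absence is within each class, whereas imputation replaces that signal with a fabricated value. Results on toy data of this kind say nothing about the magnitude of the effect reported in the paper.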

Published
8 November 2023
ROCHA, J. C. F. da; CAMPOS JR., A.; ETTO, R. M.; GALVÃO, C. W.; FEDACZ, G. L.; SILVA, R. R. da; OLIVEIRA, A. S. S. Exploring conditional missing patterns for automated bacteria identification using MALDI-TOF MS data. In: CONGRESSO BRASILEIRO DE AGROINFORMÁTICA (SBIAGRO), 14., 2023, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 222-229. ISSN 2177-9724. DOI: https://doi.org/10.5753/sbiagro.2023.26562.