AutoBioLearn: An Automated Data Science Framework for eXplainable Analyses (XAI) of Clinical Datasets

  • Lucas P. B. Moreira IFES
  • Maria L. G. Kuniyoshi USP
  • Zofia Wicik Med. Univ. Warsaw / Inst. of Psychiatry & Neurology
  • David C. Martins-Jr UFABC
  • Helena Brentani USP
  • Sérgio N. Simões IFES

Abstract

With the increasing volume of biological and medical data, efficient data science techniques have become essential for analysis. However, healthcare data scientists often need to integrate and analyze multiple datasets simultaneously. Although these analyses share similarities, each requires adjusting numerous parameters, which slows development and hinders knowledge discovery. In this paper, we propose a framework that encapsulates all stages of a typical data science analysis, from data pre-processing, model execution, and evaluation to model interpretation, including eXplainable AI (XAI) analyses. In tests on a clinical dataset, the framework achieved a 92% reduction in lines of code.
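The stages the framework encapsulates (pre-processing, execution, evaluation, and interpretation) can be illustrated with a minimal sketch. This is not AutoBioLearn's actual API; it uses scikit-learn as a stand-in to show the kind of repetitive pipeline code the framework is designed to replace, with permutation importance as a simple model-agnostic interpretation step.

```python
# Sketch of the analysis stages a typical clinical-data workflow repeats
# per dataset; AutoBioLearn automates these behind a single interface.
# Synthetic data is used here in place of a real clinical dataset.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.inspection import permutation_importance

X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stage 1-2: pre-processing (imputation, scaling) chained with the model
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_tr, y_tr)

# Stage 3: evaluation on held-out data
auc = roc_auc_score(y_te, pipe.predict_proba(X_te)[:, 1])

# Stage 4: interpretation via model-agnostic permutation importance
imp = permutation_importance(pipe, X_te, y_te, n_repeats=5, random_state=0)
print(f"AUC: {auc:.2f}")
```

Each new dataset forces small variations of this boilerplate (different imputation strategies, models, and explainers), which is the repetition the reported 92% line-count reduction targets.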

Published
02/12/2024
MOREIRA, Lucas P. B.; KUNIYOSHI, Maria L. G.; WICIK, Zofia; MARTINS-JR, David C.; BRENTANI, Helena; SIMÕES, Sérgio N. AutoBioLearn: An Automated Data Science Framework for eXplainable Analyses (XAI) of Clinical Datasets. In: SIMPÓSIO BRASILEIRO DE BIOINFORMÁTICA (BSB), 17., 2024, Vitória/ES. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 107-118. ISSN 2316-1248. DOI: https://doi.org/10.5753/bsb.2024.245584.