Improving models performance in a data-centric approach applied to the healthcare domain

M. G. Valeriano; C. R. V. Kiffer; A. C. Lorena

doi:10.5753/kdmile.2024.244519

M. G. Valeriano Unifesp / ITA
C. R. V. Kiffer Unifesp
A. C. Lorena ITA

Abstract

Machine learning (ML) systems rely heavily on training data, and biases or limitations in a dataset can significantly impair the performance and trustworthiness of the resulting models. This paper proposes a data-centric approach based on instance hardness to improve ML systems: contrasting the profiles of groups of easy and hard instances in a dataset to design classification problems more effectively. We present a case study using a COVID dataset from a public repository to predict aggravated conditions from parameters collected at the patient’s initial attendance. Our goal was to investigate how different dataset design choices affect the performance of ML models. Using the concept of instance hardness, we identified instances that were consistently misclassified or consistently correctly classified, forming distinct groups of hard and easy instances for further investigation. By analyzing the relationship between the original class, the instance hardness level, and the information contained in the raw data source, we gained insights into how changes in the way the dataset is assembled can improve model performance. Although our analysis is conditioned by the characteristics of the problem, the findings demonstrate the potential of a data-centric perspective for improving predictive models in the healthcare domain.
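As a rough illustration of the grouping step described above, the sketch below scores each instance with one common definition of instance hardness: one minus the average probability that a pool of classifiers assigns to the instance's true class. The abstract does not state which hardness measure or cut-offs the authors used, so the function names, the thresholds `easy_max` and `hard_min`, and the toy probabilities are all assumptions made for this example.

```python
from statistics import mean


def instance_hardness(true_class_probs):
    """Return one hardness score per instance.

    true_class_probs[i][j] is the (ideally out-of-fold) probability that
    classifier j assigns to the true class of instance i. The score is
    1 - mean probability, so values near 1 mean "consistently misclassified".
    """
    return [1.0 - mean(probs) for probs in true_class_probs]


def split_by_hardness(hardness, easy_max=0.3, hard_min=0.7):
    """Group instance indices into easy and hard sets.

    The thresholds are illustrative assumptions, not values from the paper.
    Instances between the two thresholds are left out of both groups.
    """
    easy = [i for i, h in enumerate(hardness) if h <= easy_max]
    hard = [i for i, h in enumerate(hardness) if h >= hard_min]
    return easy, hard


# Toy data: three instances scored by three hypothetical classifiers.
probs = [
    [1.0, 0.75, 1.0],   # almost always classified correctly
    [0.5, 0.5, 0.5],    # borderline
    [0.0, 0.25, 0.25],  # almost always misclassified
]

hardness = instance_hardness(probs)
easy, hard = split_by_hardness(hardness)
print("hardness:", hardness)
print("easy instances:", easy)
print("hard instances:", hard)
```

The resulting index lists could then be used to compare the raw-data profiles of the two groups, for example by contrasting feature distributions or class labels between the easy and hard instances.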

Keywords: instance hardness, machine learning, healthcare
