Can Data Complexity Measures Detect Pre-Training Bias in Machine Learning? A Case-Study with Health Data

Gabriel Difforeni Leal; Diego Dimer Rodrigues; Júlia Mombach da Silva; Mariana Recamonde-Mendoza

doi:10.5753/sbcas.2025.7492

Gabriel Difforeni Leal UFRGS
Diego Dimer Rodrigues UFRGS
Júlia Mombach da Silva UFRGS
Mariana Recamonde-Mendoza UFRGS / HCPA


Abstract

Bias in healthcare data can negatively impact vulnerable populations and reduce the reliability of predictive models. This work investigates the use of data complexity measures to identify features that may introduce bias during model training with machine learning (ML) algorithms. We analyze a synthetic dataset on schizophrenia and depression and a real dataset on liver disease, evaluating complexity across subgroups defined by protected attributes such as sex and race. The approach is validated against traditional pre-training bias metrics. Preliminary results suggest that data complexity measures can serve as an early indicator of bias, supporting the development of fairer and more transparent predictive models. This framework could inform bias mitigation strategies to improve model fairness in health-related ML applications.
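To make the idea of evaluating complexity across subgroups concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not say which complexity measures or bias metrics were used, so the sketch assumes one common choice of each: the maximum per-feature Fisher discriminant ratio as the complexity measure, and the difference in positive-label proportions between two groups as the pre-training bias metric. All function names and the toy data are hypothetical.

```python
import numpy as np


def max_fisher_ratio(X, y):
    """Largest per-feature Fisher discriminant ratio for a binary problem.

    Higher values mean at least one feature separates the two classes
    well, i.e. the subgroup is less complex to classify.
    """
    X0, X1 = X[y == 0], X[y == 1]
    num = (X0.mean(axis=0) - X1.mean(axis=0)) ** 2
    den = X0.var(axis=0) + X1.var(axis=0)
    # Guard against zero variance in both classes.
    return float((num / np.maximum(den, 1e-12)).max())


def complexity_by_group(X, y, group):
    """Compute the measure separately for each value of a protected attribute."""
    return {
        g.item(): max_fisher_ratio(X[group == g], y[group == g])
        for g in np.unique(group)
    }


def label_proportion_gap(y, group, a, b):
    """Difference in positive-label rates between groups a and b."""
    return float(y[group == a].mean() - y[group == b].mean())


# Toy data (assumption): one feature, binary label, protected attribute A/B.
X = np.array([[0.0], [1.0], [4.0], [5.0], [0.0], [3.0], [2.0], [5.0]])
y = np.array([0, 0, 1, 1, 0, 0, 1, 1])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])

scores = complexity_by_group(X, y, group)
gap = label_proportion_gap(y, group, "A", "B")
print(scores, gap)
```

In this toy data both groups have the same positive-label rate, yet the classes are far easier to separate in group A than in group B. This is the kind of per-subgroup difference that a complexity measure can surface alongside a label-based bias metric.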

