Can Data Complexity Measures Detect Pre-Training Bias in Machine Learning? A Case-Study with Health Data
Abstract
Bias in healthcare data can negatively impact vulnerable populations and reduce the reliability of predictive models. This work investigates the use of data complexity measures to identify features that may introduce bias during model training with machine learning (ML) algorithms. We analyze a synthetic dataset on schizophrenia and depression and a real dataset on liver disease, evaluating complexity across subgroups defined by protected attributes such as sex and race. The approach is validated against traditional pre-training bias metrics. Preliminary results suggest that data complexity measures can serve as an early indicator of bias, supporting the development of fairer and more transparent predictive models. This framework could inform bias mitigation strategies to improve model fairness in health-related ML applications.
References
Arruda, J. L., Prudêncio, R. B., and Lorena, A. C. (2020). Measuring instance hardness using data complexity measures. In Intelligent Systems: 9th Brazilian Conference, BRACIS 2020, Rio Grande, Brazil, October 20–23, 2020, Proceedings, Part II 9, pages 483–497. Springer.
Karamizadeh, S., Abdullah, S. M., Manaf, A. A., Zamani, M., and Hooman, A. (2013). An overview of principal component analysis. Journal of Signal and Information Processing, 4(3):173–175.
Lorena, A. C., Garcia, L. P. F., Lehmann, J., Souto, M. C. P., and Ho, T. K. (2019). How complex is your classification problem? A survey on measuring classification complexity. ACM Computing Surveys (CSUR), 52(1):1–34.
Maslej, M. et al. (2022). Intersectional-Bias-Assessment. INCF. Available on internet: [link].
Ramana, B. and Venkateswarlu, N. (2022). ILPD (Indian Liver Patient Dataset). UCI Machine Learning Repository. DOI: 10.24432/C5D02C.
Rodrigues, D. D. (2023). Assessing pre-training bias in health data and estimating its impact on machine learning algorithms. Bachelor’s thesis, Ciência da Computação, Instituto de Informática, Universidade Federal do Rio Grande do Sul.
Sotoca, J. M., Sánchez, J. S., and Mollineda, R. A. (2005). A review of data complexity measures and their applicability to pattern classification problems. Actas del III Taller Nacional de Mineria de Datos y Aprendizaje. TAMIDA, 77.
Published
9 June 2025
How to Cite
LEAL, Gabriel Difforeni; RODRIGUES, Diego Dimer; SILVA, Júlia Mombach da; RECAMONDE-MENDOZA, Mariana. Can Data Complexity Measures Detect Pre-Training Bias in Machine Learning? A Case-Study with Health Data. In: SIMPÓSIO BRASILEIRO DE COMPUTAÇÃO APLICADA À SAÚDE (SBCAS), 25., 2025, Porto Alegre/RS. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1011-1016. ISSN 2763-8952. DOI: https://doi.org/10.5753/sbcas.2025.7492.