Data stratification analysis on the propagation of discriminatory effects in binary classification

  • Diego Minatel Universidade de São Paulo (USP)
  • Angelo Cesar Mendes da Silva Universidade de São Paulo (USP)
  • Nícolas Roque dos Santos Universidade de São Paulo (USP)
  • Mariana Curi Universidade de São Paulo (USP)
  • Ricardo Marcondes Marcacini Universidade de São Paulo (USP)
  • Alneu de Andrade Lopes Universidade de São Paulo (USP)


Unfair decision-making supported by machine learning, which harms or benefits a specific group of people, is frequent. In many cases, the models only reproduce the biases in the data, which does not absolve them of responsibility for these decisions. Thus, as machine learning models automate more activities, it is essential to seek solutions that add fairness factors to the models and bring clarity to the decisions they support. One option to mitigate model discrimination is to quantify the ratio of instances belonging to each target class in order to build data sets that approximate the actual data distribution. This alternative aims to reduce the responsibility of the data for discriminatory effects and to shift the task of treating them to the models. In this sense, we propose to analyze different types of data stratification, including stratification by historically unprivileged sociodemographic groups, and to associate these stratification types with fairer or less fair models. According to our results, stratification by class and group of people helps to develop fairer models, reducing the discriminatory effects in binary classification.
Keywords: analysis, binary classification, data bias, data stratification, discriminatory effects, fairness, machine learning, unfairness
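
The idea of stratifying by class and group of people can be illustrated with a small sketch. It is not the authors' experimental code: the helper `stratified_split`, its parameters, and the toy data are hypothetical, and it assumes that "stratification by class and group" means each (class, group) combination keeps roughly the same share in the training and test partitions.

```python
import random
from collections import Counter, defaultdict


def stratified_split(labels, groups, test_fraction=0.25, seed=0):
    """Split indices so every (class, group) stratum keeps its share.

    Hypothetical helper: `labels` holds the binary target and `groups`
    a sociodemographic attribute, one entry per instance.
    """
    rng = random.Random(seed)

    # Collect the indices of the instances that fall in each stratum.
    strata = defaultdict(list)
    for index, key in enumerate(zip(labels, groups)):
        strata[key].append(index)

    train, test = [], []
    for key in sorted(strata):
        members = strata[key]
        rng.shuffle(members)
        # Take the same fraction from every stratum for the test set.
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return sorted(train), sorted(test)


# Toy data (assumed): 1 is the favourable class, "a" and "b" are two groups.
labels = [1] * 8 + [0] * 4 + [1] * 4 + [0] * 8
groups = ["a"] * 12 + ["b"] * 12

train_idx, test_idx = stratified_split(labels, groups, test_fraction=0.25)
test_counts = Counter((labels[i], groups[i]) for i in test_idx)
print(test_counts)
```

Stratifying on the class alone would preserve the overall class ratio but could still leave one group's favourable outcomes under-represented in a partition; using the joint key avoids that in this toy setting.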


MINATEL, Diego; DA SILVA, Angelo Cesar Mendes; DOS SANTOS, Nícolas Roque; CURI, Mariana; MARCACINI, Ricardo Marcondes; LOPES, Alneu de Andrade. Data stratification analysis on the propagation of discriminatory effects in binary classification. In: SYMPOSIUM ON KNOWLEDGE DISCOVERY, MINING AND LEARNING (KDMILE), 11., 2023, Belo Horizonte/MG. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 73-80. ISSN 2763-8944. DOI: