Missing Data Under Green AI Umbrella
Abstract
Missing data is a common issue that can undermine machine learning performance, and imputation methods have emerged as state-of-the-art solutions. However, training these methods can be costly and environmentally impactful. In this work, we investigate the missing data problem under Green AI constraints using a Data-Centric AI approach. We evaluate three missingness mechanisms, four missing rates, and ten datasets to assess both data quality and downstream performance. We also propose an optimization model that selects the best-performing imputation method subject to sustainability constraints, offering a path toward more responsible and effective data imputation.
References
Ali, N. A. and Omer, Z. M. (2017). Improving accuracy of missing data imputation in data mining. Kurdistan Journal of Applied Research, pages 66–73.
Bolón-Canedo, V., Morán-Fernández, L., Cancela, B., and Alonso-Betanzos, A. (2024). A review of green artificial intelligence: Towards a more sustainable future. Neurocomputing, 599:128096.
van Buuren, S. and Groothuis-Oudshoorn, C. (2011). mice: Multivariate imputation by chained equations in R. Journal of Statistical Software, 45(3):1–67.
Clemente, F., Ribeiro, G. M., Quemy, A., Santos, M. S., Pereira, R. C., and Barros, A. (2023). ydata-profiling: Accelerating data-centric AI with high-quality data. Neurocomputing, 554:126585.
Emmanuel, T., Maupong, T., Mpoeleng, D., Semong, T., Mphago, B., and Tabona, O. (2021). A survey on missing data in machine learning. Journal of Big Data, 8(1).
García-Laencina, P. J., Sancho-Gómez, J.-L., and Figueiras-Vidal, A. R. (2010). Pattern classification with missing data: a review. Neural Computing and Applications, 19(2):263–282.
Hasan, M. K., Alam, M. A., Roy, S., Dutta, A., Jawad, M. T., and Das, S. (2021). Missing value imputation affects the performance of machine learning: A review and analysis of the literature (2010–2021). Informatics in Medicine Unlocked, 27:100799.
Lacoste, A., Luccioni, A., Schmidt, V., and Dandres, T. (2019). Quantifying the carbon emissions of machine learning. arXiv preprint arXiv:1910.09700.
Mangussi, A. D., Pereira, R. C., Abreu, P. H., and Lorena, A. C. (2025a). Assessing adversarial effects of noise in missing data imputation. In Paes, A. and Verri, F. A. N., editors, Intelligent Systems, pages 200–214, Cham. Springer Nature Switzerland.
Mangussi, A. D., Pereira, R. C., Lorena, A. C., Santos, M. S., and Abreu, P. H. (2025b). Studying the robustness of data imputation methodologies against adversarial attacks. Computers & Security, 157:104574.
Mangussi, A. D., Santos, M. S., Lopes, F. L., Pereira, R. C., Lorena, A. C., and Abreu, P. H. (2025c). mdatagen: A Python library for the artificial generation of missing data. Neurocomputing, 625:129478.
Pereira, R. C., Abreu, P. H., and Rodrigues, P. P. (2024a). Siamese autoencoder architecture for the imputation of data missing not at random. Journal of Computational Science, 78:102269.
Pereira, R. C., Abreu, P. H., Rodrigues, P. P., and Figueiredo, M. A. (2024b). Imputation of data missing not at random: Artificial generation and benchmark analysis. Expert Systems with Applications, 249:123654.
Salehi, S. and Schmeink, A. (2024). Data-centric green artificial intelligence: A survey. IEEE Transactions on Artificial Intelligence, 5(5):1973–1989.
Santos, M. S., Pereira, R. C., Costa, A. F., Soares, J. P., Santos, J., and Abreu, P. H. (2019). Generating synthetic missing data: A review by missing mechanism. IEEE Access, 7:11651–11667.
Shwartz-Ziv, R. and Armon, A. (2022). Tabular data: Deep learning is not all you need. Information Fusion, 81:84–90.
Stekhoven, D. and Bühlmann, P. (2012). MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics (Oxford, England), 28(1):112–118.
Verdecchia, R., Cruz, L., Sallou, J., Lin, M., Wickenden, J., and Hotellier, E. (2022). Data-centric green AI: An exploratory empirical study. In 2022 International Conference on ICT for Sustainability (ICT4S), pages 35–45, Los Alamitos, CA, USA. IEEE Computer Society.
Published
29/09/2025
How to Cite
MANGUSSI, Arthur Dantas; PEREIRA, Ricardo Cardoso; ABREU, Pedro Henriques; LORENA, Ana Carolina. Missing Data Under Green AI Umbrella. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1021-1032. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.14312.
