Impact of Data Imputation on Air Quality Prediction: A Case Study in Congonhas-MG

Abstract


The number of people affected by diseases related to poor air quality has increased significantly over the years, and air pollution now accounts for approximately 6.7 million deaths annually worldwide. However, there is still a lack of applications dedicated to predicting air quality and warning the population about imminent risks. The literature presents various machine-learning techniques that can be used to forecast air quality, but for these techniques to be effective the datasets must be complete, with no missing values. This work investigates the impact of imputing missing data on air quality prediction in the city of Congonhas-MG. The results indicate that, although simple imputation methods exist, as do algorithms capable of handling incomplete data directly, applying appropriate techniques to fill these gaps significantly improves the accuracy of air quality predictions. This enables more efficient warnings to the population about the risks associated with exposure to air pollutants.

Keywords: Data Analysis and Mining for Urban Environments, Smart Cities, Urban Computing for Economic Development, Urban Computing for Public Protection and Safety, Anomaly Detection and Event Discovery in Urban Areas, E-Health and m-Health, Green Computing in Urban Environments, Urban Sensing Infrastructures, Internet of Things, Improving Quality of Life in the City Using Mobile Services and Big Data, Environmental Protection with Urban Computing, Participatory/Opportunistic Sensing, Urban Data Visualization

Published: 2025-05-19
S. SILVA, João A.; D. CUNHA, Felipe. Impact of Data Imputation on Air Quality Prediction: A Case Study in Congonhas-MG. In: URBAN COMPUTING WORKSHOP (COURB), 9., 2025, Natal/RN. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 71-84. ISSN 2595-2706. DOI: https://doi.org/10.5753/courb.2025.8700.