Impact of Data Imputation on Air Quality Prediction: A Case Study in Congonhas-MG
Abstract
Diseases related to poor air quality affect a growing number of people each year and are associated with approximately 6.7 million deaths annually worldwide. Even so, few applications focus on predicting air quality to warn the population about imminent risks. The literature offers various machine-learning techniques for forecasting air quality, but these techniques typically require complete datasets, with no missing values, to be effective. This work investigates how imputing missing data affects air quality prediction in the city of Congonhas-MG. The results indicate that, although simple imputation methods exist and some algorithms can handle incomplete data directly, applying appropriate techniques to fill these gaps significantly improves the accuracy of air quality predictions. This enables more effective warnings to the population about the risks associated with exposure to air pollutants.
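As a rough illustration of why the choice of imputation method can matter, the sketch below fills artificial gaps in a synthetic hourly series using two simple strategies (mean substitution and linear interpolation) and compares the reconstruction error. Everything here is an assumption made for illustration: the synthetic signal, the gap positions, and the function names are hypothetical, and neither strategy is claimed to be the method evaluated in this study.

```python
import math

def mean_impute(series):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in series if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in series]

def linear_impute(series):
    """Fill gaps by linear interpolation between the nearest observed neighbors.

    Gaps at either edge are filled with the nearest observed value.
    """
    filled = list(series)
    known = [i for i, v in enumerate(series) if v is not None]
    for i, v in enumerate(series):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = series[right]
        elif right is None:
            filled[i] = series[left]
        else:
            weight = (i - left) / (right - left)
            filled[i] = series[left] + weight * (series[right] - series[left])
    return filled

def rmse(truth, estimate, indices):
    """Root mean squared error restricted to the given positions."""
    squared = sum((truth[i] - estimate[i]) ** 2 for i in indices)
    return math.sqrt(squared / len(indices))

# Hypothetical pollutant-like signal with a daily cycle (not Congonhas data).
truth = [50 + 20 * math.sin(2 * math.pi * hour / 24) for hour in range(72)]

# Two artificial gaps, standing in for sensor outages.
missing = list(range(10, 14)) + list(range(40, 46))
gappy = [None if i in missing else v for i, v in enumerate(truth)]

rmse_mean = rmse(truth, mean_impute(gappy), missing)
rmse_linear = rmse(truth, linear_impute(gappy), missing)

print(f"RMSE on gaps, mean imputation:      {rmse_mean:.2f}")
print(f"RMSE on gaps, linear interpolation: {rmse_linear:.2f}")
```

On this toy series the interpolated values track the daily cycle more closely than a constant mean does. Real monitoring data may behave differently, particularly when gaps are long or several pollutants are missing at once.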
