Identifying Finest Machine Learning Algorithm for Climate Data Imputation in the State of Minas Gerais, Brazil

Authors

  • Lucas O. Bayma, Universidade Federal de São João Del Rei
  • Marconi A. Pereira, Universidade Federal de São João Del Rei

DOI:

https://doi.org/10.5753/jidm.2018.2044

Abstract

Climate prediction is an important activity for society, and a successful forecast depends on a good historical database. However, for several reasons, many meteorological stations have large gaps in their historical records, and studies on how to estimate these missing weather values are still scarce. This work describes a study that combines several machine learning techniques to estimate missing climatic values. It extends our previous work by producing a computational framework built on three methods: neural networks, bagged regression trees, and random forest. An in-depth data analysis and a statistical study are conducted to compare the three methods. The study demonstrated statistically that the random forest technique was successful in estimating missing climatic values for the state of Minas Gerais; it can therefore be widely used by the responsible agencies to improve their historical databases and, consequently, their climate forecasts.
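To make the idea of tree-based imputation concrete, the sketch below fills a gap in one station's temperature record from two neighbouring stations using bagged regression trees, one of the three methods named in the abstract. It is not the authors' framework: the data are synthetic, the station layout, the function names (`fit_tree`, `fit_bagged`, `predict_bagged`) and all parameter values are assumptions chosen for illustration, and mean absolute error is used only as a convenient score. A random forest would differ mainly by also sampling a random subset of features at each split.

```python
import random

def avg(values):
    return sum(values) / len(values)

def fit_tree(rows, targets, depth=0, max_depth=3, min_leaf=5):
    """Fit a small regression tree; a leaf is a float, a node is a tuple."""
    if depth >= max_depth or len(rows) < 2 * min_leaf:
        return avg(targets)
    best = None  # (sum of squared errors, feature index, threshold)
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, targets) if r[f] <= t]
            right = [y for r, y in zip(rows, targets) if r[f] > t]
            if len(left) < min_leaf or len(right) < min_leaf:
                continue
            ml, mr = avg(left), avg(right)
            sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
            if best is None or sse < best[0]:
                best = (sse, f, t)
    if best is None:
        return avg(targets)
    _, f, t = best
    li = [i for i, r in enumerate(rows) if r[f] <= t]
    ri = [i for i, r in enumerate(rows) if r[f] > t]
    return (f, t,
            fit_tree([rows[i] for i in li], [targets[i] for i in li],
                     depth + 1, max_depth, min_leaf),
            fit_tree([rows[i] for i in ri], [targets[i] for i in ri],
                     depth + 1, max_depth, min_leaf))

def predict_tree(tree, row):
    while isinstance(tree, tuple):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree

def fit_bagged(rows, targets, n_trees=15, seed=0):
    """Bagging: each tree is fitted on a bootstrap resample of the data."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        forest.append(fit_tree([rows[i] for i in idx], [targets[i] for i in idx]))
    return forest

def predict_bagged(forest, row):
    return avg([predict_tree(tree, row) for tree in forest])

# Synthetic example (assumption): the target station's temperature depends
# on two neighbouring stations plus noise.
rng = random.Random(42)
rows, targets = [], []
for _ in range(120):
    a = rng.uniform(15, 30)        # neighbouring station A, degrees Celsius
    b = a + rng.gauss(0, 1.5)      # neighbouring station B
    rows.append((a, b))
    targets.append(0.6 * a + 0.4 * b - 1.0 + rng.gauss(0, 0.5))

# Treat the last 20 target readings as a gap to be imputed.
train_x, train_y = rows[:100], targets[:100]
gap_x, gap_y = rows[100:], targets[100:]

forest = fit_bagged(train_x, train_y)
imputed = [predict_bagged(forest, r) for r in gap_x]

mae = avg([abs(p - y) for p, y in zip(imputed, gap_y)])
baseline_mae = avg([abs(avg(train_y) - y) for y in gap_y])
print(f"bagged trees MAE: {mae:.2f}  mean-fill MAE: {baseline_mae:.2f}")
```

The comparison against filling the gap with the training mean is only a sanity check; a study like the one described would compare the candidate methods against each other on real station data and test the differences statistically.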

Published

2018-12-30

How to Cite

Bayma, L. O., & Pereira, M. A. (2018). Identifying Finest Machine Learning Algorithm for Climate Data Imputation in the State of Minas Gerais, Brazil. Journal of Information and Data Management, 9(3), 259. https://doi.org/10.5753/jidm.2018.2044

Section

GEOINFO2017