Evaluating the Impact of Pre-clustering and Class Imbalance on Solar Flare Forecasting

  • Mirelle C. Bueno UNICAMP
  • Guilherme P. Coelho UNICAMP
  • Ana Estela A. da Silva UNICAMP
  • André L. S. Gradvohl UNICAMP


Among the phenomena that occur on the surface of the Sun, solar flares can cause considerable damage, from short circuits in power transmission lines to complete interruptions of telecommunications systems. To mitigate these effects, many works have proposed mechanisms capable of predicting the occurrence of solar flares. In this context, the present work evaluates two aspects of machine learning-based solar flare forecasting: (i) the impact of class imbalance in the training datasets on the performance of the predictors; and (ii) whether adding a pre-clustering step before training the classifiers leads to better predictions.
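To make aspect (i) concrete, the sketch below shows why class imbalance matters and one common way to reduce it, random undersampling of the majority class. It is only an illustration under stated assumptions: the abstract does not say which balancing technique the authors used, and the toy data, the `undersample` helper, and the 95/5 class split are all hypothetical.

```python
import random
from collections import Counter


def undersample(samples, labels, seed=0):
    """Randomly drop samples until every class is as large as the rarest one.

    Hypothetical helper for illustration; not the method of the paper.
    """
    rng = random.Random(seed)
    # Group the samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)
    # The rarest class sets the size of every class after balancing.
    n_min = min(len(xs) for xs in by_class.values())
    balanced = []
    for y, xs in by_class.items():
        for x in rng.sample(xs, n_min):
            balanced.append((x, y))
    rng.shuffle(balanced)
    return [x for x, _ in balanced], [y for _, y in balanced]


# Hypothetical toy data: 95 "no flare" days (0) and 5 "flare" days (1).
features = [[float(i)] for i in range(100)]
labels = [0] * 95 + [1] * 5

# A predictor that always answers "no flare" scores high accuracy on the
# imbalanced set while never detecting a single flare.
always_no_flare_accuracy = sum(1 for y in labels if y == 0) / len(labels)

balanced_features, balanced_labels = undersample(features, labels)
balanced_counts = Counter(balanced_labels)
```

On the balanced set both classes have the same number of samples, so a trivial "no flare" predictor can no longer reach high accuracy. The price is that most majority-class samples are discarded, which is one reason cluster-based sampling is sometimes considered instead.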



BUENO, Mirelle C.; COELHO, Guilherme P.; DA SILVA, Ana Estela A.; GRADVOHL, André L. S. Evaluating the Impact of Pre-clustering and Class Imbalance on Solar Flare Forecasting. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 15., 2018, São Paulo. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2018. p. 485-496. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2018.4441.