Feature engineering vs. extraction: clustering Brazilian municipalities through spatial panel agricultural data via autoencoders
Abstract
This article compares two approaches to clustering Brazilian municipalities according to their agricultural diversity: one based on feature engineering and the other on feature extraction using deep learning with autoencoders, with cluster analysis performed by k-means and Self-Organizing Maps. The analyses were conducted on panel data from IBGE’s annual estimates of Brazilian agricultural production between 1999 and 2018. Different structures of simple stacked undercomplete autoencoders were analyzed, varying the number of layers and the number of neurons in each layer, including the latent layer. The asymmetric exponential linear loss function was also evaluated as a way to cope with the sparse data. The results show that, measured against the adopted ground truth, the autoencoder combined with k-means outperformed k-means applied to the raw data, demonstrating that the latent layer of a simple autoencoder can capture important features of the data. Although the overall accuracy is low, the results are promising, considering that we evaluated the simplest strategy for deep clustering.
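The abstract names an asymmetric exponential linear loss without giving its form. The following is a minimal sketch, assuming it refers to the LINEX loss, L(e) = b(exp(a·e) − a·e − 1), where e is the reconstruction error. The function name `linex_loss`, the parameters `a` and `b`, and the sign convention e = prediction − target are illustrative assumptions, not the configuration used in the study.

```python
import math


def linex_loss(y_true, y_pred, a=1.0, b=1.0):
    """Mean LINEX loss over paired values (illustrative sketch).

    Per-element loss: b * (exp(a * e) - a * e - 1), with e = y_pred - y_true.
    For a > 0, overestimation (e > 0) is penalized roughly exponentially and
    underestimation (e < 0) roughly linearly; a < 0 reverses the asymmetry.
    The values of a and b here are assumptions, not taken from the article.
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("inputs must not be empty")
    total = 0.0
    for target, prediction in zip(y_true, y_pred):
        error = prediction - target
        total += b * (math.exp(a * error) - a * error - 1.0)
    return total / len(y_true)


# Same absolute error, opposite signs: the two penalties differ.
over = linex_loss([0.0], [1.0])    # prediction above a zero target
under = linex_loss([0.0], [-1.0])  # prediction below a zero target
```

With `a > 0`, `over` is larger than `under`, which is the asymmetry that could make such a loss useful when most target values are zero, as in sparse production data. Which direction of error the study penalized more heavily is not stated in the abstract.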