Applying clustering algorithms to find inconsistencies in fiscal databases (Aplicando algoritmos de clusterização para encontrar inconsistências em bases de dados fiscais)
Abstract
Advances in Geographic Information Systems and digital governance have enabled many cities to implement digital, geocoded property databases. However, property registers often contain conflicting information, because decades-old data was fed automatically into digital systems and remains in conflict with newer, more standardized registers. Such is the case of the property database of Fortaleza, on which this study is based. An estimated 2048 registers on apartment buildings are currently inconsistent and require cleaning or normalization. This paper shows how clustering algorithms can help find inconsistencies in property registries.
Published
2023-08-06
How to Cite
QUEIROZ, Virginia; FURTADO, Lara Sucupira; PINHEIRO, Vládia Celia. Aplicando algoritmos de clusterização para encontrar inconsistências em bases de dados fiscais. In: BRAZILIAN WORKSHOP ON ARTIFICIAL INTELLIGENCE IN FINANCE (BWAIF), 2., 2023, João Pessoa/PB. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2023. p. 120-131. DOI: https://doi.org/10.5753/bwaif.2023.230762.
