Improving automatic data extraction from financial statements with clustering analysis

Victor Ferraz; Gabriel Olivato; Igor Magollo; Murilo Naldi

doi:10.5753/kdmile.2020.11952

Victor Ferraz, Serasa Experian
Gabriel Olivato, UFSCar
Igor Magollo, UFSCar
Murilo Naldi, UFSCar

Abstract

Financial statement analysis is a fundamental part of the credit risk attribution process, since these documents are valuable sources of information about a company's economic and financial standing. Large volumes of such documents demand automatic data extraction, and the tools for that task are driven by locators. However, due to the lack of regulation, there is no standard layout for financial statements, which gives rise to a wide variety of document structures. This variety burdens the extraction tools and reduces their performance. Cluster analysis eases this burden by grouping documents with similar structure, which allows fine-tuned locators to be developed for each cluster based on its main characteristics; this is the main objective of this work. We applied the state-of-the-art clustering techniques RNG-HDBSCAN*, FOSC and MustaCHE to financial statement documents in order to assess their clusters and main structures, separate outliers, and analyze their main features. The result allows specialists to define proper locators for each cluster, increasing the performance of the data extraction tools.

Keywords: data science, clustering, feature extraction
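
The workflow described in the abstract (cluster the documents, set outliers aside, then inspect what each cluster has in common) can be illustrated with a minimal sketch. It is not the method used in this work: plain DBSCAN over Jaccard distances between token sets stands in for RNG-HDBSCAN* and FOSC, and the sample documents, the `eps` and `min_pts` values, and all function names are hypothetical.

```python
from collections import Counter


def jaccard_distance(a, b):
    """Return 1 - |a & b| / |a | b| for two token sets."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def dbscan(token_sets, eps, min_pts):
    """Plain DBSCAN (a stand-in, not RNG-HDBSCAN*); label -1 marks an outlier."""
    n = len(token_sets)
    # Precompute each document's eps-neighborhood (it includes the document itself).
    neighbors = [
        [j for j in range(n)
         if jaccard_distance(token_sets[i], token_sets[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_pts:
            labels[i] = -1  # provisional outlier; may become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            if len(neighbors[j]) >= min_pts:
                queue.extend(neighbors[j])  # j is a core point: keep expanding
    return labels


def common_tokens(token_sets, labels, top=3):
    """Most frequent tokens per cluster, as candidate anchors for a locator."""
    result = {}
    for c in sorted(set(labels) - {-1}):
        counts = Counter(
            t for s, lab in zip(token_sets, labels) if lab == c for t in s
        )
        # Sort by descending count, then alphabetically, for a stable order.
        ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        result[c] = [t for t, _ in ranked[:top]]
    return result


# Hypothetical, heavily simplified "documents" (real statements are far longer).
docs = [
    "balance sheet current assets current liabilities equity",
    "balance sheet current assets liabilities equity total",
    "balance sheet assets liabilities equity",
    "income statement revenue costs net income",
    "income statement revenue expenses net income",
    "income statement revenue costs expenses net income",
    "auditor opinion letter",
]
token_sets = [set(d.split()) for d in docs]

labels = dbscan(token_sets, eps=0.4, min_pts=3)
anchors = common_tokens(token_sets, labels)
print(labels)
print(anchors)
```

In this toy setting the two statement layouts form separate clusters and the unrelated letter is flagged as an outlier. A hierarchical density-based method, such as the ones used in this work, avoids committing to a single `eps` value, which is the main limitation of this simplified stand-in.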

References

Araujo Neto, A. C., Nascimento, M. A., Sander, J., and Campello, R. J. G. B. Mustache: A multiple clustering hierarchies explorer. Proc. VLDB Endow. 11 (12): 2058–2061, Aug., 2018.

Araujo Neto, A. C., Sander, J., Campello, R., and Nascimento, M. Efficient computation and visualization of multiple density-based clustering hierarchies. IEEE Transactions on Knowledge and Data Engineering, 2019.

Assaf Neto, A. Estrutura e análise de balanços: um enfoque econômico-financeiro. Atlas, 2020.

Banco Central do Brasil. Diagnóstico da convergência às Normas Internacionais: IAS 1 - Presentation of financial statements. Banco Central do Brasil, Brasília/DF, 2010.

Brasil. Lei nº 6.404, de 15 de dezembro de 1976. Diário Oficial da União, 1976.

Brasil. Lei nº 11.638, de 28 de dezembro de 2007. Diário Oficial da União, 2007.

Brasil. Lei nº 11.941, de 27 de maio de 2009. Diário Oficial da União, 2009.

Campello, R. J. G. B., Moulavi, D., and Sander, J. Density-based clustering based on hierarchical density estimates. In PAKDD (2), J. Pei, V. S. Tseng, L. Cao, H. Motoda, and G. Xu (Eds.). Lecture Notes in Computer Science, vol. 7819. Springer, pp. 160–172, 2013.

Campello, R. J. G. B., Moulavi, D., Zimek, A., and Sander, J. A framework for semi-supervised and unsupervised optimal extraction of clusters from hierarchies. Data Mining and Knowledge Discovery 27 (3): 344–371, Nov, 2013.

Comitê de Pronunciamentos Contábeis. Pronunciamento Técnico CPC 21 (R1): Demonstração intermediária: Correlação às Normas Internacionais de Contabilidade - IAS 34 (IASB - BV 2011). Comitê de Pronunciamentos Contábeis, Brasília/DF, 2011a.

Comitê de Pronunciamentos Contábeis. Pronunciamento Técnico CPC 26 (R1): Apresentação das demonstrações contábeis: Correlação às Normas Internacionais de Contabilidade - IAS 1 (IASB - BV 2011). Comitê de Pronunciamentos Contábeis, Brasília/DF, 2011b.

Jain, A. K. Data clustering: 50 years beyond k-means. Pattern Recognition Letters 31 (8): 651–666, 2010. Award winning papers from the 19th International Conference on Pattern Recognition (ICPR).

Johnson, D., Xiong, C., Gao, J., and Corso, J. Comprehensive cross-hierarchy cluster agreement evaluation, 2013.

Madeira, R. d. O. C. Aplicação de técnicas de mineração de texto na detecção de discrepâncias em documentos fiscais. M.S. thesis, Fundação Getúlio Vargas, Rio de Janeiro/RJ, 2015.

Moura, M. F. Proposta de utilização de mineração de textos para seleção, classificação e qualificação de documentos. Tech. rep., Embrapa Informática Agropecuária. Dec., 2004.

Snow, M. Unsupervised document clustering with cluster topic identification. Tech. rep., Office for National Statistics. Apr., 2018.