Improving automatic data extraction from financial statements with clustering analysis

  • Victor Ferraz, Serasa Experian
  • Gabriel Olivato, UFSCar
  • Igor Magollo, UFSCar
  • Murilo Naldi, UFSCar


Financial statement analysis is a fundamental part of the credit risk attribution process, since financial statements are valuable sources of information about companies' economic and financial health. The large volume of such documents demands automatic data extraction, and the tools for that task are driven by locators. However, due to the lack of regulation, there is no standard layout for these documents, which gives rise to a variety of document structures. This variety burdens the extraction tools and reduces their performance. Clustering analysis addresses this burden by finding the best document clusters, which allows fine-tuned locators to be developed for each cluster based on its main characteristics; this is the main objective of this work. We applied the state-of-the-art clustering techniques RNG-HDBSCAN*, FOSC, and MustaCHE to financial statement documents to assess their clusters and main structures, separate outliers, and analyze their main features. The results allow specialists to define proper locators for each cluster, increasing the performance of the data extraction tools.

Keywords: data science, clustering, feature extraction


