Clustering of invoice items related to similar products
Abstract
Applications such as investigating prices charged in public purchases and possible irregularities require identifying invoice items that refer to the same product. This is challenging because the textual product descriptions in the items are not standardized. This article proposes and compares four methods for grouping electronic invoice items using topic modeling techniques together with data such as the measurement unit and the NCM (Mercosur Common Nomenclature) code. The results indicate that the proposal can group products with relatively simple descriptions and has the potential to assist in grouping items with more varied descriptions.
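As a rough illustration of the general idea, and not the article's actual methods, the sketch below first blocks items by NCM code and measurement unit and then greedily clusters the descriptions inside each block by token overlap (Jaccard similarity). The token-overlap step is a simple stand-in for topic modeling; the field names `ncm`, `unit`, and `description`, the threshold, and the sample values are all assumptions made for this example.

```python
from collections import defaultdict
import re
import unicodedata


def tokens(description):
    """Lowercase, strip accents, and split a description into a set of tokens."""
    text = unicodedata.normalize("NFKD", description.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return set(re.findall(r"[a-z0-9]+", text))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_items(items, threshold=0.5):
    """Block items by (NCM code, unit), then greedily cluster descriptions.

    `items` is a list of dicts with the hypothetical keys
    "ncm", "unit", and "description". Returns a list of clusters,
    each a list of indices into `items`.
    """
    blocks = defaultdict(list)
    for idx, item in enumerate(items):
        blocks[(item["ncm"], item["unit"].upper())].append(idx)

    clusters = []
    for indices in blocks.values():
        block_clusters = []  # pairs of (representative token set, member indices)
        for idx in indices:
            toks = tokens(items[idx]["description"])
            for rep, members in block_clusters:
                if jaccard(toks, rep) >= threshold:
                    members.append(idx)
                    break
            else:
                # No existing cluster is similar enough: start a new one.
                block_clusters.append((toks, [idx]))
        clusters.extend(members for _, members in block_clusters)
    return clusters


# Made-up sample items, for illustration only.
items = [
    {"ncm": "30049099", "unit": "CX", "description": "Dipirona sódica 500 mg comprimido"},
    {"ncm": "30049099", "unit": "cx", "description": "DIPIRONA SODICA 500 MG COMP"},
    {"ncm": "30049099", "unit": "FR", "description": "Dipirona sódica 500 mg/ml gotas"},
    {"ncm": "48025610", "unit": "PCT", "description": "Papel A4 75g branco"},
]
clusters = cluster_items(items)
```

Blocking on NCM code and unit keeps items with different units or nomenclature codes apart even when their descriptions look alike, which is why the third sample item ends up in its own cluster in this sketch.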
Keywords:
Machine learning for fraud and corruption detection, Natural Language Processing in public documents for government monitoring and transparency, Tools for fraud and corruption analysis and investigation, Methods for analyzing electronic invoices, Clustering
References
Angelov, D. (2020). Top2Vec: Distributed representations of topics.
Brasil (2021). Lei nº 14.133, de 1º de abril de 2021. Lei de Licitações e Contratos Administrativos.
Brinkmann, A., Baumann, N., and Bizer, C. (2024). Using LLMs for the extraction and normalization of product attribute values.
Kieckbusch, D. S. (2022). Scan-nf: a machine learning system for invoice product transaction classification through short-text processing. Master’s thesis, University of Brasília (UnB).
Krieger, F., Drews, P., and Funk, B. (2023). Automated invoice processing: Machine learning-based information extraction for long tail suppliers. Intelligent Systems with Applications, 20:200285.
Novaes, L. P., Vianna, D., and da Silva, A. (2023). Modelagem de tópicos para a tarefa de recuperação de casos legais. In Anais do XXXVIII Simpósio Brasileiro de Bancos de Dados, pages 128–140, Porto Alegre, RS, Brasil. SBC.
Paalman, J., Mullick, S., Zervanou, K., and Zhang, Y. (2019). Term based semantic clusters for very short text classification. In Mitkov, R. and Angelova, G., editors, Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 878–887, Varna, Bulgaria. INCOMA Ltd.
Silva, M. O., Costa, L. L., de Barros Bezerra, G. F., Gomide, L. D., Hott, H. R., Oliveira, G. P., Brandão, M. A., Lacerda, A., and Pappa, G. L. (2023). Análise de sobrepreço em itens de licitações públicas. Anais do XI Workshop de Computação Aplicada em Governo Eletrônico (WCGE 2023).
Yao, X., Sun, H., Li, S., and Lu, W. (2022). Invoice detection and recognition system based on deep learning. Security and Communication Networks, 2022(1):8032726.
Published
2024-10-14
How to Cite
DA SILVA, João Pedro D.; SOARES, Diogo; ZIBETTI, Andre Wüst; M. DOS SANTOS, Matheus; FILETO, Renato; WERNER, Simone Silmara. Clustering of invoice items related to similar products. In: WORKSHOP ON DATA SCIENCE AGAINST CORRUPTION IN THE PUBLIC SECTOR (DS-COPS) - BRAZILIAN SYMPOSIUM ON DATABASES (SBBD), 39., 2024, Florianópolis/SC. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2024. p. 273-279. DOI: https://doi.org/10.5753/sbbd_estendido.2024.244219.
