From Text to Barcode: Inferring Product Identifiers in Electronic Invoices with Missing Information
Abstract
Electronic invoices are valuable sources of information for public administration, enabling price monitoring, fraud detection, and greater transparency. A recurring challenge, however, is the absence or inconsistency of structured product identifiers, such as barcodes (GTINs), which are essential for comparing products across transactions. Often, only short, noisy, and unstandardized textual descriptions are available. This work proposes a hybrid strategy for identifier inference that combines high-precision string matching with interpretable machine learning models based on vectorized text representations. Results show that string matching is accurate but limited in coverage, while supervised classifiers expand coverage effectively, especially when using character-level n-grams. The proposed approach is integrated into an open-source fiscal mining tool and relies on simple, efficient methods suitable for large-scale data processing.
References
Araújo, L., Behr, A., and Schiavi, G. S. (2023). Adoção de business analytics na contabilidade. Revista Contabilidade e Finanças – USP, 34(93):e1771.
Breiman, L. (2001). Random forests. Machine Learning, 45(1):5–32.
Chen, M., Mao, S., and Liu, Y. (2014). Big data: A survey. Mobile Networks and Applications, 19(2):171–209.
Cover, T. and Hart, P. (1967). Nearest neighbor pattern classification. IEEE Transactions on Information Theory, 13(1):21–27.
da Cunha Panis, A., da Silva Filho Isidro, A., de Oliveira Carneiro, D. K., Montezano, L., Junior, P. C. R., and Sano, H. (2022). Inovação em compras públicas: Atividades e resultados no caso do robô Alice da Controladoria-Geral da União. Cadernos Gestão Pública e Cidadania, 27(86):e83111.
de Angeli Neto, H. and Martinez, A. L. (2016). Nota fiscal de serviços eletrônica: uma análise dos impactos na arrecadação em municípios brasileiros. Revista de Contabilidade e Organizações, 10(26):49–62.
Domingos, P. and Pazzani, M. (1997). On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2-3):103–130.
Fan, J., Han, F., and Liu, H. (2014). Challenges of big data analysis. National Science Review, 1(2):293–314.
Fernández-Delgado, M., Cernadas, E., Barro, S., and Amorim, D. (2014). Do we need hundreds of classifiers to solve real world classification problems? Journal of Machine Learning Research, 15(1):3133–3181.
Gandomi, A. and Haider, M. (2015). Beyond the hype: Big data concepts, methods, and analytics. International Journal of Information Management, 35(2):137–144.
Han, J., Kamber, M., and Pei, J. (2011). Data Mining: Concepts and Techniques. Morgan Kaufmann, 3rd edition.
Hastie, T., Tibshirani, R., and Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition.
OECD (2017). Technology Tools to Tackle Tax Evasion and Tax Fraud. OECD Publishing.
Rahm, E. and Do, H. H. (2000). Data cleaning: Problems and current approaches. IEEE Data Engineering Bulletin, 23(4):3–13.
Xie, J., Sun, L., and Zhao, Y. F. (2025). On the data quality and imbalance in machine learning-based design and manufacturing—a systematic review. Engineering, 45:105–131.
Zhang, H. (2004). The optimality of naive Bayes. In Proceedings of the Seventeenth International Florida Artificial Intelligence Research Society Conference, pages 562–567.
Zhang, Z. (2022). Review on string-matching algorithm. In Proceedings of the 2022 International Conference on Science and Technology Ethics and Human Future (STEHF 2022), volume 144 of SHS Web of Conferences, pages 1–6. EDP Sciences.
Published
2025-09-29
How to Cite
LEMOS, Carlos Filipe de Castro; SANTOS, Brucce Neves dos; MARCACINI, Ricardo Marcondes. From Text to Barcode: Inferring Product Identifiers in Electronic Invoices with Missing Information. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 22., 2025, Fortaleza/CE. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2025. p. 1467-1478. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2025.12392.
