A Four-Step Cascade Methodology to Classify MCN Codes Using NLP Techniques

  • Pedro Pinheiro UFPA
  • Luan Siqueira UFPA
  • Marcos Amaris UFPA

Abstract


The Mercosur Common Nomenclature (MCN) is a regional nomenclature for categorizing goods, adopted by the Mercosur countries. Each product is identified by an 8-digit code organized into 4 parts: Chapter, Heading, Subheading, and Item/Subitem. Because codes are assigned manually, there are indications that about 30% of the goods shipped globally carry the wrong code. This work develops a process to classify the textual product descriptions found in Electronic Invoices (NF-e). The classification uses Natural Language Processing (NLP) techniques and was tested with 2 machine learning algorithms, Support Vector Machine (SVM) and Naive Bayes. A database of 340,000 distinct products was used in the experiments. We divided the process into 4 classification models, one for each of the 4 parts of the MCN code. With the data split into 80% for training and 20% for testing, we obtained an accuracy of 89% over the 98 classes formed by the first 2 digits, and 76% when using a cascade technique to classify all 8 digits.
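The cascade idea can be illustrated with a minimal sketch. It is one plausible reading of the four-step process, not the authors' implementation. It assumes each of the 4 parts is 2 digits wide, and that each step predicts the next 2 digits using a model trained only on products sharing the prefix predicted so far. `TinyNB` is a hypothetical stand-in for the Naive Bayes and SVM classifiers used in the paper; `split_code`, `Cascade`, and the toy descriptions and codes are illustrative only.

```python
import math
from collections import Counter, defaultdict

PREFIX_ENDS = (2, 4, 6, 8)  # assumed cumulative widths: chapter, heading, subheading, item/subitem


def split_code(code):
    """Split an 8-digit code into four 2-digit parts (assumed layout)."""
    if len(code) != 8 or not code.isdigit():
        raise ValueError("expected an 8-digit code")
    return tuple(code[i:i + 2] for i in range(0, 8, 2))


class TinyNB:
    """Multinomial Naive Bayes with add-one smoothing over whitespace tokens."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        tokens = text.lower().split()
        total = sum(self.class_counts.values())

        def score(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            log_prob = math.log(self.class_counts[label] / total)
            for token in tokens:
                log_prob += math.log((counts[token] + 1) / denom)
            return log_prob

        return max(self.class_counts, key=score)


class Cascade:
    """One model per already-known prefix; each predicts the next 2 digits."""

    def fit(self, texts, codes):
        groups = defaultdict(lambda: ([], []))
        for text, code in zip(texts, codes):
            for end in PREFIX_ENDS:
                xs, ys = groups[code[:end - 2]]   # parent prefix ("" for step 1)
                xs.append(text)
                ys.append(code[end - 2:end])      # the 2 digits to predict
        self.models = {p: TinyNB().fit(xs, ys) for p, (xs, ys) in groups.items()}
        return self

    def predict(self, text):
        prefix = ""
        for _ in PREFIX_ENDS:
            # Every predicted prefix was seen in training, so the lookup succeeds.
            prefix += self.models[prefix].predict(text)
        return prefix


# Toy training data: descriptions and codes are illustrative, not taken from the paper.
texts = [
    "cafe torrado moido 500g",
    "cafe cru em grao",
    "cerveja pilsen lata 350ml",
    "refrigerante cola lata 350ml",
]
codes = ["09012100", "09011110", "22030000", "22021000"]

model = Cascade().fit(texts, codes)
print(split_code("09012100"))
print(model.predict("cafe torrado"))
print(model.predict("cerveja lata"))
```

Training one model per prefix keeps each step's label set small, at the cost that an error in an early step cannot be corrected later, which is consistent with the 8-digit accuracy being lower than the 2-digit accuracy.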

Keywords: Natural Language Processing, Machine Learning, Text Classification, Mercosul Common Nomenclature

Published
2022-11-28
PINHEIRO, Pedro; SIQUEIRA, Luan; AMARIS, Marcos. A Four-Step Cascade Methodology to Classify MCN Codes Using NLP Techniques. In: NATIONAL MEETING ON ARTIFICIAL AND COMPUTATIONAL INTELLIGENCE (ENIAC), 19., 2022, Campinas/SP. Anais [...]. Porto Alegre: Sociedade Brasileira de Computação, 2022. p. 389-400. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2022.227652.
