Extraction and enrichment of features to improve complaint text classification performance

  • Eduardo de Paiva (CGU)
  • Fernando Sola Pereira (CGU)

Abstract

In Brazil, citizens can file complaints about irregularities in the Public Administration. Classifying these complaints requires information that is not present in their texts. This article proposes a methodology for extracting and enriching the information identified in complaint texts. The methodology outputs a set of structured data capable of characterizing the complaints. To validate the proposal, a case study was carried out. The study showed that using the structured data improved complaint classification performance.

Published
29/11/2021
PAIVA, Eduardo de; PEREIRA, Fernando Sola. Extraction and enrichment of features to improve complaint text classification performance. In: ENCONTRO NACIONAL DE INTELIGÊNCIA ARTIFICIAL E COMPUTACIONAL (ENIAC), 18., 2021, Online Event. Proceedings [...]. Porto Alegre: Sociedade Brasileira de Computação, 2021. p. 338-349. ISSN 2763-9061. DOI: https://doi.org/10.5753/eniac.2021.18265.